| |
|
|
||||
![]() |
||||||
| |
|
|||||
|
Procedural Rendering on Playstation 2 TenThings Nobody Told You About PS2 Before we start translating this Lifeform algorithm into a PS2 friendly design, I’d like to cover some more about the PS2. Later we’ll use these insights to re-order and tweak the algorithm to get some speed out of the machine. 1. You
must design before coding. 2. The
compiler doesn’t make things easy for you. Additionally the inline assembler doesn’t have enough information to make good decisions about uploading and downloading VU0 Macro Mode registers and generating broadcast instructions on vector data types. There is a patch for the ee-gcc compiler by Tyler Daniel and Dylan Cuthbert that update the inline assembler to add register naming, access to broadcast fields and new register types which are used to good effect in our C++ Vector classes. It’s by no means perfect as you’re still limited to only 10 input and output registers, but it’s a significant advance. 3. All
the hardware is memory mapped. Much of this confusion comes from the lack of a detailed memory map of the PS2 in the documentation. Understandably, the designers were reticent to provide one as the machine was in flux at the time of writing (the memory layout is completely reconfigurable by the kernel at boot time) and they were scared of giving programmers bad information. Let’s change this. All outboard registers and memory areas are freely accessible at fixed addresses. Digging through the headers you will come across a header called eeregs.h that holds the key. In here are the hard-coded addresses of most of the internals of the machine. First a note here about future proofing your programs. Accessing these registers directly in final production code is not advisable as it’s fully possible that the memory map could change with future versions of the PS2. These techniques are only outlined here for tinkering around and learning the system so you can prove to yourself there’s no magic here. Once you have grokked how the PS2 and the standard library functions work, it’s safest to stick to using the libraries. Let’s take a look at a few of the values in the header and see what they mean: #define
VU1_MICRO ((volatile u_long *)(0xNNNNNNNN))
These two addresses are the start addresses of VU1 program and data memory if VU1 is not currently caclulating. Most tutorials paint VU1 as “far away”, a hands off device that’s unforgiving if you get a single instruction wrong and consequently hard to debug. Sure, the memory is unavailable if VU1 is running a program, but using these addresses you can dump the contents before and after running VU programs. Couple this knowledge with the DMA Disassembler and VCL, the vector code compiler, and VU programming without expensive proprietary tools and debuggers is not quite as scary as it seems.
If you have only read the SCE libraries you may be under the impression that “Getting a DMA Channel” is an arcane and complicated process requiring a whole function call. Far from it. The DMA channels are not genuine programming abstractions, in reality they’re just a bank of memory mapped registers. The entries in the structure sceDmaChan map direc tly onto these addresses like a cookie cutter. #define GIF_FIFO ((volatile u_long128 *)(0xNNNNNNNN)) The GIF FIFO is the doorway into the Graphics Synthesizer. You push qwords in here one after another and the GS generates polygons - simple as that. No need to use DMA to get your first program working, just program up a GIF Tag with some data and stuff it into this address. This leads me to my favorite insight into the PS2… 4. The
DMAC is just a Pigeon Hole Stuffer. 5. Myth:
VU code is hard. VCL (Vector Command Line, as opposed to the interactive graphic version) is a tool that preprocesses a single stream of VU code (no paired instructions necessary), analyses it for loop blocks and control flow, pairs and rearranges instructions, opens loops and interleaves the result to give pretty efficient code. For example, take this simplest of VU programs that takes a block of vectors and in-place matrix multiplies them by a fixed matrix, divides by W and integerizes the value: ; test.vcl VCL takes the source code, pairs the instructions and unwrap the loop to this seven instruction inner loop (with entry and exit blocks not shown):
6. Myth:
Synchronization is complicated. The first point to make is that complicated as general purpose synchronization is, when we are rendering to screen we are dealing with a more limited problem: you only need to keep things in sync once a frame. All your automatic processes can kick off and be fighting for resources during a frame, but as soon as you reach the end of rendering the frame then everything must be finished. You are only dealing with short bursts of synchronization. The PS2 has three main systems for synchronization:
This whole area is worthy of a paper in itself as much of this information is spread around the documentation. Breaking the problem down into these three areas sheds allows you to grok the whole system. Briefly summarizing: Within the EE Core we have sync.l and sync.e instructions that guarantee that results are finished before continuing with execution. Between the EE Core and external devices (VIF, GIF, DMAC, etc) we have a variety of tools. Many events can generate interrupts upon completion, the VIF has a mark instruction that sets the value of a register that can be read by the EE Core allowing the EE Core to know that a certain point has been reached in a DMA stream and we have the memory mapped registers that contain status bits that can be polled. Between external devices there is a well defined set of priorities that cause execution orders to be well defined. The VIF can also be forced to wait using flush, flushe and flusha instructions. These are the main ones we’ll be using in this tutorial. 7. Myth:
Scratchpad is for speed. You could think of scratchpad as a fast area of memory, like the original PSX, but real world timings show that it’s not that much faster than Uncached Accelerated memory for sequential work or in-cache data for random work. The best way to think of SPR is as a place to work while the data bus is busy - something like a playground surrounded by roads with heavy traffic. Picture this: Your program has just kicked off a huge DMA chain of events that will automatically upload and execute VU programs and move information through the system. The DMAC is moving information from unit to unit over the Memory Bus in 8-qword chunks, checking for interruptions every tick and CPU has precedence. The last thing the DMAC needs is to be interrupted every 8 clock cycles with the CPU needing to use the bus for more data. This is why the designers gave you an area of memory to play with while this happens. Sure, the Instruction and Data caches play their part but they are primarily there to aid throughput of instructions. Scratchpad is there to keep you off the data bus – use it to batch up memory writes and move the data to main memory using burst-mode DMA transfers using the fromSPR DMA channel. 8. There
is no such thing as “The Pipeline”. As each renderer is less than 16KB of VU code they are very cheap to upload compared to the amount of polygon data they will be generating. Program uploads can be embedded inside DMA chains to complete the automation process, e.g.
9. Speed
is all about the Bus. 150 million verts per second = 2.5 million verts / frame at 60Hz Given that each of these polygons will be flat shaded the result isn’t very interesting. We will need to factor in a perspective transform, clipping and lighting which are done on the VUs, which run at 300MHz. The PS2 FAQ says these operations can take 15 – 20 cycles per vertex typically, giving us a throughput of:
5 million verts / 20 cycles per vertex Notice the difference here. Just by removing five cycles per vertex we get a huge increase in output. This is the reason we need different renderers for every situation – each renderer can shave off precious cycles-per-vertex by doing only the work necessary. This is also the reason we have two VUs – often VU1 is often described as the “rendering” VU and VU0 as the “everything else” renderer, but this is not necessarily so. Both can be transforming vertices but only one can be feeding the GIF, and this explains the Memory FIFO you can set up: one VU is feeding the GS while the other is filling the FIFO. It also explains why we have two rendering contexts in the GS, one for each of the two input streams. 10. There
are new tools to help you. DMA Disassembler. This tool, from SCEE’s James Russell, takes a completes DMA packet, parses it and generates a printout of how the machine will interpret the data block when it is sent. It can report errors in the chain and provides an excellent visual report of your DMA chain. Packet Libraries. Built by Tyler Daniel, this set of classes allows easy construction of DMA packets, either at fixed locations in memory or in dynamically allocated buffers. The packet classes are styled after insertion-only STL containers and know how to add VIF tags, create all types of DMA packet and will calculate qword counts for you.
Vector Libraries and GCC patch. The GCC inline assembler patch adds a number of new features to the inline assembler:
The patch is not perfect as there is still a limit to 10 input and output registers per section of inline assembly, and that can be a little painful at times (i.e. three operand 4x4 matrix operations like a = b * c take 12 registers to declare), but it is at least an improvement. The Matrix and Vector classes showcase the GCC assembler patch, providing a set of template classes that produce fairly optimized results, plus being way easier to write and alter to your needs: vec_x dot( const vec_xyz rhs ) const VU Command Line preprocessor. As mentioned earlier, one of the newest tools to aid PS2 programming is VCL, the vector code optimizing preprocessor. It takes a single stream of VU instructions and:
No more writing VU code in Excel! It outputs pretty well optimized results that can be used as a starting point for hand coding (It can also be run on already existing code to see if any improvements can be made). VCL is not that intelligent yet (it will happily optimize complete rubbish). For the best results it’s worth learning how to code in a VCL friendly style, e.g.:
More of these techniques are in the VCL documentation. It’s really satisfying to be able to cut and paste blocks of code together to get the VU program you need and not need to worry about pairing instructions and inserting nops. Built-in Profiling Registers. The EE Core, like all MIPS processors, has a number of performance registers built into Coprocessor 0 (the CPU control unit). The PerfTest class reads these registers and can print out a running commentary on the efficiency of any section of code you want to sample. Performance Analyzer. SCE have just announced it’s hardware Performance Analyzer (PA). It’s a hardware device that samples activity on the busses and produces graphs, analysis and in depth insights into your algorithms. Currently the development support offices are being fitted with these devices and your teams will be able to book consultation time with them. ______________________________________________________ |
||||||||||||||||||||||||
|
|