
Procedural Rendering on Playstation 2
VU Dataflow Designs

In this section we’ll look at the reasoning behind some of the design
decisions we’ll need to make when translating the Lifeform algorithm
to PS2. The first decision is where to draw the line between CPU and
VU calculations. There are several options, in order of increasing
difficulty:
- The main program calculates an abstract list of triangle strips for
the VU to transform and light.
- The main program calculates transform and lighting matrices for each
primitive (sphere, torus). The VU is passed a static description of an
example primitive in object space, which it transforms into world
space, lights and displays.
- The main program sends the VU a description of a single horn plus
an example primitive in object space. The VU iterates along the
horn, calculating the transform and lighting matrices for each
rib, and instances the primitives into world space using these
matrices.
- The main program sends the VU a numerical description of the entire
model, and the VU does everything else.

For this tutorial I chose the second option as a halfway house between
full optimization and simple triangle lists. This path is not too far
removed from ordinary engine programming, and it leaves the final
program wide open for further optimizations.

Keep Your Eye On The Prize

Before setting out it’s useful to remember what the ultimate aim
of the design is: to send correct GIF Tags and associated data to
the GS. A GIF Tag contains quite a bit of information, but only
a few of its fields are important.

GIF Tags are the essence of PS2 graphics programming. They tell the
GS how much data to expect and in what format, and they contain all
the necessary information for an xgkick instruction to transfer
data from VU1 to the GS automatically. Create the correct GIF Tag
for your set of N vertices and everything else is pretty much automatic.
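
To make "create the correct GIF Tag" a little more concrete, here is a
minimal sketch that packs the 128-bit tag for a strip of N vertices in
PACKED mode. The bit positions (NLOOP in bits 0-14, EOP in bit 15, PRE
in bit 46, PRIM in bits 47-57, FLG in bits 58-59, NREG in bits 60-63,
register descriptors in the upper 64 bits) and the register codes are
assumptions to check against the GS manual, and every function and
constant name is hypothetical rather than taken from any SDK.

```python
# Sketch of GIF tag packing. Field positions and codes are assumptions
# to verify against the GS manual; all names are illustrative only.

# Register descriptor codes for the REGS field (assumed values).
REG_RGBAQ = 0x1
REG_ST = 0x2
REG_XYZ2 = 0x5

PRIM_TRIANGLE_STRIP = 0x4  # assumed PRIM type code for a triangle strip
FLG_PACKED = 0x0           # PACKED data format


def make_gif_tag(nloop, regs, prim=0, pre=False, eop=True, flg=FLG_PACKED):
    """Return (lo, hi): the lower and upper 64 bits of a GIF tag."""
    if not 0 <= nloop < (1 << 15):
        raise ValueError("NLOOP must fit in 15 bits")
    if not 1 <= len(regs) <= 16:
        raise ValueError("a GIF tag describes 1 to 16 registers")
    nreg = len(regs) & 0xF  # 16 registers is encoded as 0
    lo = (
        nloop
        | (int(eop) << 15)
        | (int(pre) << 46)
        | ((prim & 0x7FF) << 47)
        | ((flg & 0x3) << 58)
        | (nreg << 60)
    )
    hi = 0
    for slot, reg in enumerate(regs):
        hi |= (reg & 0xF) << (4 * slot)  # first register in the low nibble
    return lo, hi


def qwords_after_tag(nloop, regs):
    """Quadwords of PACKED data that follow the tag: NLOOP * NREG."""
    return nloop * len(regs)


# A strip of 12 vertices, each sent as ST, RGBAQ and XYZ2.
strip_regs = [REG_ST, REG_RGBAQ, REG_XYZ2]
lo, hi = make_gif_tag(12, strip_regs, prim=PRIM_TRIANGLE_STRIP, pre=True)
```

With these assumed layouts the tag tells the GS to expect 12 loops of
three registers, so qwords_after_tag(12, strip_regs) quadwords of vertex
data must sit directly behind the tag in VU1 memory when the xgkick is
issued.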

The complications in PS2 programming arise when you start to work out
how to get the GIF Tags to the GS. There are three routes into the
GS. One is direct and driven by DMA (Path 3). One is indirect and
goes via VIF1 (Path 2), which is useful for redirecting part of a DMA
stream to the GS. The third is very indirect (Path 1) and requires
you to DMA data into VU1 memory and execute a program that ends
in an xgkick.

Choosing a Data Flow Design

The first problem is to choose a dataflow design. The next few pages
contain examples of basic data flows that can be used as starting
points for your algorithms. Each dataflow is described two ways:
once using its physical layout in memory, showing where data moves
to and from, and once as a time line showing how stall conditions
and VIF instructions control the synchronization between the different
pieces of hardware.

The diagrams are pretty abstract in that they only outline the data
movement and control signals necessary, but don’t show areas for
constants, precalculated data or uploading the VU program code.
We’ll be covering all these details later when we go in-depth into
the actual algorithm used to render the Lifeform primitives.

The other point to note is that none of these diagrams take into
consideration the effect of texture uploads on the rendering sequence.
This is a whole other tutorial for another day...

Single Buffer

Single buffering takes one big buffer of source data and processes
it in place. After processing, the result is transferred to the GS
with an xgkick, during which the DMA stream has to wait for completion
of rendering before uploading new data for the next pass.

Benefits:
- Process a large amount of data in one go.
Drawbacks:
- Processing is nearly serial: DMA, VU and GS are mostly left waiting.

First, the VIF unpacks a chunk of data into VU1 memory (unpack).

Next, the VU program is called to process the data (mscal).

When the data has been processed, the result is transferred from
VU1 memory to the GS by an xgkick command. Because the GIF is going
to be reading the transformed data from VU memory, we can’t upload
more data until the xgkick has finished, hence the need for a flush.
(There are three VIF flush commands: flush, flushe and flusha. Of
these, flush waits for the end of both the VU program and the data
transfer to the GS.)

When the flush returns, the process loops.
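
The loop above can be sketched as a toy model. The command names mirror
the VIF codes in the text, but the tuple format, the function names and
the time units are all made up for illustration; nothing here is real
VIF packet encoding.

```python
# Toy model of the single-buffer loop. Command names mirror the VIF codes
# in the text; the tuple format and the time units are made up.

def single_buffer_stream(chunks):
    """Yield the symbolic VIF commands that render each chunk in turn."""
    for chunk in chunks:
        yield ("unpack", chunk)  # VIF copies the chunk into VU1 memory
        yield ("mscal", chunk)   # VU program processes it and ends in xgkick
        yield ("flush", chunk)   # wait for the program and the GS transfer


def single_buffer_time(n_chunks, upload, process, kick):
    """Serial cost: upload, processing and xgkick never overlap."""
    return n_chunks * (upload + process + kick)


commands = list(single_buffer_stream(["chunk0", "chunk1"]))
total = single_buffer_time(n_chunks=4, upload=2, process=3, kick=2)
```

Because every chunk pays for all three stages back to back, the cost in
this model grows as the plain sum of the stage times, which is the
"nearly serial" drawback in numbers.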

Double Buffer

Double buffering speeds up the operation by allowing you to upload
new data simultaneously with rendering the previous buffer.

Benefits:
- Uploading in parallel with rendering.
- Works with data amplification, where the VU generates more data than
was uploaded, e.g. 16 verts expanded into a Bezier patch. Areas A and B
do not need to be the same size.
Drawbacks:
- Less data per chunk.
- Although more parallel, VU calculation is still serialized.

Data is unpacked to area A. The DMA stream then uses a flush to wait
for the VU to finish transferring buffer B to the GS (for the first
iteration this should return immediately).

The VU then processes buffer A into buffer B (mscal) while the DMA
stream waits for the program to finish (flushe).

When the program has finished processing buffer A, the DMA is free
to upload more data into it, while simultaneously buffer B is being
transferred to the GS via Path 1 (xgkick).

The DMA stream then waits for buffer B to finish being transferred
(flush) and the process loops back to the beginning.
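
A rough timing model shows what this overlap buys. The sketch below
walks the same steps (unpack, flush, mscal, flushe, then upload in
parallel with the xgkick) using invented time units; the function name
and parameters are hypothetical, and real timings depend on bus
contention that this model ignores.

```python
# Toy timing model of double buffering. Time units are arbitrary and the
# model ignores bus contention; it only tracks who waits for whom.

def double_buffer_time(n_chunks, upload, process, kick):
    """Total time when uploading A overlaps the xgkick of B."""
    if n_chunks == 0:
        return 0
    t = upload          # first chunk is unpacked into buffer A
    kick_done = 0       # time at which buffer B is free again
    for i in range(n_chunks):
        t = max(t, kick_done)   # flush: wait for the previous xgkick
        t += process            # mscal + flushe: A is processed into B
        kick_done = t + kick    # xgkick starts sending B to the GS
        if i + 1 < n_chunks:
            t += upload         # next chunk uploads while B is kicked
    return max(t, kick_done)


overlapped = double_buffer_time(n_chunks=4, upload=2, process=3, kick=2)
serial = 4 * (2 + 3 + 2)  # the same work with no overlap at all
```

In this model each steady-state chunk costs the processing time plus
whichever is longer of the upload and the xgkick, so the saving over the
serial case is the shorter of those two stages per chunk.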

Quad Buffer

Quad buffering is the default choice for most PS2 VU programs. The
VU memory is split into two areas of equal size, each area double
buffered. When set up correctly, VIF1’s TOP and TOPS registers
will automatically direct data to the correct buffers.

Benefits:
- Good use of parallelism: uploading, calculating and rendering
all take place simultaneously, much like a RISC instruction pipeline.
- Works well with the double buffering registers TOP and TOPS, which
may have caching advantages.
- The best technique for out-of-place processing of vertices or data
amplification.
Drawbacks:
- Data can only be processed in <8KB chunks.
- There are three devices accessing the same area of memory at the same
time: VU, VIF and GIF. The VU has read/write priority (at 300MHz) over
the GIF (150MHz), which has priority over the VIF (150MHz). Higher
priority devices cause lower priority devices to stall if there is any
contention, meaning there are hidden wait states in this technique.

First, the DMA stream sets the base and offset for double buffering.
Usually the base is 0 and the offset is half of VU1 memory, 512 quads.

The data is uploaded into buffer A (unpack), remembering to use the
double buffer offset. The program is called (mscal), which swaps
the TOP and TOPS registers, so any subsequent unpack instructions
will be directed to buffer C.

The DMA stream then immediately unpacks data to buffer C and attempts
to execute another mscal. This instruction cannot be executed as
the VU is already running a program, so the DMA stream will stall
until the VU has finished processing buffer A into B.

When the VU has finished processing, the mscal will succeed, causing the
TOP and TOPS registers to be swapped again. The VU program will
begin to process buffer C into D while simultaneously transferring
buffer B to the GS.

This process of stalls and buffer swaps continues until all VIF packets
have been completed.
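
The register swapping described above can be modelled in a few lines.
This is a sketch under an assumed reading of the hardware: that each
mscal copies TOPS into TOP and then flips TOPS between the base and
base + offset. The class and method names are hypothetical, and
addresses are in quadwords.

```python
# Toy model of the double-buffer registers. The swap rule is an assumed
# reading of the hardware behaviour; names are illustrative only.

VU1_QWORDS = 1024  # 16KB of VU1 data memory in 128-bit quadwords


class Vif1DoubleBuffer:
    """Tracks where unpacks land (TOPS) and where the VU reads (TOP)."""

    def __init__(self, base=0, offset=VU1_QWORDS // 2):
        self.base = base
        self.offset = offset
        self.tops = base      # where the next unpack will land
        self.top = base       # where the running VU program finds its input
        self.flipped = False

    def unpack_address(self, addr):
        """Destination of an unpack that uses the double-buffer offset."""
        return self.tops + addr

    def mscal(self):
        """Start a VU program: hand over the filled buffer, then swap."""
        self.top = self.tops
        self.flipped = not self.flipped
        self.tops = self.base + (self.offset if self.flipped else 0)
        return self.top


vif = Vif1DoubleBuffer()
first_upload = vif.unpack_address(0)   # buffer A at the base
first_input = vif.mscal()              # VU program works on buffer A
second_upload = vif.unpack_address(0)  # now redirected to buffer C
second_input = vif.mscal()             # VU program works on buffer C
third_upload = vif.unpack_address(0)   # back to buffer A
```

The DMA stream never names buffer A or C explicitly in this scheme: it
always unpacks to the same relative address and lets the swap decide
which half of memory receives the data.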

Triple Buffer

Another pipeline technique for parallel uploading, calculation and
rendering, this technique relies on in-place processing of vertices.

Benefits:
- All the benefits of quad buffering with larger buffer sizes.
- Best technique for simple in-place transform and lighting of
precalculated vertices.
Drawbacks:
- Cannot use TOP and TOPS registers: you must handle all offsets
by hand and remember which buffer to use between VU programs.
- Three streams of read/writes again introduce hidden wait states.

Data is transferred directly to buffer A (all destination pointers must
be handled directly by the VIF codes, as TOP and TOPS cannot be used)
and processing is started on it.

Simultaneously, data is transferred to buffer B and another mscal is
attempted. This will stall until processing of buffer A is finished.

Processing on buffer B is started while buffer A is being rendered
(xgkick). Meanwhile buffer C is being uploaded. The three-buffer pipeline
continues to rotate A->B->C until all VIF packets are completed.
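
Since the offsets have to be tracked by hand here, it can help to write
the rotation out explicitly. The sketch below lists, for each pipeline
step, which buffer is being uploaded, processed and kicked. The buffer
size of 320 quadwords is a hypothetical figure chosen only so that three
buffers fit in VU1 memory with room to spare; the function name and the
row format are likewise made up.

```python
# Sketch of the hand-managed A->B->C rotation. Buffer size and names are
# hypothetical; addresses are in quadwords from the start of VU1 memory.

def triple_buffer_schedule(n_chunks, buffer_qwords=320):
    """List, per pipeline step, which buffer each unit is working on."""
    names = "ABC"
    steps = []
    for step in range(n_chunks + 2):      # two extra steps drain the pipe
        row = {}
        for lag, unit in enumerate(("upload", "process", "kick")):
            chunk = step - lag            # each unit trails the one before
            if 0 <= chunk < n_chunks:
                slot = chunk % 3
                row[unit] = (names[slot], slot * buffer_qwords)
            else:
                row[unit] = None          # idle while filling or draining
        steps.append(row)
    return steps


schedule = triple_buffer_schedule(4)
```

Each row holds three different buffers, which is the property that lets
the VIF, VU and GIF run at once; the addresses in the row are what the
unpack codes and the VU program would need to be given explicitly.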

Parallel Processing

This technique is the simplest demonstration of how to use the PS2
to its maximum, with all units fully stressed. All the previous
techniques have used just one VU for processing, and the GS has been
left waiting for more data to render. In this example we use precalculated
buffers of GIF tags and data to fill the gaps in GS processing,
at the cost of large amounts of main memory. Many of the advanced
techniques on PS2 are variations optimized to use less memory.

Benefits:
- All units are fully stressed.
- VU1 can be using any of the previous techniques for rendering.
Drawbacks:
- Moving data from VU0 to Scratchpad efficiently is a complex issue.
- Large amounts of main memory are needed as buffers.

With VU1 running one of the previous techniques (e.g. quad buffering),
the gaps in GS rendering are filled by a Path 3 DMA stream of GIF
tags and data from main memory. Each of the GIF tags must be marked
EOP=1 (end of packet), allowing VU1 to interrupt the GIF tag stream
at the end of any primitive in the stream.

Data is moved from Scratchpad (SPR) to the next-frame buffer using burst
mode. Using slice mode introduces too many delays in bus transfer,
as the DMAC has to arbitrate between three different streams. Better
to allow the SPR data to hog the bus for quick one-off transfers.

Note in the diagram how both the VIF1 mscal and the VU1 xgkick
instructions are subject to stalls if the receiving hardware is
not ready for the new data.