It's free to join Gamasutra!|Have a question? Want to know who runs this site? Here you go.|Targeting the game development market with your product or service? Get info on advertising here.||For altering your contact information or changing email subscription preferences.
Registered members can log in here.Back to the home page.    

Search articles, jobs, buyers guide, and more.

By Robin Green
Gamasutra
[Author's Bio]
September 26, 2001

Lifeforms

The Transformation List

TenThings Nobody Told You About PS2

VU Dataflow Designs

The Design

Printer Friendly Version

 

Letters to the Editor:
Write a letter
View all letters


Features

Procedural Rendering on Playstation 2

The Design

The Lifeform program has to instance many copies of the same primitive, each one with a different object to world transformation matrix. To achieve this efficiently we can use a variation on the Triple buffering system. This explanation will add a few more practical detail than the earlier dataflow examples.

First, we unpack the data for an example primitive into buffer A. This data packet contains everything we need to render a single primitive – GIF tag and vertices – except the vertices are all in object space. We will need to transform the verts to world space and calculate RGB values for the vertices for Gouraud shading.


Timing diagram for the Lifeform renderer.

Next we upload an object-to-screen transformation matrix which is the concatenation of:

     camera-screen * world-camera * object-world

where the object-world matrix was calculated by the Horn algorithm. Multiplying the object space vertices with this matrix will transform them directly to screen space ready for conversion to raster space.

We then execute the VU program which transforms the object space verts in buffer A to buffer B and xgkicks them.

Simultaneously we upload a new header to VU memory and attempt to start processing buffer B with an mscal, stalling until the VU has finished processing.

Here is a diagram of the data flow rendering two horns, the first a horn of three torii and the second a horn of two spheres. Due to the first torus, say, taking up a lot of screen space, it causes the xgkick of the second torus to wait until rendering is complete:

Untransformed Primitives
Because the models are going to be procedural, we have to calculate instances of the untransformed primitives to upload when they are needed. It would be useful if we could package the data up into a single DMA packet that could just be referenced (like a procedure call) rather than copying them every time they are needed. Using the DMA tags call and ret we can do just that.

We want the spheres and toruses to render as quickly as possible. Here we come to a dilemma. In PC graphics we are always told to minimize the bus bandwidth and use indexed primitives. It turns out that VU code is not well suited to indirection – it’s optimized for blasting through VU instructions linearly. The overhead of reordering indexed primitives far outweighs the benefits in bus upload speed so, at least for this program, the best solution is just to produce long triangle strips.

A torus can simply be represented as one single long strip if we are allowed to skip rendering certain triangles. The ADC bit in can achieve this – we pass the ADC bit to the VU program in the w component of our surface normals, but you could just as easily pass it in the lowest bit of any 32-bit float. The accuracy is almost always not required.

Spheres cannot be properly described by a single tristrip without duplicating a lot of vertices and edges. Here I opted to produce spheres as (in this order) two trifans (top and bottom) plus one tristrip for the inner “bands” of triangles. This is the reason we have two VU programs – the sphere must embed three GIF tags in it’s stream rather than just one for the torus.

Tristrip vertex orders fro 6x4 torus and a 6x4 sphere.

We are free to calculate the vertices as for an indexed primitive, but they must be stored as triangle strips in a DMA packet. Here is the code used to do the conversion for the torus and there is very similar code for the sphere.

First we dynamically create a new DMA packet and the GIF tag:

vif_packet = new CVifSCDmaPacket( (512,
DMAC::Channels::vif1,
Packet::kDontXferTags,
Core::MemMappings::Normal );

Then we setup the pointers to our indexed vertices. Vertices are in vertex_data[] , normals are in normal_data[] and indices are in strip_table[].

uint128 *vptr = (uint128 *)vertex_data;
uint128 *nptr = (uint128 *)normal_data;
uint *strip = strip_table;
uint n = 0;

Finally we loop over the vertex indices to produce a single packet of vertices and normals that can be used to instance toruses:



VU Memory Layout

In order to actually write VU program to run this algorithm, we need to first block out the VU memory map so we can work out where to upload the matrices using VIF unpack commands.

Here is the layout I used. The VU memory is broken into five areas – the constants, the header, the untransformed vertices with GIF tags (buffer A) and the two buffers for transformed vertices (buffers B & C).

Constants. The constants for rendering a horn are a 3x3 light direction matrix for parallel lighting, and a 3x4 matrix of light colors (three RGB lights plus ambient).

The light direction matrix is a transposed matrix of unit vectors allowing lighting to be calculated as a single 3x3 matrix multiply. To light a vertex normal we have to calculate the dot product between the surface normal and direction to the light source. In math, the end result looks like this:

color = Ksurface * ( Ilight * N.L )
 = Kr * ( Ir * N.L )
   Kg * ( Ig * N.L )
   Kb * ( Ib * N.L )

where N is the unit surface normal, I is illumination and K is a reflectance function (e.g. the surface color).

Because the VU units don’t have a special dot product instruction we have to piece this together out of multiplies and adds. It turns out that doing three dot products takes the same time as doing one so we may as well use three light sources:

color = Ksurface * Sum(n, In * N.Ln)
 = Kr * (I0,r*N.L0 + I1,r*N.L1 + I2,r*N.L2 )
   Kg * (I0,r*N.L0 + I1,r*N.L1 + I2,r*N.L2 )
   Kb * (I0,r*N.L0 + I1,r*N.L1 + I2,r*N.L2 )

So, first we calculate the dot products into a single vector – this only works because our light vectors are stored in a transposed matrix:

               x    y    z   w
lighting0 = | L0.x L1.x L2.x 0 |
lighting1 = | L0.y L1.y L2.y 0 |
lighting1 = | L0.z L1.z L2.z 0 |

     
N.L0 =
N.L1 =
N.L2 =
L0.x * N.x
L1.x * N.x
L2.x * N.x
+ L0.y * N.y
+ L1.y * N.y
+ L2.y * N.y
+ L0.z * N.z
+ L1.z * N.z
+ L2.z * N.z
mulax.xyz … madday.xyz … maddz.xyz …

 

Then we multiply through by the light colors to get the final vertex color:

r =
g =
b =
I0r * N.L0
I0g * N.L0
I0b * N.L0
+ I1r * N.L1
+ I1g * N.L1
+ I1b * N.L1
+ I2r * N.L2
+ I2g * N.L2
+ I2b * N.L2
+ 1.0 * Ar
+ 1.0 * Ag
+ 1.0 * Ab
mulax.xyz … madday.xyz … maddaz.xyz … maddw.xyz …


Header. The header contains the additional information needed to render this particular instance of the primitive – a 4x4 object-screen matrix, a 3x3 matrix to transform the normals into world space for lighting calculations, the surface color for this primitive, the address of the input matrix and the address of where to put the transformed vertices.

All this information is calculated during the previous frame, embedded in a DMA packet and uploaded once per primitive at rendering time.

Untransformed Vertices. After the header is stored a GIF Tag (from which we can work out the number of vertices in the packet) and the untransformed vertices and normals.

The VU Program
Now we have all the information needed to design the VU program. We know where the matrices are going to be stored, we know where to get our GIF tags from and we know how many verts need to be transformed for each primitive (it’s the last 8 bits of the GIF tag). We will need to:

  • Transform the vertex to screen space, divide by W and covert the result to an integer.
  • Transform the normal into light space (a 3x3 matrix calculated by the horn function).
  • Calculate N.L and multiply it by the light colors.
  • Multiply the resulting light intensity by the surface color.
  • Store the results in the output buffer and loop.

VCL compiles the program resulting in an inner loop of 22 cycles. This can be improved (see later) but it’s not bad for so little effort.

Running Order

The first job the program has is to upload the VU programs to VU Program Memory. There are two programs in the packet, one for transforming and lighting toruses and one for transforming and
lighting spheres, positioned one after the other. Uploading is achieved using an mpg VIF Tag that loads the program starting at a specified location in VU Program Memory.

A short script generates an object file that can be linked into your executable, and also defines four global variables for you to use as extern pointers. vu1_packet_begin and vu1_packet_end allow you to get the starting address and (should you want it) the length of the of the DMA packet. torus_start_here and sphere_start_here are the starting addresses of the two programs relative to the start of VU Program Memory. You can use these values for the mscal VIF instruction.


void upload_vu1_programs()
{
     CSCDmaPacket vu1_upload((uint128*)






 vu1_upload.Send();
}


(&vu1_packet_begin),
((uint32)vu1_packet_end) / 16,
DMAC::Channels::vif1,
Packet::kXferTags,
Core::MemMappings::Normal,
Packet::kFull );

The program then enters it’s rendering loop. The job of the rendering loop is to render the previous frame and calculate the DMA packets for the next frame. To do this we define two global DMA lists in uncached accelerated main memory:

CVifSCDmaPacket *packet = new CVifSCDmaPacket(80000,
 DMAC::Channels::vif1,
 Packet::kDontXferTags,
 Core::MemMappings::UncachedAccl );
CVifSCDmaPacket*last_packet = new CVifSCDmaPacket(80000,
 DMAC::Channels::vif1,
 Packet::kDontXferTags,
 Core::MemMappings::UncachedAccl );

For the first iteration we fill the previous frame with an empty DMA tag so that it will do nothing.

    last_packet->End();
   last_packet->CloseTag();

From this point on all data for the next frame gets appended to
the end of the current DMA packet called, usefully, packet.

The next job is to upload the constants. This is done once per frame, just in case you want to animate the lighting for each render. Also in this packet we set up the double buffering base and offset.

  void upload_light_constants(CVifSCDmaPacket *packet,   mat_44 &direction, mat_44 &color)
  {
    // upload light constants at location 0 in VU1 memory
    packet->Cnt();
    {
      packet->Base(349);
      packet->Offset(337);
      packet->Stcycl(1,1);
      packet->OpenUnpack(Vifs::UnpackModes::v4_32, 0, Packet::kSingleBuff);
      {
        packet->Add(direction.get_col0());
        packet->Add(direction.get_col1());
        packet->Add(direction.get_col2());
        packet->Add(color.get_col0());
        packet->Add(color.get_col1());
        packet->Add(color.get_col2());
        packet->Add(color.get_col3());
      }
      packet->CloseUnpack();
    }
    packet->CloseTag();
  }

After all this it’s time to actually render some primitives. First we have to upload the untransformed vertices in to buffer A. These verts are calculated once, procedurally, at the beginning of the program and stored in a VIF RET packet, allowing the DMA stream to execute a call and return something like a function call.

 

 if(inform->type == torus) {
    packet->Call(*(inform->m_torus.vif_packet));
    packet->CloseTag();
 } else if(inform->type == sphere) {
    packet->Call(*(inform->m_sphere.vif_packet));
    packet->CloseTag();
 }

After the data had been uploaded to buffer A we can set about generating instances of the primitive. To do this, all we have to set the header information at and call the program. Lather, rinse, repeat.

  void Torus::add_header_packet(CVifSCDmaPacket *packet,
                            mat_44 &object_screen,
                            mat_44 &object_light,
                            vec_xyz surface_color)
  {
    packet->Nop();
    packet->Nop();
    packet->Flushe();
    packet->OpenUnpack(Vifs::UnpackModes::v4_32, 7, Packet::kSingleBuff);
    {
      packet->Add(object_screen); // 4x4 matrix
      packet->Add(object_light.get_col0());
      packet->Add(object_light.get_col1());
      packet->Add(object_light.get_col2());
      packet->Add(surface_color * 255.0f);
      packet->Add(input_buffer);
      packet->Add(output_buffer);
    }
    packet->CloseUnpack();
    packet->Mscal((uint32)torus_start_here >> 3);
    packet->Nop();
    packet->Nop();
    packet->Nop();
  }

So we’ve generated all the horns and filled the DMA stream for the next frame. All that’s left to do is to flip the double buffered screen to show the previous render, swap the buffer pointers (making the current packet into the previous packet) and render the previous frame.

// wait for vsync
wait_for_vsync();

// wait for the current frame to finish drawing (should be done by now)...
wait_for_GS_to_finish();

// ...then swap double buffers...
swap_screen_double_buffers();

// ... and send the next frame for rendering.
packet->Send(Packet::kDontWait, Packet::kFlushCache);

// swap active and previous packets.
CVifSCDmaPacket *temp_packet = packet;
packet = last_packet;
last_packet = temp_packet;

// clear the current packet for new data
packet->Reset();

Further Optimizations and Tweaks
The program as it stands is not as optimal as it could be. Here are a couple of ideas for increasing the speed of the program.

  • Move the light transform out of the rendering inner loop. The inner loop currently stands at 22 cycles per vertex, mainly because each vertex normal has to be transformed from object space into world space for lighting. There are tens of normals per primitive but only one lighting matrix. It would be more efficient to transform the light direction matrix to object space once per primitive and use that matrix for lighting calculations. This would save at least 4 cycles per vertex.
  • Double buffer the Header. Double buffering the header would allow you to remove the Flushe() in rendering the primitives.
  • Load the constants only once. A tiny optimization with little real effect on polygon throughput (seven cycles per primitive), but it would tidy things up a little.
  • Code the horn algorithm as a VU0 Micro Mode program. The slowest part of the program are the string of matrix multiplies used to position and scale each primitive. Each matrix multiply, although executed using Macro Mode VU0 code, is not using VU0 to it’s full. The routine could be coded as a VU0 Micro Mode program that takes the variables needed to specify a horn and generates the 3 to 50 matrices needed per horn in one go. The major problem with this conversion is that VU0 has no instructions for calculating sin() and cos() or calculating a powf(), but these are just programming problems. Algorithms and table based approximations for these functions are simple to find on the net. For example, we only have to evaluate sin() and cos() at fixed intervals allowing us to use forward differencing, Taylor Series approximations or Goertzels algorithm to generate the values. Other functions can be derived from more primitive power series:

        powf(float x,float y) = exp( y * log(x) )

    The only drawback of this technique is that the EE Core will have to wait for the full set of matrices to be returned before it can generate the VU header packets. However, if you think about it, the EE Core is already waiting for VU0 in tiny pieces scattered throughout the programs execution. The work has to be done anyway so why not batch it together and use the computation time constructively.
  • Move the whole Horn algorithm into VU1. It’s possible to move the whole algorithm into VU1, even to the point of generating the vertices of each primitive at runtime. The benefits to bus bandwidth are obvious – all you send across are a handful of instructions and floats per horn and nothing more, plus there would be a lot of satisfaction in creating such a monster program. The drawback is more sticky though - you would again be serializing the operation to only one unit. It may ultimately be more efficient to distribute the job over several processors.

Results

______________________________________________________

[Back to] Lifeforms


join | contact us | advertise | write | my profile
news | features | companies | jobs | resumes | education | product guide | projects | store



Copyright © 2002 CMP Media LLC. All rights reserved.
privacy policy | terms of service