Gamasutra is part of the Informa Tech Division of Informa PLC

This site is operated by a business or businesses owned by Informa PLC and all copyright resides with them. Informa PLC's registered office is 5 Howick Place, London SW1P 1WG. Registered in England and Wales. Number 8860726.


Gamasutra: The Art & Business of Making Gamesspacer
Procedural Rendering on Playstation 2
View All     RSS
May 18, 2021
arrowPress Releases
May 18, 2021
Games Press
View All     RSS







If you enjoy reading this site, you might also want to check out these UBM Tech sites:


 

Procedural Rendering on Playstation 2


September 26, 2001 Article Start Previous Page 7 of 7
 

Because the VU units don’t have a special dot product instruction we have to piece this together out of multiplies and adds. It turns out that doing three dot products takes the same time as doing one so we may as well use three light sources:

 

color = Ksurface * Sum(n, In * N.Ln)
= Kr * (I0,r*N.L0 + I1,r*N.L1 + I2,r*N.L2 )
Kg * (I0,r*N.L0 + I1,r*N.L1 + I2,r*N.L2 )
Kb * (I0,r*N.L0 + I1,r*N.L1 + I2,r*N.L2 )

So, first we calculate the dot products into a single vector – this only works because our light vectors are stored in a transposed matrix:

x y z w
lighting0 = | L0.x L1.x L2.x 0 |
lighting1 = | L0.y L1.y L2.y 0 |
lighting1 = | L0.z L1.z L2.z 0 |

N.L0 =
N.L1 =
N.L2 =
L0.x * N.x
L1.x * N.x
L2.x * N.x
+ L0.y * N.y
+ L1.y * N.y
+ L2.y * N.y
+ L0.z * N.z
+ L1.z * N.z
+ L2.z * N.z
  mulax.xyz … madday.xyz … maddz.xyz …

 

Then we multiply through by the light colors to get the final vertex color:

r =
g =
b =
I0r * N.L0
I0g * N.L0
I0b * N.L0
+ I1r * N.L1
+ I1g * N.L1
+ I1b * N.L1
+ I2r * N.L2
+ I2g * N.L2
+ I2b * N.L2
+ 1.0 * Ar
+ 1.0 * Ag
+ 1.0 * Ab
  mulax.xyz … madday.xyz … maddaz.xyz … maddw.xyz …


Header. The header contains the additional information needed to render this particular instance of the primitive – a 4x4 object-screen matrix, a 3x3 matrix to transform the normals into world space for lighting calculations, the surface color for this primitive, the address of the input matrix and the address of where to put the transformed vertices.

All this information is calculated during the previous frame, embedded in a DMA packet and uploaded once per primitive at rendering time.

Untransformed Vertices. After the header is stored a GIF Tag (from which we can work out the number of vertices in the packet) and the untransformed vertices and normals.

The VU Program
Now we have all the information needed to design the VU program. We know where the matrices are going to be stored, we know where to get our GIF tags from and we know how many verts need to be transformed for each primitive (it’s the last 8 bits of the GIF tag). We will need to:

  • Transform the vertex to screen space, divide by W and covert the result to an integer.
  • Transform the normal into light space (a 3x3 matrix calculated by the horn function).
  • Calculate N.L and multiply it by the light colors.
  • Multiply the resulting light intensity by the surface color.
  • Store the results in the output buffer and loop.

VCL compiles the program resulting in an inner loop of 22 cycles. This can be improved (see later) but it’s not bad for so little effort.

Running Order

The first job the program has is to upload the VU programs to VU Program Memory. There are two programs in the packet, one for transforming and lighting toruses and one for transforming and
lighting spheres, positioned one after the other. Uploading is achieved using an mpg VIF Tag that loads the program starting at a specified location in VU Program Memory.

A short script generates an object file that can be linked into your executable, and also defines four global variables for you to use as extern pointers. vu1_packet_begin and vu1_packet_end allow you to get the starting address and (should you want it) the length of the of the DMA packet. torus_start_here and sphere_start_here are the starting addresses of the two programs relative to the start of VU Program Memory. You can use these values for the mscal VIF instruction.


void upload_vu1_programs()
{
CSCDmaPacket vu1_upload((uint128*)






vu1_upload.Send();
}


(&vu1_packet_begin),
((uint32)vu1_packet_end) / 16,
DMAC::Channels::vif1,
Packet::kXferTags,
Core::MemMappings::Normal,
Packet::kFull );

The program then enters it’s rendering loop. The job of the rendering loop is to render the previous frame and calculate the DMA packets for the next frame. To do this we define two global DMA lists in uncached accelerated main memory:

CVifSCDmaPacket *packet = new CVifSCDmaPacket(80000,
DMAC::Channels::vif1,
Packet::kDontXferTags,
Core::MemMappings::UncachedAccl );
CVifSCDmaPacket*last_packet = new CVifSCDmaPacket(80000,
DMAC::Channels::vif1,
Packet::kDontXferTags,
Core::MemMappings::UncachedAccl );

For the first iteration we fill the previous frame with an empty DMA tag so that it will do nothing.

last_packet->End();
last_packet->CloseTag();

From this point on all data for the next frame gets appended to
the end of the current DMA packet called, usefully, packet.

The next job is to upload the constants. This is done once per frame, just in case you want to animate the lighting for each render. Also in this packet we set up the double buffering base and offset.

void upload_light_constants(CVifSCDmaPacket *packet, mat_44 &direction, mat_44 &color)
{
// upload light constants at location 0 in VU1 memory
packet->Cnt();
{
packet->Base(349);
packet->Offset(337);
packet->Stcycl(1,1);
packet->OpenUnpack(Vifs::UnpackModes::v4_32, 0, Packet::kSingleBuff);
{
packet->Add(direction.get_col0());
packet->Add(direction.get_col1());
packet->Add(direction.get_col2());
packet->Add(color.get_col0());
packet->Add(color.get_col1());
packet->Add(color.get_col2());
packet->Add(color.get_col3());
}
packet->CloseUnpack();
}
packet->CloseTag();
}

After all this it’s time to actually render some primitives. First we have to upload the untransformed vertices in to buffer A. These verts are calculated once, procedurally, at the beginning of the program and stored in a VIF RET packet, allowing the DMA stream to execute a call and return something like a function call.

 

if(inform->type == torus) {
packet->Call(*(inform->m_torus.vif_packet));
packet->CloseTag();
} else if(inform->type == sphere) {
packet->Call(*(inform->m_sphere.vif_packet));
packet->CloseTag();
}

After the data had been uploaded to buffer A we can set about generating instances of the primitive. To do this, all we have to set the header information at and call the program. Lather, rinse, repeat.

void Torus::add_header_packet(CVifSCDmaPacket *packet,
mat_44 &object_screen,
mat_44 &object_light,
vec_xyz surface_color)
{
packet->Nop();
packet->Nop();
packet->Flushe();
packet->OpenUnpack(Vifs::UnpackModes::v4_32, 7, Packet::kSingleBuff);
{
packet->Add(object_screen); // 4x4 matrix
packet->Add(object_light.get_col0());
packet->Add(object_light.get_col1());
packet->Add(object_light.get_col2());
packet->Add(surface_color * 255.0f);
packet->Add(input_buffer);
packet->Add(output_buffer);
}
packet->CloseUnpack();
packet->Mscal((uint32)torus_start_here >> 3);
packet->Nop();
packet->Nop();
packet->Nop();
}

So we’ve generated all the horns and filled the DMA stream for the next frame. All that’s left to do is to flip the double buffered screen to show the previous render, swap the buffer pointers (making the current packet into the previous packet) and render the previous frame.

 

// wait for vsync
wait_for_vsync();

// wait for the current frame to finish drawing (should be done by now)...
wait_for_GS_to_finish();

// ...then swap double buffers...
swap_screen_double_buffers();

// ... and send the next frame for rendering.
packet->Send(Packet::kDontWait, Packet::kFlushCache);

// swap active and previous packets.
CVifSCDmaPacket *temp_packet = packet;
packet = last_packet;
last_packet = temp_packet;

// clear the current packet for new data
packet->Reset();

Further Optimizations and Tweaks
The program as it stands is not as optimal as it could be. Here are a couple of ideas for increasing the speed of the program.

  • Move the light transform out of the rendering inner loop. The inner loop currently stands at 22 cycles per vertex, mainly because each vertex normal has to be transformed from object space into world space for lighting. There are tens of normals per primitive but only one lighting matrix. It would be more efficient to transform the light direction matrix to object space once per primitive and use that matrix for lighting calculations. This would save at least 4 cycles per vertex.
  • Double buffer the Header. Double buffering the header would allow you to remove the Flushe() in rendering the primitives.
  • Load the constants only once. A tiny optimization with little real effect on polygon throughput (seven cycles per primitive), but it would tidy things up a little.
  • Code the horn algorithm as a VU0 Micro Mode program. The slowest part of the program are the string of matrix multiplies used to position and scale each primitive. Each matrix multiply, although executed using Macro Mode VU0 code, is not using VU0 to it’s full. The routine could be coded as a VU0 Micro Mode program that takes the variables needed to specify a horn and generates the 3 to 50 matrices needed per horn in one go. The major problem with this conversion is that VU0 has no instructions for calculating sin() and cos() or calculating a powf(), but these are just programming problems. Algorithms and table based approximations for these functions are simple to find on the net. For example, we only have to evaluate sin() and cos() at fixed intervals allowing us to use forward differencing, Taylor Series approximations or Goertzels algorithm to generate the values. Other functions can be derived from more primitive power series:

    powf(float x,float y) = exp( y * log(x) )

    The only drawback of this technique is that the EE Core will have to wait for the full set of matrices to be returned before it can generate the VU header packets. However, if you think about it, the EE Core is already waiting for VU0 in tiny pieces scattered throughout the programs execution. The work has to be done anyway so why not batch it together and use the computation time constructively.
  • Move the whole Horn algorithm into VU1. It’s possible to move the whole algorithm into VU1, even to the point of generating the vertices of each primitive at runtime. The benefits to bus bandwidth are obvious – all you send across are a handful of instructions and floats per horn and nothing more, plus there would be a lot of satisfaction in creating such a monster program. The drawback is more sticky though - you would again be serializing the operation to only one unit. It may ultimately be more efficient to distribute the job over several processors.

Results

 


Article Start Previous Page 7 of 7

Related Jobs

Remedy Entertainment
Remedy Entertainment — Espoo, Finland
[05.18.21]

Outsourcing Manager
Square Enix Co., Ltd.
Square Enix Co., Ltd. — Tokyo, Japan
[05.18.21]

Experienced Game Developer
Sunday GmbH
Sunday GmbH — Hamburg, Germany
[05.18.21]

Game Lead (m/w/d)
Deep Silver Volition
Deep Silver Volition — Champaign, Illinois, United States
[05.17.21]

Senior Project Manager





Loading Comments

loader image