Gamasutra: The Art & Business of Making Games
Procedural Rendering on Playstation 2

September 26, 2001 (Page 4 of 7)

7. Myth: Scratchpad is for speed.
The Scratchpad is the 16KB area of memory that is actually on-chip in the EE Core. Using some MMU shenanigans at boot up time, the EE Core makes Scratchpad RAM (SPR) appear to be part of the normal memory map. The thing to note about SPR is that reads and writes to SPR are uncached and memory accesses don’t go through the memory bus – it’s on-chip and physically sitting next to (actually inside) the CPU.

You could think of scratchpad as a fast area of memory, like the scratchpad on the original PSX, but real-world timings show that it’s not much faster than Uncached Accelerated memory for sequential work or in-cache data for random work. The best way to think of SPR is as a place to work while the data bus is busy - something like a playground surrounded by roads with heavy traffic.

Picture this: your program has just kicked off a huge DMA chain of events that will automatically upload and execute VU programs and move information through the system. The DMAC is moving information from unit to unit over the Memory Bus in 8-qword chunks, checking for interruptions every tick, and the CPU has precedence. The last thing the DMAC needs is to be interrupted every 8 clock cycles by the CPU needing the bus for more data. This is why the designers gave you an area of memory to play with while this happens. Sure, the Instruction and Data caches play their part, but they are primarily there to aid throughput of instructions.

Scratchpad is there to keep you off the data bus – use it to batch up memory writes, then move the data to main memory with burst-mode DMA transfers over the fromSPR DMA channel.
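The batching idea can be modelled on any host machine. What follows is a minimal sketch, not PS2 code: the `Qword` and `ScratchBatcher` names are invented for illustration, a 16KB array stands in for SPR, and a plain vector insert stands in for the fromSPR DMA transfer.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// One quadword is 128 bits (16 bytes).
struct Qword { std::uint32_t word[4]; };

// Hypothetical host-side model: a 16KB staging area standing in for SPR.
class ScratchBatcher
{
public:
    static constexpr std::size_t kCapacity = 16 * 1024 / sizeof(Qword); // 1024 qwords

    explicit ScratchBatcher(std::vector<Qword>& main_memory) : main_(main_memory) {}

    // Write into the staging area; main memory is untouched until it fills.
    void write(const Qword& q)
    {
        assert(used_ < kCapacity);
        staging_[used_++] = q;
        if (used_ == kCapacity) flush();
    }

    // Move everything staged so far in one burst. On real hardware this
    // would be a fromSPR DMA transfer; here it is a plain vector insert.
    void flush()
    {
        if (used_ == 0) return;
        main_.insert(main_.end(), staging_.begin(),
                     staging_.begin() + static_cast<std::ptrdiff_t>(used_));
        used_ = 0;
        ++bursts_;
    }

    // Number of bursts issued so far.
    std::size_t bursts() const { return bursts_; }

private:
    std::array<Qword, kCapacity> staging_{};
    std::size_t used_ = 0;
    std::size_t bursts_ = 0;
    std::vector<Qword>& main_;
};
```

Writing 1,500 qwords through this model touches "main memory" only twice: once when the staging area fills, and once on the final `flush()`.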

8. There is no such thing as “The Pipeline”.
The best way to think about the rendering hardware in PS2 is a series of optimized programs that run over your data and pipe the resulting polygon lists to the GS. Within a frame there may be many different renderers – one for unclipped models, one for procedural models, one for specular models, one for subdivision surfaces, etc.

As each renderer is less than 16KB of VU code, it is very cheap to upload compared to the amount of polygon data it will generate. Program uploads can be embedded inside DMA chains to complete the automation process.
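One way to picture a chain with embedded program uploads is as a flat list of transfers. The sketch below is an assumption-laden model rather than real DMA or VIF tag layout: `Step`, `ChainEntry`, `append_batch` and `total_qwords` are hypothetical names, and the sizes in the usage note are made-up numbers.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical step kinds; these are not actual DMA or VIF tag values.
enum class Step { UploadProgram, UploadData, RunProgram };

struct ChainEntry
{
    Step        step;
    std::string renderer;   // which renderer this entry belongs to
    std::size_t qwords;     // size of the transfer in quadwords
};

using Chain = std::vector<ChainEntry>;

// Append one renderer's work: upload its code, upload its input, run it.
inline void append_batch(Chain& chain, const std::string& renderer,
                         std::size_t program_qwords, std::size_t data_qwords)
{
    chain.push_back({Step::UploadProgram, renderer, program_qwords});
    chain.push_back({Step::UploadData,    renderer, data_qwords});
    chain.push_back({Step::RunProgram,    renderer, 0});
}

// Sum the quadwords moved by all entries of one kind.
inline std::size_t total_qwords(const Chain& chain, Step step)
{
    std::size_t total = 0;
    for (const ChainEntry& e : chain)
        if (e.step == step) total += e.qwords;
    return total;
}
```

Appending an "unclipped" batch and a "specular" batch produces six entries in the order the hardware would walk them. Comparing `total_qwords` for `Step::UploadProgram` against `Step::UploadData` shows how small the program uploads are next to the data (16KB is 1,024 qwords).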


9. Speed is all about the Bus.
This has been said many times before, but it bears repeating. The theoretical speed limits of the GS are pretty much attainable, but only by paying attention to the bus speed. The GS can kick one triangle every clock tick (using tri-strips) at 150MHz. This gives us a theoretical upper limit of:

150 million verts per second = 2.5 million verts / frame at 60Hz

Given that each of these polygons will be flat shaded the result isn’t very interesting. We will need to factor in a perspective transform, clipping and lighting which are done on the VUs, which run at 300MHz. The PS2 FAQ says these operations can take 15 – 20 cycles per vertex typically, giving us a throughput of:

300 million cycles per second / 60 frames per second
= 5 million cycles per frame

5 million cycles / 20 cycles per vertex
= 250,000 verts per frame
= 15 million verts per second

5 million cycles / 15 cycles per vertex
= 333,000 verts per frame
= 20 million verts per second
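The same arithmetic is easy to re-run for other cycle counts. This is a minimal sketch, not code from the article: the function and constant names are made up for illustration, and the clock rates and 60Hz frame rate are the figures quoted in the text.

```cpp
#include <cassert>
#include <cstdint>

constexpr std::int64_t kVuClockHz = 300000000; // 300MHz, from the text
constexpr std::int64_t kGsClockHz = 150000000; // 150MHz, from the text
constexpr std::int64_t kFrameRate = 60;        // frames per second

// Hypothetical helper: clock cycles available in one frame.
constexpr std::int64_t cycles_per_frame(std::int64_t clock_hz,
                                        std::int64_t frames_per_second)
{
    return clock_hz / frames_per_second;
}

// Hypothetical helper: upper bound on vertices processed per frame.
// Integer division, so the result is rounded down.
inline std::int64_t verts_per_frame(std::int64_t clock_hz,
                                    std::int64_t frames_per_second,
                                    std::int64_t cycles_per_vertex)
{
    assert(frames_per_second > 0 && cycles_per_vertex > 0);
    return cycles_per_frame(clock_hz, frames_per_second) / cycles_per_vertex;
}
```

`verts_per_frame(kVuClockHz, kFrameRate, 20)` gives 250,000 and `verts_per_frame(kVuClockHz, kFrameRate, 15)` gives 333,333, while one cycle per vertex at the GS clock gives the 2.5 million figure.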

Notice the difference here. Just by removing five cycles per vertex we get a third more vertices per frame. This is the reason we need different renderers for every situation – each renderer can shave off precious cycles-per-vertex by doing only the work necessary.

This is also the reason we have two VUs – VU1 is often described as the “rendering” VU and VU0 as the “everything else” renderer, but this is not necessarily so. Both can be transforming vertices but only one can be feeding the GIF, and this explains the Memory FIFO you can set up: one VU is feeding the GS while the other is filling the FIFO. It also explains why we have two rendering contexts in the GS, one for each of the two input streams.

10. There are new tools to help you.
Unlike the early days of the PS2, when everything had to be painstakingly pieced together from the manuals and example code, there are now some new tools to help you program the PS2. Most of these are freely available to registered developers from the PS2 support websites and nearly all come with source.

DMA Disassembler. This tool, from SCEE’s James Russell, takes a complete DMA packet, parses it and generates a printout of how the machine will interpret the data block when it is sent. It can report errors in the chain and provides an excellent visual report of your DMA chain.

Packet Libraries. Built by Tyler Daniel, this set of classes allows easy construction of DMA packets, either at fixed locations in memory or in dynamically allocated buffers. The packet classes are styled after insertion-only STL containers and know how to add VIF tags, create all types of DMA packet and will calculate qword counts for you.
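To give a flavour of the "insertion-only container" style, here is a minimal sketch. It is not the actual library's interface: the `Packet` class, its `operator+=`, and the method names are all hypothetical, and it handles only raw 32-bit words rather than real VIF or DMA tags.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical insertion-only packet: 32-bit words can be appended
// but never removed or overwritten.
class Packet
{
public:
    // Append one 32-bit word to the end of the packet.
    Packet& operator+=(std::uint32_t word)
    {
        words_.push_back(word);
        return *this;
    }

    // Pad with zero words up to the next 128-bit boundary.
    void align_to_qword()
    {
        while (words_.size() % 4 != 0) words_.push_back(0);
    }

    // Number of whole quadwords needed to hold the data written so far.
    std::size_t qword_count() const { return (words_.size() + 3) / 4; }

    // Number of 32-bit words written so far, including padding.
    std::size_t word_count() const { return words_.size(); }

private:
    std::vector<std::uint32_t> words_;
};
```

After appending five words, `qword_count()` reports 2, and `align_to_qword()` pads the packet to eight words so it can be sent as whole quadwords.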

Vector Libraries and GCC patch. The GCC inline assembler patch adds a number of new features to the inline assembler:

  • Introduces a new j register type for 128-bit vector registers, allowing the compiler to know that these values are to be assigned to VU0 Macro Mode registers
  • Allows register naming, so more descriptive symbols can be used.
  • Allows access to fields in VU0 broadcast instructions allowing you to, say, template a function across broadcast fields (x, xy, xyz, xyzw)
  • No more volatile inline assembly, the compiler is free to reorder instructions as the context is properly described.
  • No more explicit loading and moving registers to and from VU registers, the compiler is free to keep values in VU registers as long as possible.
  • No need to use explicit registers, unless you want to. The compiler can assign free registers.

The patch is not perfect, as there is still a limit of 10 input and output registers per section of inline assembly, and that can be a little painful at times (e.g. three-operand 4x4 matrix operations like a = b * c take 12 registers to declare), but it is at least an improvement.

The Matrix and Vector classes showcase the GCC assembler patch, providing a set of template classes that produce fairly optimized results and are way easier to write and alter to your needs:

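vec_x dot( const vec_xyz rhs ) const
{
    vec128_t result, one;
    // Needs the patched GCC: "j" operands and named registers are not in stock GCC.
    asm(
        " ### vec_xyzw dot vec_xyzw ### \n"
        "vmul result, lhs, rhs \n"          // result = lhs * rhs, per field
        "vaddw.x one, vf00, vf00 \n"        // one.x = 1 (vf00.w is always 1)
        "vaddax.x ACC, vf00, result \n"     // ACC.x = result.x
        "vmadday.x ACC, one, result \n"     // ACC.x += result.y
        "vmaddz.x result, one, result \n"   // result.x = ACC.x + result.z
        : "=j result" (result),
          "=j one" (one)
        : "j lhs" (*this),
          "j rhs" (rhs)
    );
    return vec_x(result);
}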
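For readers without the patched compiler, the instruction sequence can be traced in portable C++. This is a sketch for checking the arithmetic only: the `Vec4` struct and the function name are invented here, and it models just the x field of the accumulator. It relies on vf00 being hardwired to (0, 0, 0, 1).

```cpp
// Hypothetical stand-in for a 128-bit VU register.
struct Vec4 { float x, y, z, w; };

// Traces the five-instruction dot product one step at a time.
inline float dot_xyz_reference(const Vec4& lhs, const Vec4& rhs)
{
    const Vec4 vf00 = {0.0f, 0.0f, 0.0f, 1.0f};   // constant register

    // vmul result, lhs, rhs : per-field multiply
    Vec4 result = {lhs.x * rhs.x, lhs.y * rhs.y, lhs.z * rhs.z, lhs.w * rhs.w};

    // vaddw.x one, vf00, vf00 : one.x = vf00.x + vf00.w = 1
    const float one_x = vf00.x + vf00.w;

    // vaddax.x ACC, vf00, result : ACC.x = vf00.x + result.x
    float acc_x = vf00.x + result.x;

    // vmadday.x ACC, one, result : ACC.x += one.x * result.y
    acc_x += one_x * result.y;

    // vmaddz.x result, one, result : result.x = ACC.x + one.x * result.z
    result.x = acc_x + one_x * result.z;

    return result.x;
}
```

Multiplying by `one` looks redundant, but it lets the multiply-accumulate instructions do the horizontal add across the x, y and z fields. The w field never enters the sum.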

VU Command Line preprocessor. As mentioned earlier, one of the newest tools to aid PS2 programming is VCL, the vector code optimizing preprocessor. It takes a single stream of VU instructions and:

  • Automatically pairs instructions into upper and lower streams.
  • Intelligently breaks code into looped sections.
  • Unrolls and interleaves loops, producing correct header and footer sections.
  • Inserts necessary nops between instructions.
  • Allows symbolic referencing of registers by assigning a free register to the symbol at first use. (The set of free regs is declared at the beginning of a piece of VCL code).
  • Tracks vector element usage based on the declared type – it can ensure a vector element that has been declared as an integer but held in a float is treated correctly.

No more writing VU code in Excel! It outputs pretty well optimized results that can be used as a starting point for hand coding. (It can also be run on existing code to see if any improvements can be made.)

VCL is not that intelligent yet (it will happily optimize complete rubbish). For the best results it’s worth learning how to code in a VCL friendly style, e.g.:

  • Instead of directly incrementing pointers:

    lqi vector, (address++)
    lqi normal, (address++)

    You should use offset addressing:

    lq vector, 0(address)
    lq normal, 1(address)
    iaddi address, address, 2
  • Make sure that all members of a vector type are accounted for, e.g. when calculating normal lighting only the xyz part of a vector is needed, so remember to set the w value to a constant in the preamble, thus breaking a dependency chain that prevents VCL from interleaving unrolled loop sections:

    sub.w normal, vf00, vf00

More of these techniques are in the VCL documentation. It’s really satisfying to be able to cut and paste blocks of code together to get the VU program you need and not need to worry about pairing instructions and inserting nops.

Built-in Profiling Registers. The EE Core, like all MIPS processors, has a number of performance registers built into Coprocessor 0 (the CPU control unit). The PerfTest class reads these registers and can print out a running commentary on the efficiency of any section of code you want to sample.

Performance Analyzer. SCE has just announced its hardware Performance Analyzer (PA). It’s a hardware device that samples activity on the buses and produces graphs, analysis and in-depth insights into your algorithms. Currently the development support offices are being fitted with these devices and your teams will be able to book consultation time with them.
