It's free to join Gamasutra!|Have a question? Want to know who runs this site? Here you go.|Targeting the game development market with your product or service? Get info on advertising here.||For altering your contact information or changing email subscription preferences.
Registered members can log in here.Back to the home page.    

Search articles, jobs, buyers guide, and more.

By Robin Green
Gamasutra
[Author's Bio]
September 26, 2001

Lifeforms

The Transformation List

TenThings Nobody Told You About PS2

VU Dataflow Designs

The Design

Printer Friendly Version

 

Letters to the Editor:
Write a letter
View all letters


Features

Procedural Rendering on Playstation 2

TenThings Nobody Told You About PS2

Before we start translating this Lifeform algorithm into a PS2 friendly design, I’d like to cover some more about the PS2. Later we’ll use these insights to re-order and tweak the algorithm to get some speed out of the machine.

1. You must design before coding.
Lots of people have said this about PS2 – you cannot just sit down and code away and expect high speed programs as a result. You have to plan and design your code around the hardware and that requires insight into how the machine works. Where do you get this insight from? This is where this paper comes in. The aim later is to present some of the boilerplate designs you can use as a starting point.

2. The compiler doesn’t make things easy for you.
Many of the problems with programming PS2 come from the limitations of the compiler. Each General Purpose Register in the Emotion Engine is 128-bits in length but the compiler only supports these as a special case type. Much of the data you need to pass around the machine comes in 128-bit packets (GIF tags, 4D float vectors, etc.) so you will spend a lot of time casting between different representations of the same data type, paying special attention to alignment issues. A lot of this confusion can be removed if you have access to well designed Packet and 3D Vector classes.

Additionally the inline assembler doesn’t have enough information to make good decisions about uploading and downloading VU0 Macro Mode registers and generating broadcast instructions on vector data types. There is a patch for the ee-gcc compiler by Tyler Daniel and Dylan Cuthbert that update the inline assembler to add register naming, access to broadcast fields and new register types which are used to good effect in our C++ Vector classes. It’s by no means perfect as you’re still limited to only 10 input and output registers, but it’s a significant advance.

3. All the hardware is memory mapped.
Ne
arly all of the basic tutorials I have seen for PS2 have started by telling you that, in order to get anything rendered on screen, you have to learn all about DMA tags, VIF tags and GIF tags, alignment, casting and enormous frustration before your “Hello World” program will work. The tutorials always seem to imply that the only way to access outboard hardware is through painstakingly structured DMA packets. This statement is not true, and it greatly complicates the process of learning PS2. In my opinion this is one of the reasons the PS2 is rejected as “hard to program”.

Much of this confusion comes from the lack of a detailed memory map of the PS2 in the documentation. Understandably, the designers were reticent to provide one as the machine was in flux at the time of writing (the memory layout is completely reconfigurable by the kernel at boot time) and they were scared of giving programmers bad information. Let’s change this.

All outboard registers and memory areas are freely accessible at fixed addresses. Digging through the headers you will come across a header called eeregs.h that holds the key. In here are the hard-coded addresses of most of the internals of the machine. First a note here about future proofing your programs. Accessing these registers directly in final production code is not advisable as it’s fully possible that the memory map could change with future versions of the PS2. These techniques are only outlined here for tinkering around and learning the system so you can prove to yourself there’s no magic here. Once you have grokked how the PS2 and the standard library functions work, it’s safest to stick to using the libraries.

Let’s take a look at a few of the values in the header and see what they mean:

     #define VU1_MICRO      ((volatile u_long *)(0xNNNNNNNN))
     #define VU1_MEM        ((volatile u_long128 *)(0xNNNNNNNN))

These two addresses are the start addresses of VU1 program and data memory if VU1 is not currently caclulating. Most tutorials paint VU1 as “far away”, a hands off device that’s unforgiving if you get a single instruction wrong and consequently hard to debug. Sure, the memory is unavailable if VU1 is running a program, but using these addresses you can dump the contents before and after running VU programs. Couple this knowledge with the DMA Disassembler and VCL, the vector code compiler, and VU programming without expensive proprietary tools and debuggers is not quite as scary as it seems.

#define D2_CHCR
#define D2_MADR
#define D2_QWC
#define D2_TADR
#define D2_ASR0
#define D2_ASR1

#define D3_CHCR
#define D3_MADR
#define D3_QWC
((volatile u_int *)(0xNNNNNNNN))
((volatile u_int *)(0xNNNNNNNN))
((volatile u_int *)(0xNNNNNNNN))
((volatile u_int *)(0xNNNNNNNN))
((volatile u_int *)(0xNNNNNNNN))
((volatile u_int *)(0xNNNNNNNN))

((volatile u_int *)(0xNNNNNNNN))
((volatile u_int *)(0xNNNNNNNN))
((volatile u_int *)(0xNNNNNNNN))

If you have only read the SCE libraries you may be under the impression that “Getting a DMA Channel” is an arcane and complicated process requiring a whole function call. Far from it. The DMA channels are not genuine programming abstractions, in reality they’re just a bank of memory mapped registers. The entries in the structure sceDmaChan map direc tly onto these addresses like a cookie cutter.

     #define GIF_FIFO      ((volatile u_long128 *)(0xNNNNNNNN))

The GIF FIFO is the doorway into the Graphics Synthesizer. You push qwords in here one after another and the GS generates polygons - simple as that. No need to use DMA to get your first program working, just program up a GIF Tag with some data and stuff it into this address.

This leads me to my favorite insight into the PS2…

4. The DMAC is just a Pigeon Hole Stuffer.
The DMA Controller (DMAC) is a very simple beast. In essence all it does is read a qword from a source address, write it to a destination address, increments one or both of these addresses, decrements a counter and loops. When you’re DMAing data from memory to the GIF all that’s happening is that the DMA chip is reading from the source address and pushing the quads through the GIF_FIFO we mentioned earlier – that DMA Channel has a hard-wired destination address.

5. Myth: VU code is hard.
VU code isn’t hard. Fast VU code is hard, but there are now some tools to help you get 80 percent of the way there for a lot less effort.

VCL (Vector Command Line, as opposed to the interactive graphic version) is a tool that preprocesses a single stream of VU code (no paired instructions necessary), analyses it for loop blocks and control flow, pairs and rearranges instructions, opens loops and interleaves the result to give pretty efficient code. For example, take this simplest of VU programs that takes a block of vectors and in-place matrix multiplies them by a fixed matrix, divides by W and integerizes the value:

; test.vcl
; simplest vcl program ever
.init_vf_all
.init_vi_all
--enter
--endenter
.name start_here
start_here:
   ilw.x srce_ptr 0(vi00)
   ilw.x counter, 1(vi00)
   iadd counter, counter, srce_ptr
   lq v_transf0 2(vi00)
   lq v_transf1 3(vi00)
   lq v_transf2 4(vi00)
   lq v_transf3 5(vi00)
loop:
   --LoopCS 6, 1
   lq vec, 0(srce_ptr)
   mulax.xyzw ACC, v_transf0, vec
   madday.xyzw ACC, v_transf1, vec
   maddaz.xyzw ACC, v_transf2, vec
   maddw.xyzw vec, v_transf3, vf00
   div Q, vf0w, vecw
   mulq.xyzw vec, vec, Q
   ftoi4.xyzw vec, vec
   sq vec, 0(srce_ptr)
   iaddiu srce_ptr,srce_ptr,1
   ibne srce_ptr, counter, loop
--exit
--endexit
. . .

VCL takes the source code, pairs the instructions and unwrap the loop to this seven instruction inner loop (with entry and exit blocks not shown):

loop__MAIN_LOOP:
; [0,7) size=7 nU=6 nL=7 ic=13 [lin=7 lp=7]
 maddw VF09,VF04,VF00w   lq.xyz  VF08,0(VI01)
 nop                     sq      VF07,(0)-(5*(1))(VI01)
 ftoi4 VF07,VF06         iaddiu  VI01,VI01,1
 mulq VF06,VF05,Q        move    VF05,VF10
 mulax ACC,VF01,VF08x    div     Q,VF00w,VF09w
 madday ACC,VF02,VF08y   ibne    VI01,VI02,loop__MAIN_LOOP
 maddaz ACC,VF03,VF08z   move    VF10,VF09

6. Myth: Synchronization is complicated.
The problem with synchronization is that much of it is built into the hardware and the documentation isn’t clear about what’s happening and when. Synchronization points are described variously as “stall states” or hidden behind descriptions of queues and scattered all over the documentation. Nowhere is there a single list of “How to force a wait for X” techniques.

The first point to make is that complicated as general purpose synchronization is, when we are rendering to screen we are dealing with a more limited problem: you only need to keep things in sync once a frame. All your automatic processes can kick off and be fighting for resources during a frame, but as soon as you reach the end of rendering the frame then everything must be finished. You are only dealing with short bursts of synchronization.

The PS2 has three main systems for synchronization:

  • synchronization within the EE Core
  • synchronization between the EE Core and external devices
  • synchronization between external devices.

This whole area is worthy of a paper in itself as much of this information is spread around the documentation. Breaking the problem down into these three areas sheds allows you to grok the whole system. Briefly summarizing:

Within the EE Core we have sync.l and sync.e instructions that guarantee that results are finished before continuing with execution.

Between the EE Core and external devices (VIF, GIF, DMAC, etc) we have a variety of tools. Many events can generate interrupts upon completion, the VIF has a mark instruction that sets the value of a register that can be read by the EE Core allowing the EE Core to know that a certain point has been reached in a DMA stream and we have the memory mapped registers that contain status bits that can be polled.

Between external devices there is a well defined set of priorities that cause execution orders to be well defined. The VIF can also be forced to wait using flush, flushe and flusha instructions. These are the main ones we’ll be using in this tutorial.

7. Myth: Scratchpad is for speed.
The Scratchpad is the 16KB area of memory that is actually on-chip in the EE Core. Using some MMU shenanigans at boot up time, the EE Core makes Scratchpad RAM (SPR) appear to be part of the normal memory map. The thing to note about SPR is that reads and writes to SPR are uncached and memory accesses don’t go through the memory bus – it’s on-chip and physically sitting next to (actually inside) the CPU.

You could think of scratchpad as a fast area of memory, like the original PSX, but real world timings show that it’s not that much faster than Uncached Accelerated memory for sequential work or in-cache data for random work. The best way to think of SPR is as a place to work while the data bus is busy - something like a playground surrounded by roads with heavy traffic.

Picture this: Your program has just kicked off a huge DMA chain of events that will automatically upload and execute VU programs and move information through the system. The DMAC is moving information from unit to unit over the Memory Bus in 8-qword chunks, checking for interruptions every tick and CPU has precedence. The last thing the DMAC needs is to be interrupted every 8 clock cycles with the CPU needing to use the bus for more data. This is why the designers gave you an area of memory to play with while this happens. Sure, the Instruction and Data caches play their part but they are primarily there to aid throughput of instructions.

Scratchpad is there to keep you off the data bus – use it to batch up memory writes and move the data to main memory using burst-mode DMA transfers using the fromSPR DMA channel.

8. There is no such thing as “The Pipeline”.
The best way to think about the rendering hardware in PS2 is a series of optimized programs that run over your data and pipe the resulting polygon lists to the GS. Within a frame there may be many different renderers – one for unclipped models, one for procedural models, one for specular models, one for subdivision surfaces, etc.

As each renderer is less than 16KB of VU code they are very cheap to upload compared to the amount of polygon data they will be generating. Program uploads can be embedded inside DMA chains to complete the automation process, e.g.

 

9. Speed is all about the Bus.
This has been said many times before, but it bears repeating. The theoretical speed limits of the GS are pretty much attainable, but only by paying attention to the bus speed. The GS can kick one triangle every clock tick (using tri-strips) at 150MHz. This gives us a theoretical upper limit of:

     150 million verts per second = 2.5 million verts / frame at 60Hz

Given that each of these polygons will be flat shaded the result isn’t very interesting. We will need to factor in a perspective transform, clipping and lighting which are done on the VUs, which run at 300MHz. The PS2 FAQ says these operations can take 15 – 20 cycles per vertex typically, giving us a throughput of:

      5 million verts / 20 cycles per vertex
           = 250,000 verts per frame
           = 15 million verts per second
      5 million verts / 15 cycles per vertex
           = 333,000 verts per frame
           = 20 million verts per second

Notice the difference here. Just by removing five cycles per vertex we get a huge increase in output. This is the reason we need different renderers for every situation – each renderer can shave off precious cycles-per-vertex by doing only the work necessary.

This is also the reason we have two VUs – often VU1 is often described as the “rendering” VU and VU0 as the “everything else” renderer, but this is not necessarily so. Both can be transforming vertices but only one can be feeding the GIF, and this explains the Memory FIFO you can set up: one VU is feeding the GS while the other is filling the FIFO. It also explains why we have two rendering contexts in the GS, one for each of the two input streams.

10. There are new tools to help you.
Unlike the early days of the PS2 where everything had to be painstakingly pieced together from the manuals and example code, lately there are some new tools to help you program PS2. Most of these are freely available for registered developers from the PS2 support websites and nearly all come with source.

DMA Disassembler. This tool, from SCEE’s James Russell, takes a completes DMA packet, parses it and generates a printout of how the machine will interpret the data block when it is sent. It can report errors in the chain and provides an excellent visual report of your DMA chain.

Packet Libraries. Built by Tyler Daniel, this set of classes allows easy construction of DMA packets, either at fixed locations in memory or in dynamically allocated buffers. The packet classes are styled after insertion-only STL containers and know how to add VIF tags, create all types of DMA packet and will calculate qword counts for you.

Vector Libraries and GCC patch. The GCC inline assembler patch adds a number of new features to the inline assembler:

  • Introduces a new j register type for 128-bit vector registers, allowing the compiler to know that these values are to be assigned to VU0 Macro Mode registers
  • Allows register naming, so more descriptive symbols can be used.
  • Allows access to fields in VU0 broadcast instructions allowing you to, say, template a function across broadcast fields (x, xy, xyz, xyzw)
  • No more volatile inline assembly, the compiler is free to reorder instructions as the context is properly described.
  • No more explicit loading and moving registers to and from VU registers, the compiler is free to keep values in VU registers as long as possible.
  • No need to use explicit registers, unless you want to. The compiler can assign free registers

The patch is not perfect as there is still a limit to 10 input and output registers per section of inline assembly, and that can be a little painful at times (i.e. three operand 4x4 matrix operations like a = b * c take 12 registers to declare), but it is at least an improvement.

The Matrix and Vector classes showcase the GCC assembler patch, providing a set of template classes that produce fairly optimized results, plus being way easier to write and alter to your needs:

vec_x dot( const vec_xyz rhs ) const
{
  vec128_t result, one;
  asm(
    " ### vec_xyzw dot vec_xyzw ### \n"
    "vmul          result, lhs, rhs \n"
    "vaddw.x     one, vf00, vf00 \n"
    "vaddax.x    ACC, vf00, result \n"
    "vmadday.x ACC, one, result \n"
    "vmaddz.x   result, one, result \n"
    : "=j result" (result),
    "=j one"   (one)
    : "j lhs"   (*this),
    "j rhs"    (rhs)
  );
  return vec_x(result);
}

VU Command Line preprocessor. As mentioned earlier, one of the newest tools to aid PS2 programming is VCL, the vector code optimizing preprocessor. It takes a single stream of VU instructions and:

  • Automatically pairs instructions into upper and lower streams.
  • Intelligently breaks code into looped sections.
  • Unrolls and interleaves loops, producing correct header and footer sections. . Inserts necessary nops between instructions.
  • Allows symbolic referencing of registers by assigning a free register to the symbol at first use. (The set of free regs is declared at the beginning of a piece of VCL code).
  • Tracks vector element usage based on the declared type – it can ensure a vector element that has been declared as an integer but held in a float is treated correctly.

No more writing VU code in Excel! It outputs pretty well optimized results that can be used as a starting point for hand coding (It can also be run on already existing code to see if any improvements can be made).

VCL is not that intelligent yet (it will happily optimize complete rubbish). For the best results it’s worth learning how to code in a VCL friendly style, e.g.:

  • Instead of directly incrementing pointers:

    lqi vector, (address++)
    lqi normal, (address++)


    You should use offset addressing:

    lq vector, 0(address)
    lq normal, 1(address)
    iaddi address, address, 2


  • Make sure that all members of a vector type are accounted for, e.g. when calculating normal lighting only the xyz part of a vector is needed, so remember to set the w value to a constant in the preamble, thus breaking a dependency chain that prevents VCL from interleaving unrolled loop sections:

    sub.w normal, normal, vf00

More of these techniques are in the VCL documentation. It’s really satisfying to be able to cut and paste blocks of code together to get the VU program you need and not need to worry about pairing instructions and inserting nops.

Built-in Profiling Registers. The EE Core, like all MIPS processors, has a number of performance registers built into Coprocessor 0 (the CPU control unit). The PerfTest class reads these registers and can print out a running commentary on the efficiency of any section of code you want to sample.

Performance Analyzer. SCE have just announced it’s hardware Performance Analyzer (PA). It’s a hardware device that samples activity on the busses and produces graphs, analysis and in depth insights into your algorithms. Currently the development support offices are being fitted with these devices and your teams will be able to book consultation time with them.

______________________________________________________

VU Dataflow Designs


join | contact us | advertise | write | my profile
news | features | companies | jobs | resumes | education | product guide | projects | store



Copyright © 2002 CMP Media LLC. All rights reserved.
privacy policy | terms of service