Procedural Rendering on Playstation 2
Ten Things Nobody Told You About PS2
Before we start translating this Lifeform algorithm into a PS2 friendly
design, I’d like to cover some more about the PS2. Later we’ll use
these insights to re-order and tweak the algorithm to get some speed
out of the machine.
1. You must design before coding.
Lots of people have said this about PS2 – you cannot just sit down
and code away and expect high speed programs as a result. You have
to plan and design your code around the hardware and that requires
insight into how the machine works. Where do you get this insight
from? This is where this paper comes in. The aim later is to present
some of the boilerplate designs you can use as a starting point.
2. The compiler doesn’t make things easy for you.
Many of the problems with programming PS2 come from the limitations
of the compiler. Each General Purpose Register in the Emotion Engine
is 128-bits in length but the compiler only supports these as a
special case type. Much of the data you need to pass around the
machine comes in 128-bit packets (GIF tags, 4D float vectors, etc.)
so you will spend a lot of time casting between different representations
of the same data type, paying special attention to alignment issues.
A lot of this confusion can be removed if you have access to well
designed Packet and 3D Vector classes.
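To give a feel for what such a class hides, here is a minimal host-side sketch of a 16-byte aligned quadword type. The name Qword and its helper functions are hypothetical, made up for this illustration, and are not taken from any SCE or third-party library. It assumes 32-bit IEEE floats and uses memcpy instead of pointer casts to move between representations.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>

// Hypothetical 128-bit "quadword": one 16-byte aligned block that can be
// read back as floats or as raw 32-bit words without pointer casts.
struct alignas(16) Qword {
    unsigned char bytes[16];

    // Pack four floats into the qword (memcpy avoids aliasing problems).
    static Qword from_floats(float x, float y, float z, float w) {
        Qword q;
        const float f[4] = {x, y, z, w};
        std::memcpy(q.bytes, f, sizeof f);
        return q;
    }

    // Read the i-th 32-bit lane back as a float.
    float lane_f32(std::size_t i) const {
        assert(i < 4);
        float f;
        std::memcpy(&f, bytes + i * sizeof f, sizeof f);
        return f;
    }

    // Read the i-th 32-bit lane as a raw bit pattern.
    std::uint32_t lane_u32(std::size_t i) const {
        assert(i < 4);
        std::uint32_t u;
        std::memcpy(&u, bytes + i * sizeof u, sizeof u);
        return u;
    }
};

static_assert(sizeof(Qword) == 16, "a qword is 128 bits");
static_assert(alignof(Qword) == 16, "qwords must be 16-byte aligned");
```

The static_asserts turn the size and alignment rules into compile-time checks, so a mistake shows up at build time rather than as a misaligned transfer.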
Additionally, the inline assembler doesn’t have enough information to make good
decisions about uploading and downloading VU0 Macro Mode registers
and generating broadcast instructions on vector data types. There
is a patch for the ee-gcc compiler by Tyler Daniel and Dylan Cuthbert
that updates the inline assembler to add register naming, access
to broadcast fields and new register types which are used to good
effect in our C++ Vector classes. It’s by no means perfect as you’re
still limited to only 10 input and output registers, but it’s a
significant advance.
3. All the hardware is memory mapped.
Nearly all of the basic tutorials I have seen for PS2 have started by telling
you that, in order to get anything rendered on screen, you have
to learn all about DMA tags, VIF tags and GIF tags, alignment, casting
and enormous frustration before your “Hello World” program will
work. The tutorials always seem to imply that the only way to access
outboard hardware is through painstakingly structured DMA packets.
This statement is not true, and it greatly complicates the process
of learning PS2. In my opinion this is one of the reasons the PS2
is rejected as “hard to program”.
Much of this confusion comes from the lack of a detailed memory map of
the PS2 in the documentation. Understandably, the designers were
reluctant to provide one as the machine was in flux at the time of
writing (the memory layout is completely reconfigurable by the kernel
at boot time) and they were scared of giving programmers bad information.
Let’s change this.
All outboard registers and memory areas are freely accessible at fixed
addresses. Digging through the headers you will come across a header
called eeregs.h that holds the key. In here are the hard-coded addresses
of most of the internals of the machine. First a note here about
future proofing your programs. Accessing these registers directly
in final production code is not advisable as it’s fully possible
that the memory map could change with future versions of the PS2.
These techniques are only outlined here for tinkering around and
learning the system so you can prove to yourself there’s no magic
here. Once you have grokked how the PS2 and the standard library
functions work, it’s safest to stick to using the libraries.
Let’s take a look at a few of the values in the header and see what they
mean:
#define VU1_MICRO ((volatile u_long *)(0xNNNNNNNN))
#define VU1_MEM   ((volatile u_long128 *)(0xNNNNNNNN))
These two addresses are the start addresses of VU1 program and data memory,
which you can access when VU1 is not currently calculating. Most tutorials paint VU1 as
“far away”, a hands off device that’s unforgiving if you get a single
instruction wrong and consequently hard to debug. Sure, the memory
is unavailable if VU1 is running a program, but using these addresses
you can dump the contents before and after running VU programs.
Couple this knowledge with the DMA Disassembler and VCL, the vector
code compiler, and VU programming without expensive proprietary
tools and debuggers is not quite as scary as it seems.
#define
D2_CHCR
#define D2_MADR
#define D2_QWC
#define D2_TADR
#define D2_ASR0
#define D2_ASR1
#define D3_CHCR
#define D3_MADR
#define D3_QWC |
((volatile
u_int *)(0xNNNNNNNN))
((volatile u_int *)(0xNNNNNNNN))
((volatile u_int *)(0xNNNNNNNN))
((volatile u_int *)(0xNNNNNNNN))
((volatile u_int *)(0xNNNNNNNN))
((volatile u_int *)(0xNNNNNNNN))
((volatile u_int *)(0xNNNNNNNN))
((volatile u_int *)(0xNNNNNNNN))
((volatile u_int *)(0xNNNNNNNN)) |
If you have only read the SCE libraries you may be under the impression
that “Getting a DMA Channel” is an arcane and complicated process
requiring a whole function call. Far from it. The DMA channels are
not genuine programming abstractions; in reality they’re just a
bank of memory mapped registers. The entries in the structure sceDmaChan
map directly onto these addresses like a cookie cutter.
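The cookie-cutter idea can be tried out on a host machine. In the sketch below the struct DmaChannelRegs and the functions get_channel, program_transfer and raw_word are hypothetical names invented for this illustration. The layout, with each 32-bit register in the first word of its own 16-byte slot, is an assumption for the example and not a copy of the real sceDmaChan definition. A plain object stands in for the register bank so the code runs anywhere.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>

// Hypothetical register bank. The field names follow the D2_* defines;
// the one-register-per-16-byte-slot spacing is an assumption of this sketch.
struct DmaChannelRegs {
    volatile std::uint32_t chcr; std::uint32_t pad0[3];
    volatile std::uint32_t madr; std::uint32_t pad1[3];
    volatile std::uint32_t qwc;  std::uint32_t pad2[3];
    volatile std::uint32_t tadr; std::uint32_t pad3[3];
};

static_assert(offsetof(DmaChannelRegs, madr) == 16, "madr in second slot");
static_assert(offsetof(DmaChannelRegs, qwc) == 32, "qwc in third slot");
static_assert(sizeof(DmaChannelRegs) == 64, "four 16-byte slots");

// "Getting a channel" is nothing more than a cast of the bank's base address.
inline DmaChannelRegs* get_channel(void* bank_base) {
    return static_cast<DmaChannelRegs*>(bank_base);
}

// Set the source address and the number of qwords to move.
inline void program_transfer(DmaChannelRegs* ch, std::uint32_t source_addr,
                             std::uint32_t qwords) {
    ch->madr = source_addr;
    ch->qwc = qwords;
}

// Dump one raw 32-bit word of the bank, as you might when poking at memory.
inline std::uint32_t raw_word(const DmaChannelRegs& bank, std::size_t index) {
    assert(index < sizeof(DmaChannelRegs) / sizeof(std::uint32_t));
    std::uint32_t w;
    std::memcpy(&w, reinterpret_cast<const unsigned char*>(&bank) + index * sizeof w,
                sizeof w);
    return w;
}
```

On real hardware the base pointer would be one of the fixed addresses from eeregs.h; here it is just the address of a local DmaChannelRegs object.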
#define GIF_FIFO ((volatile u_long128 *)(0xNNNNNNNN))
The GIF FIFO is the doorway into the Graphics Synthesizer. You push
qwords in here one after another and the GS generates polygons -
simple as that. No need to use DMA to get your first program working,
just program up a GIF Tag with some data and stuff it into this
address.
This leads me to my favorite insight into the PS2…
4. The DMAC is just a Pigeon Hole Stuffer.
The DMA Controller (DMAC) is a very simple beast. In essence all
it does is read a qword from a source address, write it to a destination
address, increment one or both of these addresses, decrement a
counter and loop. When you’re DMAing data from memory to the GIF
all that’s happening is that the DMA chip is reading from the source
address and pushing the qwords through the GIF_FIFO we mentioned
earlier – that DMA Channel has a hard-wired destination address.
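The loop described above is small enough to model on a host machine. This is only a toy model: the Fifo and DmaChannel types are hypothetical and ignore everything the real DMAC does with tags, chains and bus arbitration. What it does show is the read, write, increment, decrement, loop cycle with a destination that never moves.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

using Qword = std::array<std::uint32_t, 4>;  // 128 bits of payload

// Stand-in for a memory-mapped FIFO: every write lands at the same "address"
// and is consumed by whatever sits on the other side.
struct Fifo {
    std::vector<Qword> received;
    void write(const Qword& q) { received.push_back(q); }
};

// Hypothetical model of one channel with a hard-wired destination.
struct DmaChannel {
    const Qword* source;   // current source address
    std::size_t count;     // qwords still to transfer
    Fifo* destination;     // fixed destination, never incremented

    void run() {
        while (count != 0) {
            destination->write(*source);  // read a qword, write it out
            ++source;                     // increment the source address
            --count;                      // decrement the counter and loop
        }
    }
};
```

Setting source and count and then calling run() is the whole transfer in this model; the FIFO ends up holding the qwords in source order.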
5. Myth: VU code is hard.
VU code isn’t hard. Fast VU code is hard, but there are now some
tools to help you get 80 percent of the way there for a lot less
effort.
VCL (Vector Command Line, as opposed to the interactive graphic version)
is a tool that preprocesses a single stream of VU code (no paired
instructions necessary), analyses it for loop blocks and control
flow, pairs and rearranges instructions, unrolls loops and interleaves
the result to give pretty efficient code. For example, take this
simplest of VU programs that takes a block of vectors, matrix multiplies
them in place by a fixed matrix, divides by W and integerizes
the value:
; test.vcl
; simplest vcl program ever
.init_vf_all
.init_vi_all
--enter
--endenter
.name start_here
start_here:
ilw.x srce_ptr, 0(vi00)
ilw.x counter, 1(vi00)
iadd counter, counter, srce_ptr
lq v_transf0, 2(vi00)
lq v_transf1, 3(vi00)
lq v_transf2, 4(vi00)
lq v_transf3, 5(vi00)
loop:
--LoopCS 6, 1
lq vec, 0(srce_ptr)
mulax.xyzw ACC, v_transf0, vec
madday.xyzw ACC, v_transf1, vec
maddaz.xyzw ACC, v_transf2, vec
maddw.xyzw vec, v_transf3, vf00
div Q, vf0w, vecw
mulq.xyzw vec, vec, Q
ftoi4.xyzw vec, vec
sq vec, 0(srce_ptr)
iaddiu srce_ptr,srce_ptr,1
ibne srce_ptr, counter, loop
--exit
--endexit
. . .
VCL takes the source code, pairs the instructions and unrolls the loop
into this seven instruction inner loop (with entry and exit blocks
not shown):
loop__MAIN_LOOP:
; [0,7) size=7 nU=6 nL=7 ic=13 [lin=7 lp=7]
maddw VF09,VF04,VF00w      lq.xyz VF08,0(VI01)
nop                        sq VF07,(0)-(5*(1))(VI01)
ftoi4 VF07,VF06            iaddiu VI01,VI01,1
mulq VF06,VF05,Q           move VF05,VF10
mulax ACC,VF01,VF08x       div Q,VF00w,VF09w
madday ACC,VF02,VF08y      ibne VI01,VI02,loop__MAIN_LOOP
maddaz ACC,VF03,VF08z      move VF10,VF09
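When checking VU output it helps to have the same per-vertex work written as plain scalar code on the host. The sketch below is one reading of the VCL listing, not the author's code: Vec4, Mat4, to_fixed4 and transform_vertex are hypothetical names. It assumes the four v_transf registers hold matrix columns, that the input w is treated as 1 (the listing multiplies the last column by vf00's w), and that ftoi4 produces a fixed-point value with 4 fractional bits by truncating toward zero.

```cpp
#include <array>
#include <cassert>
#include <cstdint>

struct Vec4 { float x, y, z, w; };

// Four matrix columns, matching v_transf0..v_transf3 in the VCL listing.
using Mat4 = std::array<Vec4, 4>;

// Assumed ftoi4 behaviour: scale by 16 and truncate toward zero.
inline std::int32_t to_fixed4(float f) {
    return static_cast<std::int32_t>(f * 16.0f);
}

// One loop iteration: transform, divide by w, convert to fixed point.
// As in the VU code, the w of the input vertex is ignored and taken as 1.
inline std::array<std::int32_t, 4> transform_vertex(const Mat4& m, const Vec4& v) {
    const Vec4 t{
        m[0].x * v.x + m[1].x * v.y + m[2].x * v.z + m[3].x,
        m[0].y * v.x + m[1].y * v.y + m[2].y * v.z + m[3].y,
        m[0].z * v.x + m[1].z * v.y + m[2].z * v.z + m[3].z,
        m[0].w * v.x + m[1].w * v.y + m[2].w * v.z + m[3].w,
    };
    const float q = 1.0f / t.w;  // div Q, vf0w, vecw
    return {to_fixed4(t.x * q), to_fixed4(t.y * q),
            to_fixed4(t.z * q), to_fixed4(t.w * q)};
}
```

Running the same input block through this function and through the VU program, then comparing a dump of VU data memory against the results, is a cheap way to check a new renderer.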
6. Myth: Synchronization is complicated.
The problem with synchronization is that much of it is built into
the hardware and the documentation isn’t clear about what’s happening
and when. Synchronization points are described variously as “stall
states” or hidden behind descriptions of queues and scattered all
over the documentation. Nowhere is there a single list of “How to
force a wait for X” techniques.
The first point to make is that, complicated as general purpose synchronization
is, when we are rendering to screen we are dealing with a more limited
problem: you only need to keep things in sync once a frame. All
your automatic processes can kick off and be fighting for resources
during a frame, but as soon as you reach the end of rendering the
frame then everything must be finished. You are only dealing with
short bursts of synchronization.
The PS2 has three main systems for synchronization:
- synchronization within the EE Core
- synchronization between the EE Core and external devices
- synchronization between external devices.
This whole area is worthy of a paper in itself as much of this information
is spread around the documentation. Breaking the problem down into
these three areas allows you to grok the whole system. Briefly
summarizing:
Within the EE Core we have sync.l and sync.e instructions that guarantee
that results are finished before continuing with execution.
Between the EE Core and external devices (VIF, GIF, DMAC, etc.) we have a
variety of tools. Many events can generate interrupts upon completion.
The VIF has a mark instruction that sets the value of a register
that can be read by the EE Core, letting the EE Core know that
a certain point has been reached in a DMA stream. We also have the
memory mapped registers that contain status bits that can be polled.
Between external devices there is a fixed set of priorities that
makes execution order well defined. The VIF can also be forced
to wait using flush, flushe and flusha instructions. These are the
main ones we’ll be using in this tutorial.
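Of the EE-to-device techniques, polling a status bit is the easiest to show in code. This is a generic sketch: the function name wait_until_idle, the mask value and the poll limit are all made up for the example, and no real PS2 register or bit position is implied. The register is read through a volatile pointer so the compiler cannot hoist the read out of the loop.

```cpp
#include <cassert>
#include <cstdint>

// Spin until every bit in busy_mask reads as zero, or give up after
// max_polls reads. Returns true if the device was seen idle.
inline bool wait_until_idle(const volatile std::uint32_t* status_reg,
                            std::uint32_t busy_mask,
                            std::uint32_t max_polls) {
    for (std::uint32_t i = 0; i < max_polls; ++i) {
        if ((*status_reg & busy_mask) == 0) {
            return true;  // device finished
        }
    }
    return false;  // still busy: the caller decides what to do next
}
```

The poll limit is there so a wedged device turns into a reportable failure rather than a silent hang at the end of the frame.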
7. Myth: Scratchpad is for speed.
The Scratchpad is the 16KB area of memory that is actually on-chip
in the EE Core. Using some MMU shenanigans at boot up time, the
EE Core makes Scratchpad RAM (SPR) appear to be part of the normal
memory map. The thing to note about SPR is that reads and writes
to SPR are uncached and memory accesses don’t go through the memory
bus – it’s on-chip and physically sitting next to (actually inside)
the CPU.
You could think of scratchpad as a fast area of memory, like the original
PSX, but real world timings show that it’s not that much faster
than Uncached Accelerated memory for sequential work or in-cache
data for random work. The best way to think of SPR is as a place
to work while the data bus is busy - something like a playground
surrounded by roads with heavy traffic.
Picture this: Your program has just kicked off a huge DMA chain of events
that will automatically upload and execute VU programs and move
information through the system. The DMAC is moving information from
unit to unit over the Memory Bus in 8-qword chunks, checking for
interruptions every tick, and the CPU has precedence. The last thing
the DMAC needs is to be interrupted every 8 clock cycles with the
CPU needing to use the bus for more data. This is why the designers
gave you an area of memory to play with while this happens. Sure,
the Instruction and Data caches play their part but they are primarily
there to aid throughput of instructions.
Scratchpad is there to keep you off the data bus – use it to batch up memory
writes and move the data to main memory in burst-mode DMA transfers
over the fromSPR DMA channel.
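The batching pattern can be sketched on a host machine with an ordinary array standing in for the Scratchpad. ScratchWriter is a hypothetical class written for this illustration. The 16KB size comes from the text; the bulk copy in flush() is only a stand-in for a burst transfer on the fromSPR channel, which this sketch does not attempt to model.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical staging buffer: collect many small writes locally, then move
// them to "main memory" in one large transfer.
class ScratchWriter {
public:
    explicit ScratchWriter(std::vector<std::uint32_t>& main_memory)
        : main_(main_memory) {}

    // Queue one word; flush first if the staging area is full.
    void write(std::uint32_t value) {
        if (used_ == scratch_.size()) {
            flush();
        }
        scratch_[used_++] = value;
    }

    // Stand-in for one burst transfer out of the staging area.
    void flush() {
        if (used_ == 0) {
            return;
        }
        main_.insert(main_.end(), scratch_.begin(), scratch_.begin() + used_);
        ++bursts_;
        used_ = 0;
    }

    std::size_t bursts() const { return bursts_; }

private:
    static constexpr std::size_t kWords = 16 * 1024 / sizeof(std::uint32_t);
    std::array<std::uint32_t, kWords> scratch_{};
    std::size_t used_ = 0;
    std::size_t bursts_ = 0;
    std::vector<std::uint32_t>& main_;
};
```

Thousands of individual writes collapse into a handful of bus transactions, which is the point of the playground analogy.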
8. There is no such thing as “The Pipeline”.
The best way to think about the rendering hardware in PS2 is a series
of optimized programs that run over your data and pipe the resulting
polygon lists to the GS. Within a frame there may be many different
renderers – one for unclipped models, one for procedural models,
one for specular models, one for subdivision surfaces, etc.
As each renderer is less than 16KB of VU code they are very cheap to
upload compared to the amount of polygon data they will be generating.
Program uploads can be embedded inside DMA chains to complete the
automation process.
9. Speed is all about the Bus.
This has been said many times before, but it bears repeating. The
theoretical speed limits of the GS are pretty much attainable, but
only by paying attention to the bus speed. The GS can kick one triangle
every clock tick (using tri-strips) at 150MHz. This gives us a theoretical
upper limit of:
150 million verts per second = 2.5 million verts / frame at 60Hz
Given that each of these polygons will be flat shaded the result isn’t
very interesting. We will need to factor in a perspective transform,
clipping and lighting which are done on the VUs, which run at 300MHz,
or 5 million cycles per frame at 60Hz. The PS2 FAQ says these operations
can take 15 – 20 cycles per vertex typically, giving us a throughput of:
5 million cycles / 20 cycles per vertex = 250,000 verts per frame = 15 million verts per second
5 million cycles / 15 cycles per vertex = 333,000 verts per frame = 20 million verts per second
Notice the difference here. Just by removing five cycles per vertex we
get a huge increase in output. This is the reason we need different
renderers for every situation – each renderer can shave off precious
cycles-per-vertex by doing only the work necessary.
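The arithmetic is easy to keep around as a helper for budgeting a renderer. The function names below are hypothetical; the 300MHz clock, the 60Hz frame rate and the 15 to 20 cycle costs are the figures quoted above. Integer division is used, so the 15-cycle case comes out as 333,333 and not the rounded 333,000.

```cpp
#include <cassert>
#include <cstdint>

// VU cycles available in one frame.
constexpr std::int64_t cycles_per_frame(std::int64_t clock_hz, std::int64_t fps) {
    return clock_hz / fps;
}

// Vertices one VU can process per frame at a given per-vertex cost.
constexpr std::int64_t verts_per_frame(std::int64_t clock_hz, std::int64_t fps,
                                       std::int64_t cycles_per_vertex) {
    return cycles_per_frame(clock_hz, fps) / cycles_per_vertex;
}

static_assert(cycles_per_frame(300000000, 60) == 5000000, "5 million cycles");
static_assert(verts_per_frame(300000000, 60, 20) == 250000, "20 cycles per vertex");
static_assert(verts_per_frame(300000000, 60, 15) == 333333, "15 cycles per vertex");
```

Plugging in the measured cycle count of a new renderer gives its vertex budget per frame before any data is pushed through the machine.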
This is also the reason we have two VUs – VU1 is often described
as the “rendering” VU and VU0 as the “everything else” VU,
but this is not necessarily so. Both can be transforming vertices
but only one can be feeding the GIF, and this explains the Memory
FIFO you can set up: one VU is feeding the GS while the other is
filling the FIFO. It also explains why we have two rendering contexts
in the GS, one for each of the two input streams.
10. There are new tools to help you.
Unlike the early days of the PS2 where everything had to be painstakingly
pieced together from the manuals and example code, lately there
are some new tools to help you program PS2. Most of these are freely
available for registered developers from the PS2 support websites
and nearly all come with source.
DMA Disassembler. This tool, from SCEE’s James Russell, takes a
complete DMA packet, parses it and generates a printout of how
the machine will interpret the data block when it is sent. It can
report errors in the chain and provides an excellent visual report
of your DMA chain.
Packet Libraries. Built by Tyler Daniel, this set of classes allows
easy construction of DMA packets, either at fixed locations in memory
or in dynamically allocated buffers. The packet classes are styled
after insertion-only STL containers and know how to add VIF tags,
create all types of DMA packet and will calculate qword counts for
you.
Vector Libraries and GCC patch. The GCC inline assembler patch adds
a number of new features to the inline assembler:
- Introduces a new j register type for 128-bit vector registers,
allowing the compiler to know that these values are to be assigned
to VU0 Macro Mode registers.
- Allows register naming, so more descriptive symbols can be used.
- Allows access to fields in VU0 broadcast instructions allowing you to,
say, template a function across broadcast fields (x, xy, xyz, xyzw).
- No more volatile inline assembly; the compiler is free to reorder
instructions as the context is properly described.
- No more explicit loading and moving registers to and from VU registers;
the compiler is free to keep values in VU registers as long as possible.
- No need to use explicit registers, unless you want to. The compiler
can assign free registers.
The patch is not perfect as there is still a limit of 10 input and output
registers per section of inline assembly, and that can be a little
painful at times (e.g. three operand 4x4 matrix operations like
a = b * c take 12 registers to declare), but it is at least an improvement.
The Matrix and Vector classes showcase the GCC assembler patch, providing
a set of template classes that produce fairly optimized results
and are far easier to write and alter to your needs:
vec_x dot( const vec_xyz rhs ) const
{
    vec128_t result, one;
    asm(
        " ### vec_xyzw dot vec_xyzw ### \n"
        "vmul result, lhs, rhs \n"
        "vaddw.x one, vf00, vf00 \n"
        "vaddax.x ACC, vf00, result \n"
        "vmadday.x ACC, one, result \n"
        "vmaddz.x result, one, result \n"
        : "=j result" (result), "=j one" (one)
        : "j lhs" (*this), "j rhs" (rhs)
    );
    return vec_x(result);
}
VU Command Line preprocessor. As mentioned earlier, one of the
newest tools to aid PS2 programming is VCL, the vector code optimizing
preprocessor. It takes a single stream of VU instructions and:
- Automatically pairs instructions into upper and lower streams.
- Intelligently breaks code into looped sections.
- Unrolls and interleaves loops, producing correct header and footer
sections.
- Inserts necessary nops between instructions.
- Allows symbolic referencing of registers by assigning a free register
to the symbol at first use. (The set of free regs is declared
at the beginning of a piece of VCL code).
- Tracks vector element usage based on the declared type – it can
ensure a vector element that has been declared as an integer but
held in a float is treated correctly.
No more writing VU code in Excel! It outputs pretty well optimized
results that can be used as a starting point for hand coding (it
can also be run on already existing code to see if any improvements
can be made).
VCL is not that intelligent yet (it will happily optimize complete rubbish).
For the best results it’s worth learning how to code in a VCL friendly
style, e.g.:
- Instead of directly incrementing pointers:
    lqi vector, (address++)
    lqi normal, (address++)
you should use offset addressing:
    lq vector, 0(address)
    lq normal, 1(address)
    iaddi address, address, 2
- Make sure that all members of a vector type are accounted for, e.g.
when calculating normal lighting only the xyz part of a vector
is needed, so remember to set the w value to a constant in the
preamble, thus breaking a dependency chain that prevents VCL from
interleaving unrolled loop sections:
    sub.w normal, normal, vf00
More of these techniques are in the VCL documentation. It’s really satisfying
to be able to cut and paste blocks of code together to get the VU
program you need and not need to worry about pairing instructions
and inserting nops.
Built-in Profiling Registers. The EE Core, like all MIPS processors,
has a number of performance registers built into Coprocessor 0 (the
CPU control unit). The PerfTest class reads these registers and
can print out a running commentary on the efficiency of any section
of code you want to sample.
Performance Analyzer. SCE have just announced its hardware Performance
Analyzer (PA). It’s a hardware device that samples activity on the
busses and produces graphs, analysis and in depth insights into
your algorithms. Currently the development support offices are being
fitted with these devices and your teams will be able to book consultation
time with them.