|
Introduction
Code
performance is vital to a successful game title, and poor performance
can lead to shipping delays and cutting of content as well as a much
more expensive game development process. Getting code generation right
has to start from the beginning of a project, as the architecture of a
game system determines the performance as much as anything else. A
naïve approach to architecture design will spell disaster for a
project.
Things
are only getting more complicated, since next-gen consoles have
introduced us to multiple cores, longer pipelines, Level 2 caches, and
multi-thread issues. Most next-gen consoles have chosen to pack more
computing elements onto the chips than most workstations. This comes at
the price of the omission of instruction renaming and large caches
intended to make legacy applications run faster.
The
plus side is that next-gen consoles are capable of hundreds of
gigaflops, or about a thousand Cray-1s, the most powerful supercomputer
available in 1976 when I was first learning to code. This article
illustrates a few of the key elements that lead to high performance
code in the system as a whole.
Memory Access
The
most important optimization you can make is to reduce the size of your
data. A typical title will spend a significant period of its time
waiting for Level 1 and Level 2 cache misses. Level 1 misses are bad,
Level 2 misses can be catastrophic.
A
typical Level 1 cache miss on a modern processor will waste tens of
cycles, whereas a Level 2 miss will waste hundreds. It is very
important, therefore, that the game state is kept to under the size of
the Level 2 cache, otherwise the cache will thrash on every frame and
performance will plummet. The drop in performance will be sudden and
unexplained, but the cause will be that the code is touching too much
data in a frame. If the code touches 511k of Level 2 cache in a frame
on a machine that has 512k, it will run fine. As soon as it increases
to more than 512k, the LRU mechanism will cause every access to fail. This was not the case on previous generation consoles that had only Level 1 caches with low cache miss penalties.
Use
of bit fields, carefully designed access methods and heavy duty
pre-processing of game data will significantly reduce game state size.
Memory with larger scope, like collision plane information or AI route
finding should be encouraged to access the same patches of memory on a
frame-by-frame basis. Even though the datasets may be quite large,
regular access will not thrash the cache.
Avoid
using look-up tables and constants in memory. These will cause cache
misses and will thrash Level 2 cache access. GPUs suffer from memory
access problems also. Typically a GPU will also have a cache that
follows the same rules. Use of bytes and shorts for vertex data is
recommended and balancing texture LOD by adjusting mip-map bias is an
important tool. Use anisotropic filtering to sharpen textures instead
of positive LOD bias.
Most
advanced game systems have a method of sub-sampling the artist's
original textures to reduce the texture usage. Different scenes have
different use of the same texture, so that the texture may be loaded at
a lower resolution if the player is never going to encounter that
texture close-up. Finally, the smaller the texture pool size, the
higher will be the GPU performance. Sorting the scene by texture will
preserve GPU cache coherency.
Branch Avoidance
The
next biggest source of optimization is to re-code critical methods to
avoid branches. Included in this are virtual function calls and
pointers-to-functions. Branch miss-prediction causes pipeline stalls
and instruction look-ahead failure, which is usually much more
expensive. Templates are great ways to generate variations on methods
that can avoid virtual function calls and often function pointers
passed as constants to inline functions can be converted to direct
calls. Profile-driven branch tuning can feed-back information about
branches to an executable. Branches usually have two forms, unlikely
and likely. When a branch is first encountered by the thread, the
branch direction is hinted by the likelihood encoded in the
instruction. A profile-driven optimizer can choose branch likelihood
and re-order blocks of code to avoid loading uncommon blocks into Level
1 cache.
We can eliminate branches entirely by changing code of this kind:
if( x == 0 ) y = z;
to
y = x == 0 ? z : y;
and
if( ptr != 0 ) ptr->next = prev;
to
*( ptr != 0 ? &ptr->next : &dummy ) = prev;
With a good compiler, this will execute far faster than the code with branches that the “if” form will generate. Next-gen consoles have deep pipelines that require large uninterrupted function bodies to be able to schedule efficiently.
Inlining
After
branch avoidance, inlining can be used to remove call overhead and
increase function body size, but at the expense of extra Level 2 cache
usage. Some compilers, like the SN Systems compiler (SNC), will
auto-inline functions if requested. Even legacy code can be made to go
faster provided the scope of functions is limited by “static.”
Given
a fresh design, it is possible to arrange for all functions to be
inlinable, especially if all functions are defined in header files and
the large parts of the program are compiled as a single unit. Code like
this will actually compile and link faster than its multi-module
counterpart owing to the reduction of debug information emitted by the
compiler.
Most
compilers have n-squared dependencies on optimization, so if compile
times get high, some bigger blocks may need to be split or lower
optimization levels may have to be used. Another benefit of inlining is
in the field of alias analysis which will help the compiler to remove
unnecessary loads and stores and to lengthen the blocks available to
the scheduler. Constraints on register usage are eliminated by
in-lining.
Floating Point Pipeline Usage
The
current generation of processors have long floating point pipelines
that may allow out-of-order completion. This allows faster clock rates
because the number of gates that the pipeline travels through can be
reduced if an operation is spread over several cycles. The price that
we pay for this is that the results of floating point operations are
not available for tens of cycles. It is important therefore to avoid
using floating point results for as long as possible and to make
function bodies that use floating point code as large as possible with
no branches.
In
particular, be careful when converting the result of a floating point
operation to an integer. The pipeline will have to be flushed, causing
big delays.
Vector Units and Supervectors
Ideally,
all maths must be done using the vector units available either on main
processors or on coprocessors. Unfortunately, retro-fitting a vector
math library to float-based code is rarely effective, resulting in only
a few percent improvement to critical functions. A scalar type should
be included in a vector math library and used extensively in the place
of “float” arithmetic in cases where the cost of copying from scalar to
vector is prohibitive. One way of getting good use out of your vector
units is to use the concept of Supervectors. Instead of vectors of four floats, operate on units of sixteen or more to absorb the latency between instructions.
Instead of:
void Add( vector &elem, vector a, vector b )
{
elem = vec_add( a, b );
}
Write:
void Add( vector elem[], vector a[], vector b[] )
{
elem[ 0 ] = vec_add( a[ 0 ], b[ 0 ] );
elem[ 1 ] = vec_add( a[ 1 ], b[ 1 ] );
elem[ 2 ] = vec_add( a[ 2 ], b[ 2 ] );
elem[ 3 ] = vec_add( a[ 3 ], b[ 3 ] );
elem[ 4 ] = vec_add( a[ 4 ], b[ 4 ] );
elem[ 5 ] = vec_add( a[ 5 ], b[ 5 ] );
elem[ 6 ] = vec_add( a[ 6 ], b[ 6 ] );
elem[ 7 ] = vec_add( a[ 7 ], b[ 7 ] );
}
The
second function will take about the same time to complete as the first,
because of FPU latency, but will do eight times the work. Note that use
of the __restrict keyword may be necessary if the function is not
inlined to help alias analysis. This is an important numeric
optimization. The general rule of thumb is to do the same thing multiple times. Pipelines work best when the same input instruction is used during the lifetime of the pipeline operation.
Next-gen
coprocessors have long latencies, but large numbers of available
registers which can be exploited to fill the gaps caused by
read-after-write latency issues. Structures of arrays methods
can be used to apply regular arithmetic to vector units where
Supervectors are treated like scalars. With SOA methods, values for
many objects are represented as arrays with the x, y and z components
of a vector, for example, all being elements in three separate
Supervectors. It is important to batch up numerical operations
so that they can be performed together. For example, if we have a
particle system that adds the gravity value to the velocity, the
gravity value needs only to be loaded once if all the particles of the
same type are evolved together in one loop.
If
spare GPU power is available, simple linear tasks can be run on the GPU
also, although unless the result is directly used for rendering alone,
synchronization can be an issue. Again it is essential to batch up
tasks that are similar.
Thread Management
First-timers
to embedded multithread environments will usually make the mistake of
spawning too many threads. It is important to spawn only as many
threads as there are hardware resources to execute them, otherwise the
operating system will thrash the threads and take up a considerable
overhead. Although multiple threads can be created on a single core,
they will often fight for resources, negating the advantage of having
them. Care should be taken to schedule complimentary tasks into
threads. For example, mix a memory-intensive task with a computation
intensive task, so that they do not compete for memory resources. Spawn
the threads in the game initialization code; it is not wise to attempt
to spawn threads in the game loop as this is an expensive operation.
Profiling the game code early in the development cycle can spot
operating system overheads before they become serious.
Next-gen engines may have an Execution Plan
where tasks are given fixed timeslots to complete in with the timeslots
organized so that the outputs of one task feed the inputs of another.
Failure of a task to complete in time may cause an assertion failure in
pre-production builds with the short-term irritation tolerated to
prevent last-minute optimization panics and slipped deadlines.
Execution plans can be represented as XML files and tie nicely into
systems like the Collada scene description language. Careful design of
the execution plan for a scene or sub-scene is essential. If the
resources available in a scene are finite, then it is possible to
guarantee frame rate without over-allocation.
Using
multiple threads on one core reduces effective latencies as
instructions are executed alternately. A good compiler will have an
optimization option to support this.
Serialization and Pre-compilation
Serialization
in-game, which causes the long load times common on PC games, can be
avoided by pre-serializing game state data and constants in the game
compilation phase of the tool-chain. Binary images of structures load
much faster as no code needs to be executed during load. Seamless scene
changes are possible as code and data resources for a scene can be
loaded without the use of precious CPU cycles. On consoles, this is the
preferred method.
Games
usually load from DVD, which has little tolerance for seeking. Large
game projects usually have banks of servers to execute game-specific
processes like lightmapping, collision data building, and AI
routefinding table construction. There are a number of distributed
build systems available to help developers build large code and data
systems.
For
example, it is possible to convert game data into binary images of
in-game structures that can be loaded in seconds in megabyte-sized
blocks direct from DVD. Often for DVD efficiency, data are duplicated
to avoid doing seeks. It is a false economy to store a texture only
once on a DVD, for example, as this will cause game loading times to
multiply.
Memory Management
High-performance games do their own memory management. Calling malloc or the default new
in a game loop is considered irresponsible. Memory fragmentation is the
biggest cause of console game crashes and on PC games will cause
slow-downs after periods of gameplay. Heap management systems range
from simple “hunk” allocators to systems that sort block sizes by
location in memory and can allocate from pools in a few cycles. Debug
information is usually a part of heap management. It is useful to see
which systems are allocating blocks of memory.
Avoiding Ransom Blocks
which prevent the merging of two large memory blocks is important. A
scene should be given its own memory area to allocate from. This will
mean that when the scene is torn down, the entire area can be freed
without fear of fragmentation.
Conclusions
Next-gen
consoles are very sensitive to poor programming practice and require
nice long function bodies with no branches in order to be able to
optimize the code effectively. The main CPU, the coprocessors and the
GPU all need careful planning to avoid thrashing of caches, but if they
are treated well, huge performance benefits are possible, far more than
can be achieved with PC games. “The bigger, the better” is the motto
for vectorization. Don't limit yourself to the four floats available in
SIMD implementations - sixteen or thirty-two float Supervectors are much better to soak up the stalls on read-after-write dependencies.
Careful thread management is essential to reduce latency and avoid competition for resources. The use of an Execution Plan is a wise idea to tie together macroscopic components of the system.
_____________________________________________________
|