Multi-core
processors are going mainstream. Most PCs sold today and also all new
game consoles allow the parallel execution of several hardware threads.
For games programmers it would be a real shame to have all this
additional computational power available and not use it. The new
consoles do not seem to have out-of-order cores (see [Stokes05]).
Threading is therefore the only way to really use all available
execution units in parallel.
Creating
a multi-threaded game engine is a challenging task (see e.g. [Gabb05]).
Various game tasks like rendering, Artificial Intelligence (AI)
computations, physics or collision detection often rely on the
sequential availability of results from other game tasks. If game
engine developers choose to run several different game tasks, like
physics and rendering, in parallel (task level parallelism) it can be
complicated to have required results from other tasks available in
time. After all one does not want to add too much synchronization
overhead. The other possibility is to work on several problems of the
same kind in a data parallel way (e.g. path finding for 4 characters in
parallel). To be as efficient and as scalable as possible your game
engine should ideally be implemented using a mixture of task level and
data level parallelism. This guarantees scalability for more cores to
be available.
Concerning
multi-threaded software development there are nice documents available
for download that describe all relevant concepts and also most coding
pitfalls (see e.g. [DevMTApps]). Also it should be noted that tools
like VTune* (for finding hotspots), ThreadCecker* (for finding race
conditions and other threading bugs) and ThreadProfiler* (for finding
load balancing and other performance issues) are valuable tools to help
your multi-threaded development effort (see [IntelTools]). The
threading problems you solve on your PC with their help will most
likely help you on other multi-core platforms as well.
[West06]
has written an introduction presenting ideas of how to take advantage
of multiple cores in game engines. This article takes his approach one
step ahead and describes the use of multi-threading to realize
real-time terrain height field smoothing (see Figure 1). The visible
part of the terrain is smoothed in real-time every frame to achieve
smooth LOD transitioning without visible geo-morphing or popping. Also
this way you make sure that you spend your vertex budget for the
terrain in an almost optimal way.
►
Figure 1
It
has to be mentioned that the terrain smoothing described here could
also be done on the graphics card using floating point render targets
(see [Bunnell05]) or ‘Tessellation through Instancing’ (see [Gruen05]).
Also DirectX10* and GeometryShaders are just around the corner. Still,
most games are processor limited on low to medium display resolutions,
so offloading the processor tasks to other cores can help. Again most
games are graphics-card-limited at higher display resolutions. So with
higher display resolution it makes sense to shift load away from the
graphics card as you might desperately need your graphics card power
for other things. Of course all this depends on how you can balance
your processor and graphics card workloads. If you do rendering in
multiple passes on pre DirectX10* hardware, tessellation on the
processor can be faster than doing it potentially many times on the
graphics card.
The
use of OpenMP* (see [OpenMP]) to realize data parallel smoothing is
investigated as a first approach. Since it is harder to use OpenMP* to
realize asynchronous threading it is also shown how to use the Windows*
threading API to realize an asynchronous tessellation that only picks
up a new terrain mesh when it is done. This latter approach has no
impact on the frame time at all. The algorithm described can generate
balanced workloads for even a big number of cores and therefore scales
nicely. You can use it to let just one thread or one free core do the
heavy lifting of smoothing or distribute smaller workloads to cores
that you know are not fully utilized.
Next
this article discusses how a sparse terrain height field can be turned
into an overall smooth surface using Bezier patches. After that it is
described how culling, LOD and thread workload selection can be
realized and how a big and watertight mesh can be generated. Finally
the discussion on how to multi-thread the smoothing work concludes the
article.
Terrain Smoothing
For
simplicity reasons this article chooses one dimensional rectangular
bi-cubic Bezier patches to smooth a sparse height field. These patches
are only used to smooth the height values across the height field, the
other coordinates are generated by linear interpolation. The following
text assumes, because of brevity reasons, that the reader is familiar
with how Bezier patches work. See [Farin96] for a well written
introduction.
Figure
1 shows a 4 by 4 neighborhood within a bigger height field. The left
and bottom corner of this neighborhood is at location (x,y)
(both are integer values) in the height field. The height values,
indicated by thin vertical lines, are meant to be different but are
shown as roughly equal to make the drawing more readable. The patch p shown at the center of the height field is a representative the grid cells defined by the height field. The grid cell for p is defined by h(x+1,y+1), h(x+2,y+1), h(x+1,y+2) and h(x+2,y+2).
In order to replace p by a bi-cubic Bezier patchone needs to define the 16 control heights. These control heights are indicated by b00, b10, b20, b30, b01, b11, b21, b31, b02, b12, b22, b32, b03, b13, b23 and b33.
Figure 2
Because
of brevity reasons – after all the focus of this article is on
multi-threading – Formula 1 simply presents one algorithm to setup the
control heights to achieve an overall C1 smooth terrain surface.
Formula 1
How
does one now evaluate the surface? The key insight is that each patch
defines four vertical Bezier curves defined by four control heights
each. The curves are:
Curve 0 : Defined by b00, b01, b02 and b03
Curve 1 : Defined by b10, b11, b12 and b13
Curve 2 : Defined by b20, b21, b22 and b23
Curve 3 : Defined by b30, b31, b32 and b33
Each
curve can be written as a cubic polynomial with coefficients that can
be derived from the control heights (see [Foley90]). Forward
differencing (see [Foley90]) can be used to step through each of the
curves in constant steps. The implementation of the demo for this
article does this with SSE (see e.g. [Klimovitski01]) compiler
intrinsics. It actually steps all four curves simultaneously using
instruction level parallelism. For each step, forward differencing
gives us four height values. These four height values are again used as
four control heights which now define a curve that runs across the
patch in the horizontal direction. Again forward differencing is used
to step along this horizontal curve. This time SSE is used to transform
the three adds of an FFD step to one 4 dimensional add and one shuffle
operation. Directional derivatives are stepped in a similar manner. It
has to be noted though that the code only generates approximate
derivates along the vertical direction.
Culling and LOD Selection
Now
assume one has some efficient culling algorithm that quickly identifies
the cells of the height field inside the current view frustum. The
output of this culling algorithm is a list of potentially visible grid
cells.
The
goal is to tessellate patches near the viewer with more details and
patches far away from the viewer with less detail. The implementation
in the demo chooses a level of detail (LOD) value for every grid cell.
This LOD value is based on the distance of the center of the grid cell
from the viewer. The LOD is chosen as a floating point number that
defines how many points the FFD algorithm generates along each edge of
the patch. The non-integral part of the LOD is defined in the same way
that D3D defines segment counts for curved surface patches (see
[D3D05]). This ensures that a moving camera generates a very smooth
view dependent LOD transitioning since there is no popping of discrete
LODs.
The
demo generates one big triangle strip for the whole of the tessellated
terrain. It also generates extra triangles that stitch the gaps between
neighboring patches. Using an LOD per edge as [D3D05] does would be
more advanced because stitching triangle would not be needed. The
per-cell LOD approach was used because the setup for forward
differencing is easier to code this way.
Now
that the smoothing mathematics, culling and mesh generation have been
covered the text moves on to describe how to generate a balanced
workload for multiple threads to do the actual tessellation work.
Creating a Balanced Workload for Multi-Threading
The
goal is to distribute the tessellation computations collected by the
culling algorithm evenly to a set of hardware threads. To distribute an
even workload one needs some value that specifies how complex a certain
tessellation task is. Luckily the LOD value of each grid cell describes
how expensive it will be to tessellate the corresponding patch. What is
also nice is that the LOD values usually fall into a very limited
range. In the demo implementation LODs range from 2.0 to 30.0.
Therefore
an algorithm collects all cells with the same integral LOD into a
separate list and essentially sorts patches to be tessellated into
buckets with roughly the same tessellation work to be done. Given a
number of worker threads the code generates an initially empty workload
list for every thread. Now starting from the buckets with the most
complex workload, in a round robin fashion, the workload is distributed
to the threads. This process is outlined by the following pseudo code
Code Fragment 1.
int maxworkload = 0;
int threadindex = 0;
set current bucket to buckets with highest LOD patches;
while( current bucket not empty )
{
Remove patch from current bucket;
Add patch to workload of thread[ threadindex ];
if( workload of thread[ threadindex ] > maxworkload )
{
// update max workload
maxworkload = workload of thread[ threadindex ];
// next thread in a round robin fashion
threadindex = ( threadindex + 1 ) % NumThreads;
}
if( current bucket empty )
{
Choose next non empty bucket with smaller workloads;
If not found break;
}
}
Code Fragment 1
You
can further enhance this algorithm by adding a per thread value that is
used to weight the current thread workload when comparing it with the
current maximum workload. This allows the generation of controlled
uneven workloads with the same algorithm.
From
the LOD value associated with every cell it is also very easy to
compute how many vertices will be produced by a certain tessellation
workload. When offloading workloads to several threads each thread will
also be given an offset along with the starting address of a vertex
buffer that has been locked by the main thread of the application. This
way threads can write their tessellation results into non-overlapping
memory regions.
It
has been shown how to generate a workload distribution for a number of
threads. Next it will be discussed how to architect multi-threaded code
that picks up these workloads. First the use of OpenMP* is discussed
and afterwards the use of the Windows* threading API is described.