You can use OpenMP* to add thread-level parallelism to your application by
adding special OpenMP* compiler directives to your source code. These
directives come as pragmas that do not change the semantics of your
initial code and are therefore non-intrusive. As you will see, they are
easy to add, and you can use them to parallelize your code incrementally.
Of course you need a compiler that supports OpenMP*, but luckily both
VS2005* and the Intel C/C++ compiler support it. Source code with OpenMP*
pragmas is also portable to the Xbox360, because the Microsoft* compiler
supports it there as well. This article is not meant to be a primer on
OpenMP*; please see [OpenMP] for an in-depth introduction.
The main usage model for OpenMP* is fork-and-join multi-threading: a set
of threads forks from the main execution flow and works together on a
shared set of tasks. After having finished their work, the threads all
join again. Since OpenMP* uses an internal thread pool, there is no
thread creation or cleanup overhead each time this happens.
For the multi-threaded tessellation, a data-parallel OpenMP* pragma is used to parallelize a for() loop. As indicated by Code Fragment 2, OpenMP* is told to work on N
tessellation workloads in parallel. The OpenMP* runtime decides how
many threads it uses to do this work. By default this is the number
of hardware threads supported by your machine, but it can be changed
through OpenMP* library calls.
Code Fragment 2
Figure 2 shows what would happen on a machine capable of four hardware threads. Please note that the MainThread
also works alongside the other threads. In addition, the main thread
does the culling, the workload distribution and the drawing.
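A data-parallel loop of this kind can be sketched as follows. This is a minimal sketch rather than the demo's actual code: `Workload`, `tessellate()` and `tessellate_all()` are hypothetical names, and the tessellation itself is replaced by a placeholder. The optional `omp_set_num_threads()` call illustrates how the default thread count can be overridden through the OpenMP* library.

```cpp
#include <cassert>
#include <vector>
#ifdef _OPENMP
#include <omp.h>  // only needed for the optional omp_set_num_threads() call
#endif

// Hypothetical stand-in for one unit of tessellation work (e.g. one patch).
struct Workload {
    int patch_id;          // which patch to tessellate
    int vertices_written;  // filled in by tessellate()
};

// Placeholder for the real SSE-based tessellation routine. It only records
// a made-up vertex count so that the example has an observable result.
void tessellate(Workload& w) {
    w.vertices_written = 4 * (w.patch_id + 1);
}

// Tessellate all workloads. The iterations are independent of each other,
// so OpenMP may distribute them over its threads; the calling thread takes
// part in the work as well. num_threads == 0 keeps the runtime's default.
void tessellate_all(std::vector<Workload>& workloads, int num_threads = 0) {
    assert(num_threads >= 0);
#ifdef _OPENMP
    if (num_threads > 0) {
        omp_set_num_threads(num_threads);  // override the default thread count
    }
#else
    (void)num_threads;  // built without OpenMP: the loop below runs serially
#endif
    const int n = static_cast<int>(workloads.size());
    #pragma omp parallel for
    for (int i = 0; i < n; ++i) {
        tessellate(workloads[i]);
    }
}
```

If the file is compiled without OpenMP* support, the pragma is ignored and the loop runs serially with the same result, which is what makes this kind of incremental parallelization non-intrusive.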
When you start the demo application, it starts in a mode that uses
OpenMP* to tessellate the workloads in a data-parallel fashion as just
described. A slider in the demo can be used to tell OpenMP* how many
threads to use, up to as many hardware threads as your machine can run
in parallel.
Figure 3
Unfortunately, the heavy use of SSE on all threads does not work well
when all logical processors of a hyperthreading (HT) system are used,
and will even result in a slowdown. D3D* and the graphics driver, which
run on the main thread, also make use of the SSE units. If you wanted to
use all logical processors and still gain a speedup, you would have to
write additional tessellation code that does not use the SSE units at
all. The demo can use affinity masks to try to make sure that only one
of the two logical processors of an HT core is used for tessellation
(see below). Still, if you ever get hold of a real four-core machine,
the demo allows you to use all of its cores.
To verify that the demo can really reach a speedup on a dual-core machine, do the following:
Make sure that the device settings indicate that vertical syncing is off.
Set the number of threads to be used to one.
Tick ‘Use OpenMP’.
Increase the viewing distance until the frame rate drops to 60 FPS. It is assumed that your graphics card is fast enough to run the initial settings at over 60 FPS.
Increase the number of threads to be used to two.
You should now see the frame rate go up again, provided that you really
have a two-core machine. If there is no speedup or almost no speedup,
the tessellation workload is not the limiting factor. Most likely
your graphics card is then transform limited or memory (transfer)
limited, which means that rendering is relatively expensive. To check
this, un-tick the ‘Tessellation running’ box. You should then see how
fast your card can draw the vertex load generated by the tessellation.
You will have noticed that the speedup is not necessarily very high.
Depending on your system and your graphics card, the frame rate may
increase from, for example, 60 to 75 FPS, which is a speedup of 25%.
Again, how much speedup you get is determined by how fast your system
can render the tessellated scene. If the rendering cost is small
compared to the tessellation cost, the speedup gained with OpenMP* can
be higher. One test system I used produced a 50% speedup.
If you bring up the Windows* task manager, it becomes apparent that
OpenMP* does not use affinity masks to lock threads to certain cores or
processors. You will see that Windows* reschedules the tessellation
threads, trying to minimize core utilization. For our purpose this is
not too bad, but it might be worth trying to bind threads to certain
cores.
The reason why one cannot get a higher speedup on certain systems (where
rendering is relatively expensive) is that the tessellation work
represents only a relatively small percentage of the overall per-frame
processor load on these systems. Culling, workload distribution and,
above all, rendering take most of the time. This is not necessarily a
problem and can actually be predicted by Amdahl’s Law (see [DevMTApps]).
In a nutshell, this law states that the maximum parallel speedup one can
reach is limited by the serial portion of the code. Since the rendering
is done in just one thread, it limits the speedup. Still, it is possible
to reach higher frame rates by decoupling the tessellation work from
rendering. How this can be done is discussed next.
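To make the Amdahl's Law argument concrete, the following sketch computes the speedup for a given parallel fraction and, conversely, the parallel fraction implied by a measured speedup. The function names are made up for this illustration, and the model assumes that the frame time splits cleanly into a serial part and a perfectly parallel part, which is only an approximation of a real frame.

```cpp
#include <cassert>

// Amdahl's Law: speedup on n threads when a fraction p of the frame time
// (0 <= p <= 1) can run in parallel and the remaining 1 - p stays serial.
double amdahl_speedup(double p, int n) {
    assert(p >= 0.0 && p <= 1.0 && n >= 1);
    return 1.0 / ((1.0 - p) + p / n);
}

// Inverse of the formula above: the parallel fraction p that is implied by
// a measured speedup s on n threads (n >= 2).
double parallel_fraction(double s, int n) {
    assert(s >= 1.0 && n >= 2);
    return (1.0 - 1.0 / s) * n / (n - 1.0);
}

// Upper bound on the speedup for an unlimited number of threads: only the
// serial part 1 - p remains.
double amdahl_limit(double p) {
    assert(p >= 0.0 && p < 1.0);
    return 1.0 / (1.0 - p);
}
```

Under these assumptions, going from 60 to 75 FPS on two cores (a speedup of 1.25) corresponds to a parallel fraction of 0.4, so no number of additional cores could push the speedup beyond 1 / 0.6, or roughly 1.67. The 50% speedup mentioned earlier would correspond to a parallel fraction of about 0.67.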
Asynchronous Multi-Threaded Tessellation
To reach a maximum frame rate on systems where the rendering cost is
high compared to the tessellation cost, one ideally wants to completely
decouple rendering from culling and tessellation. The basic idea is to
pick up a new terrain tessellation only when it is done. To cope with
camera movements, a triangle strip for an enlarged view frustum can be
generated. The demo does not do this, so you will notice that during
fast rotations there is simply no terrain available for a short moment.
The asynchronous threading architecture that is realized in the demo
(activated if you un-tick ‘Use OpenMP’) is shown in Figure 3. This
architecture needs two vertex buffers that are used alternately. One
vertex buffer is rendered by the main thread while the other is filled
asynchronously by the tessellation threads. Every frame, the main thread
checks whether a new tessellation is available. If so, it draws from the
newly filled vertex buffer from then on. It then locks the other (old)
vertex buffer and hands it off to the tessellation threads to be filled.
The two buffers swap roles in this round-robin fashion.
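The buffer hand-over described above can be sketched as a small double-buffering helper. This is an illustrative sketch, not the demo's code: `VertexBuffer` is a plain `std::vector` standing in for a D3D* vertex buffer, and the class and method names are hypothetical. It also assumes that the tessellation side only writes to the back buffer after the render side has told it to start (in the demo this is done with an event, which is outside the scope of this class).

```cpp
#include <atomic>
#include <cassert>
#include <vector>

// Hypothetical stand-in for a dynamic D3D vertex buffer.
using VertexBuffer = std::vector<float>;

class DoubleBufferedTerrain {
public:
    // Tessellation side: the buffer that is currently being filled.
    VertexBuffer& back() {
        return buffers_[1 - front_.load(std::memory_order_acquire)];
    }

    // Tessellation side: announce that the back buffer is completely filled.
    void publish() { ready_.store(true, std::memory_order_release); }

    // Render side, called once per frame. If a new tessellation has been
    // published, the two buffers swap roles and true is returned; the caller
    // would then hand the new back buffer to the tessellation threads.
    bool try_swap() {
        if (!ready_.exchange(false, std::memory_order_acquire)) {
            return false;  // nothing new yet: keep drawing the old buffer
        }
        front_.store(1 - front_.load(std::memory_order_relaxed),
                     std::memory_order_release);
        return true;
    }

    // Render side: the buffer to draw from in this frame.
    const VertexBuffer& front() const {
        return buffers_[front_.load(std::memory_order_acquire)];
    }

private:
    VertexBuffer buffers_[2];
    std::atomic<int> front_{0};      // index of the buffer being rendered
    std::atomic<bool> ready_{false}; // true once the back buffer is complete
};
```

The render thread never blocks in `try_swap()`, which is what allows the frame rate to stay independent of how long a tessellation takes.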
The synchronization of the threads is handled using Windows* events. One
event is used to signal the main tessellation thread that it should
start a new tessellation. The main tessellation thread uses another
event to signal to the main thread that a new tessellation is
available. The main tessellation thread itself first does the culling
and the workload distribution. After that, if there are more than two
cores in your system, it signals a set of events that kick off
additional tessellation threads. The additional tessellation threads
work along with the main tessellation thread to finish the
tessellation. Each additional tessellation thread signals the main
tessellation thread when it has finished its job by setting its own
event.
Figure 4
The main tessellation thread does a WaitForMultipleObjects() to wait for all its siblings to finish before signaling the main thread.
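The signalling pattern between the main tessellation thread and its helpers can be sketched in portable C++. Note the assumptions: the `Event` class below is a stand-in built on `std::condition_variable` that mimics an auto-reset event, whereas the demo uses Windows* event objects; `tessellate_pass()` is a hypothetical name; the helper threads are created inside the function only to keep the example self-contained, whereas the demo creates its threads once at startup; and the per-workload "tessellation" is a placeholder computation.

```cpp
#include <cassert>
#include <condition_variable>
#include <mutex>
#include <thread>
#include <vector>

// Portable stand-in for an auto-reset event (assumption of this sketch).
class Event {
public:
    void set() {
        {
            std::lock_guard<std::mutex> lock(m_);
            signaled_ = true;
        }
        cv_.notify_one();
    }
    void wait() {
        std::unique_lock<std::mutex> lock(m_);
        cv_.wait(lock, [this] { return signaled_; });
        signaled_ = false;  // auto-reset after a successful wait
    }
private:
    std::mutex m_;
    std::condition_variable cv_;
    bool signaled_ = false;
};

// One pass of the main tessellation thread: kick off the helpers, do its
// own share of the work, then wait for every helper before returning.
std::vector<int> tessellate_pass(int num_workloads, int num_helpers) {
    assert(num_workloads >= 0 && num_helpers >= 0);
    std::vector<int> result(num_workloads, 0);
    std::vector<Event> start(num_helpers), done(num_helpers);
    std::vector<std::thread> helpers;
    const int stride = num_helpers + 1;  // helpers plus the main tessellation thread

    for (int h = 0; h < num_helpers; ++h) {
        helpers.emplace_back([&, h] {
            start[h].wait();  // sleep until kicked off
            for (int i = h + 1; i < num_workloads; i += stride) {
                result[i] = i * i;  // placeholder for tessellating workload i
            }
            done[h].set();  // tell the main tessellation thread we are finished
        });
    }

    // Culling and workload distribution would happen here.
    for (auto& e : start) e.set();
    for (int i = 0; i < num_workloads; i += stride) {
        result[i] = i * i;  // the main tessellation thread works too
    }
    // Plays the role of WaitForMultipleObjects() with wait-all semantics.
    for (auto& e : done) e.wait();

    for (auto& t : helpers) t.join();
    return result;
}
```

Each workload index is written by exactly one thread, so the result vector needs no additional locking.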
Initially the demo application does not run completely asynchronously:
the main thread waits until the tessellation threads have finished the
tessellation that was kicked off in the previous frame. Interestingly,
you will still see frames with an incomplete terrain. The reason for
this is that the main tessellation thread has picked up a view cone for
culling that is not the same as the one used by the main thread when
drawing the actual frame. This can be rectified if we accept a
one-frame lag.
You can now switch to fully asynchronous mode by un-ticking ‘Wait for
tessellation’. In this case the main thread switches to a new
tessellation only once it is done.
All threads used by the tessellation are created at the startup of the
demo, so no thread creation or cleanup is going on while the demo is
running. In addition, all threads, including the main thread, can be
affinity-bound to exactly one logical processor of one of the cores by
setting the appropriate affinity masks for them. This has been done so
that the Windows* task manager can be used to see how much processor
time is spent on tessellation and on each core, provided that Windows*
really respects the affinity masks.
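A mask that selects one logical processor per core could be built as sketched below. This is a hedged illustration with hypothetical helper names. It assumes that the logical processors of one core are numbered consecutively, which is not guaranteed; in practice the numbering should come from a CPU detection library. The Windows* call `SetThreadAffinityMask()` is only compiled on Windows* builds.

```cpp
#include <cassert>
#include <cstdint>
#ifdef _WIN32
#include <windows.h>
#endif

// Mask with exactly one bit set: the first logical processor of `core`.
// ASSUMPTION: core c owns the consecutive bits c*k .. c*k + k - 1, where
// k is the number of logical processors per core.
std::uint64_t first_logical_of_core(int core, int logical_per_core) {
    assert(core >= 0 && logical_per_core >= 1);
    assert((core + 1) * logical_per_core <= 64);
    return std::uint64_t{1} << (core * logical_per_core);
}

// Mask with one bit set per physical core, i.e. every HT sibling except
// the first one of each core is left out.
std::uint64_t one_logical_per_core_mask(int cores, int logical_per_core) {
    assert(cores >= 1 && logical_per_core >= 1);
    assert(cores * logical_per_core <= 64);
    std::uint64_t mask = 0;
    for (int c = 0; c < cores; ++c) {
        mask |= first_logical_of_core(c, logical_per_core);
    }
    return mask;
}

#ifdef _WIN32
// Bind the calling thread to the processors in `mask`.
// SetThreadAffinityMask() returns 0 on failure.
bool bind_current_thread(std::uint64_t mask) {
    return SetThreadAffinityMask(GetCurrentThread(),
                                 static_cast<DWORD_PTR>(mask)) != 0;
}
#endif
```

For a dual-core HT machine under the numbering assumption above, the tessellation threads would be bound to the masks 0x1 and 0x4 respectively, leaving the second logical processor of each core unused.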
The slider for the number of threads has a different meaning when ‘Use
OpenMP’ is un-ticked and the demo runs asynchronously. It then specifies
the number of threads, including the main tessellation thread, to be
used for the asynchronous tessellation. On a two-core machine it should
be left at a value of one.
Compared to the OpenMP* mode, and using the same viewing distance and
tessellation settings, you should now see a much higher frame rate if
the render cost is high compared to the tessellation cost. If rendering
is cheap compared to tessellation, you will see a smaller speedup than
with OpenMP*. Depending on your machine, this means you can let the
player look even further or increase the quality of the tessellation. If
you un-tick ‘Wait for tessellation’, the frame rate you see is
independent of the complexity of the tessellation workload. It should be
the same as the frame rate you see when un-ticking ‘Tessellation
running’.
The Demo
The source code for the demo (see Figure 5) is available for download, so
everybody can have a look. The culling code that has been implemented
is far from optimal, but you can replace it with your own culling code.
The demo pre-computes a grid of patches from a height field in memory. It
would be easy to change the code to work on a height field that is
synthesized on the fly and does not sit in memory at all. It is
probably also easy to port the SSE intrinsics to appropriate code for
the vector units of the new consoles.
In addition to the tessellation code, you will find the source code for a
library that implements CPU detection (written by my colleague Leigh
Davies). The CPU detection library enumerates cores and logical
processors, which enables you to detect which logical processors belong
to an HT core. Please note that the CPU detection code is supposed to
work on all IA32* PC processors, not only on Intel processors.
Conclusion
This article has described how to multi-thread terrain smoothing in a
scalable way. The tessellation gets faster with every additional core
you allow the code to use. Initial performance tests indicate that the
OpenMP* code path can tessellate and display a terrain at around 20-40
million vertices per second on a dual-core processor system.
Furthermore, the graphics card that was used could draw tessellated
terrain from dynamic vertex buffers at roughly 70 million vertices per
second. This indicates that additional cores can be used successfully
for dynamic terrain tessellation and for generating other dynamic
geometry such as procedural plants. Just imagine a forest with trees
that all look different. It has also been shown that additional cores
can be used to offload tasks from the graphics card, which would
otherwise have to do the terrain tessellation in addition to what it has
to do anyway. It seems as if the new consoles have even more efficient
ways to push dynamic geometry to the graphics card (see [Stokes05]), so
the approach described in this article could probably be applied there
very successfully.
[D3D05] DirectX9c December SDK. Available online from www.microsoft.com
[Farin96] Farin, Gerald E., “Curves and Surfaces for Computer-Aided Geometric Design”, Academic Press Inc. (London) Ltd, October 8, 1996
[Foley90] Foley, James D., van Dam, Andries, Feiner, Steven K., Hughes, John F., “Computer Graphics”, Addison Wesley, 1990
[Gruen05] Gruen, Holger, “Efficient Tessellation on the GPU through Instancing”, Journal of Game Development, Volume 1, Issue 3, Thomson Delmar Learning, December 2005
[Bunnell05] Bunnell, Michael, “Adaptive Tessellation of Subdivision Surfaces with Displacement Mapping”, GPU Gems II, Addison Wesley, 2005