|
Introduction
There is “A Fundamental Turn Toward Concurrency in Software,” or so says the title of a recent article by Herb Sutter in Dr. Dobb's Journal.
This article is prompted by the emergence of dual-core processors and
the expectation that multi-core processors will eventually become the
norm. Modern computer games stress the limits of current desktop
platforms so game developers will likely be early adopters of dual-core
and multi-core technology.
Threading Prince of Persia: Warrior Within
for GDC 2005, Martin Sevigny, Ubisoft Technical Director, had this to
say about threading: “Dual-core is going to help us produce much more
‘living' games with more animation, AI, physics, all sorts of cool
stuff that we were not able to do before.” The following article is not
a case study of a particular game. Rather, it is intended as a generic
study based on experience with real games. The goal is to present some
of the pitfalls to threading games along with possible solutions.
Defining the Problem
The
inherent complexity of game engines makes them more difficult to thread
than other applications. Various subsystems [e.g., artificial
intelligence (AI), physics, character animation, collision detection,
rendering, audio, etc.] manage aspects of the game engine. As a player
moves through the virtual world of the game, a video forms. However,
there is no predefined sequence of events. The actions of the player
are effectively random from the game's perspective. The game is a
real-time, event-driven process. This creates a problem for threading –
the frame sequence is inherently sequential. In other words, each frame
depends on the preceding frame. Therefore, multiple frames cannot be
computed independently and in parallel to boost frame rates and improve
performance. Techniques to predict more than a frame in advance would
introduce latency and negatively impact the player experience.
Since
it is impossible to exploit inter-frame parallelism, one must look
within a frame for potential threading targets. Simultaneously
computing the AI, character animation, collision detection, etc. would
be ideal. However, most of today's engines were designed without
considering parallel computation. As a result, dependencies exist among
the subsystems that limit the effectiveness of threading (Figure 1).
Here are just a few examples:
- Some characters cannot move until the AI subsystem tells them what to do.
- Sounds (e.g., foot steps, gunshots, speech, etc.) cannot be emitted until all events and actions for the frame are decided.
- Collision detection is a continuous process that can interrupt other computations.
- Rendering cannot occur until the final state of the frame is decided.
|
|
|
 |
 |
 |
Figure 1.
This graph gives an example of the interaction of components in a game
engine. Notice the dependencies that exist between components within
the engine itself: each subsystem can receive information from any
other subsystem to affect results both within a frame or in future
frames.
|
The
extensive inter-frame and intra-frame dependencies described above
limit the opportunities to exploit parallel computing. Inter-frame
dependence means that multiple frames cannot be computed in parallel.
Within a frame, dependencies between subsystems (e.g., AI and character
animation) mean that one must go deeper into the code to find threading
opportunities. This has two disadvantages, but it is necessary to
define a couple of parallel computing concepts before discussing these
disadvantages.
Amdahl's Law
Parallel
speedup is limited by the amount of serial computation. This simple
concept, known as Amdahl's Law, is illustrated in Figure 2. If 50% of
an application is inherently serial, parallelizing the other 50% yields
a maximum theoretical speedup of 1.3 on two processors. The frame
calculation shown in Figure 2 is an over-simplification but it
illustrates how Amdahl's Law can impact intra-frame parallelism. Even
if each subsystem can be parallelized effectively, the work of the
audio and graphics subsystems is still a serial bottleneck; the slower
the subsystem, the worse the bottleneck. The examples in Figure 2 also
assume perfect parallelism. The cost of thread creation,
synchronization, and other parallel overhead is ignored. Maximum
speedup will be lower if these factors are considered. Tools exist to
assist in determining bottlenecks in your application. One such example
is Intel's VTune Performance Analyzer, which can be used to identify
application bottlenecks and suggest additional performance
opportunities.
 |
Amdahl's Law
|
| T p |
Parallel runtime
|
| Ts |
Serial runtime
|
| %S |
Percentage of time spent in serial code
|
| N |
Number of processors
|
|
Task
|
Fraction
|
Serial or Parallel
|
 |
|
AI
|
0.1
|
Parallel
|
|
Character animation
|
0.2
|
Parallel
|
|
Particle dynamics
|
0.1
|
Parallel
|
|
Physics
|
0.2
|
Parallel
|
|
Collision detection
|
0.1
|
Parallel
|
|
Audio
|
0.1
|
Serial
|
|
Video
|
0.2
|
Serial
|
|
|
Figure 2.
According to Amdahl's Law, parallel speedup is limited by the amount of
serial computation. The simple diagram at top left shows that if only
50% of the total runtime is spent in parallel execution, the maximum
theoretical speedup is only 1.3 on two processors.
The
example frame computation assumes that each subsystem is threaded but
the audio and video computations are strictly serial. Even though only
30% of the total frame computation is serial, the maximum theoretical
speedup is only 1.5. Also note that this example assumes perfect
parallelism in each engine. Parallel overhead is ignored.
|
Granularity
Granularity
is another important concept in parallel computing. It is loosely
defined as the amount of work per parallel task. Unlike Amdahl's Law,
there is no equation to determine granularity. It is a relative and
somewhat subjective metric. When dividing a problem into independent
tasks, a fine-grained decomposition produces a large number of small
tasks. A coarse-grained decomposition produces a smaller number of
large tasks. For threading to be beneficial, the amount of work per
thread must be larger than the parallel overhead associated with the
thread. Therefore, coarse granularity usually yields better parallel
performance.
Attempts
to decompose a 3D game engine into independent tasks have, so far,
produced successively finer granularity. Dividing the frames among
threads gives the coarsest granularity, but the video sequence evolves
over time in this type of game, and time is inherently sequential.
Within a frame there are a number of tasks that could be computed in
parallel except for the intricate pattern of dependencies (Figure 1).
The only option remaining is to thread within the individual subsystems
required to calculate each frame. This grain size may be too fine for
threading to benefit some games.
As
a practical example, consider that a typical 3D game engine outputs
many frames per second (FPS). At 30 FPS, each frame takes only 0.0333
seconds to compute. Let's assume that the particle dynamics subsystem
can be effectively parallelize and that it accounts for 5% of each
frame computation (0.0017 seconds). A computation that takes so little
time is unlikely to benefit from parallelism after considering the
overhead of thread creation, scheduling, synchronization, and
management.
Solving the Problem
Amdahl's
Law and fine granularity will limit the benefit of threading in many 3D
game engines. However, increasing the amount of work per frame can
lessen their impact. For example, adding more visual effects,
increasing the number of characters, and using a smarter but more
compute-intensive AI not only improves game play, it can improve the
ratio of time spent in parallel versus serial computation. It also
increases the amount of work per thread, thus improving granularity.
This
solution is relatively easy to implement, and by taking advantage of a
second processor, it is possible to improve game play without
sacrificing frame rates. However, this is hardly a perfect solution. It
is not scalable because the effects of Amdahl's Law and fine
granularity cannot be completely eliminated. The audio and graphics
subsystems still constitute a serial bottleneck. Also, adding more
computation may not be appropriate for games that target consoles with
limited resources.
Redesigning
game engines to take advantage of multiple processors is another
alternative. An engine designed with parallelism in mind should scale
better than a legacy engine retrofitted with threads. However,
developing new games engines is time-consuming. Also, it is not
possible to completely eliminate the dependencies between subsystems
(Figure 1). However, it might be possible to perform these computations
asynchronously, as shown in Figure 3.
In
this model, each subsystem performs its computations continuously with
the latest information available. This would permit each subsystem to
be updated on different threads with frequencies independent of the
render loop. Dependencies still exist. For example, character animation
still requires input from the AI. Rather than sit idle waiting for AI
to complete its current calculation, however, the animation subsystem
grabs the latest AI update. A separate thread periodically sends the
latest frame to be rendered. (For more details and an example
implementation using a simple collision detection system, see the
references in the Suggested Reading at the end of this article.)
This
method is not without disadvantages. First, the threads will
occasionally need to access shared data. Data sharing could limit the
effectiveness of this method depending on the level of synchronization
required. Second, an asynchronous engine is definitely more complex
than existing sequential engines. Designing an asynchronous engine will
require some experimentation and calibration. Some subsystems take
longer to compute than others. Some subsystems must be updated more
frequently than others. It could also be advantageous to map more than
one subsystem to the same thread, especially if those subsystems are
tightly coupled. Fortunately, the industry is developing tools to
assist with writing parallel code. For example, the Intel® Thread
Checker identifies race conditions, such as storage conflicts and
deadlocks, and other thread-specific problems that affect program
correctness before they happen. The errors reported by the Intel®
Thread Checker often indicate sections of code that require
synchronization. For example, two threads simultaneously writing to the
same memory location can cause data corruption. This type of race
condition, known as a storage conflict, is prevented by synchronizing
access to the shared variable or by giving each thread a private copy
of the variable.
After
the threading problems are fixed and the code is working correctly,
performance improvements should be validated. For this, a developer can
use the internal performance profilers to validate that the new code is
providing the benefits expected. If these internal tools do not exist
or a developer wishes to further validate the changes, the developer
can use the Intel® Thread Profiler, to identify performance problems
such as lock contention, load imbalance, excessive system overhead,
etc., that limit multithreaded performance. Once identified, there are
several common techniques to address parallel performance problems. For
example, making some variables or data structures private to each
thread can eliminate the need for synchronization.
Figure
3 illustrates the importance and difficulty of calibrating the
computation frequency of each subsystem. The collision detection (CD)
subsystem is mapped to thread-3 (T3). When the second CD computation
begins (indicated by the first dotted blue line), the character
animation (CA) is still in progress. Therefore, the first CD
computation is still valid. Performing the second CD computation wastes
processor resources and provides no additional information. New CA
information is available by the third CD computation (indicated by the
second dotted blue line), even though AI is still in progress.
Similarly, the first and second render computations proceed with
updated CA information but the third render computation does not.
Without updated CA information, is rendering a new frame necessary?
These examples show how the computation frequency of each subsystem can
impact parallel efficiency.
|
|
|
 |
 |
 |
Figure 3.
In this model, the artificial intelligence (AI), character animation
(CA), and collision detection (CD) subsystems are mapped to separate
threads that run asynchronously. A fourth thread periodically gathers
the latest frame information and sends it off to be rendered.
|
Conclusions
Threading
real-time, event-driven games to take advantage of multiple processors
is a complicated task. Retrofitting a legacy game engine with threads
is possible but parallel performance may vary widely. Redesigning game
engines with parallelism in mind is a better long-term solution.
Suggested Reading
_____________________________________________________
|