There is “A Fundamental Turn Toward Concurrency in Software,” or so says the title of a recent article by Herb Sutter in Dr. Dobb's Journal. This article is prompted by the emergence of dual-core processors and the expectation that multi-core processors will eventually become the norm. Modern computer games stress the limits of current desktop platforms so game developers will likely be early adopters of dual-core and multi-core technology.
Threading Prince of Persia: Warrior Within for GDC 2005, Martin Sevigny, Ubisoft Technical Director, had this to say about threading: “Dual-core is going to help us produce much more ‘living' games with more animation, AI, physics, all sorts of cool stuff that we were not able to do before.” The following article is not a case study of a particular game. Rather, it is intended as a generic study based on experience with real games. The goal is to present some of the pitfalls to threading games along with possible solutions.
Defining the Problem
The inherent complexity of game engines makes them more difficult to thread than other applications. Various subsystems [e.g., artificial intelligence (AI), physics, character animation, collision detection, rendering, audio, etc.] manage aspects of the game engine. As a player moves through the virtual world of the game, a video forms. However, there is no predefined sequence of events. The actions of the player are effectively random from the game's perspective. The game is a real-time, event-driven process. This creates a problem for threading – the frame sequence is inherently sequential. In other words, each frame depends on the preceding frame. Therefore, multiple frames cannot be computed independently and in parallel to boost frame rates and improve performance. Techniques to predict more than a frame in advance would introduce latency and negatively impact the player experience.
Since it is impossible to exploit inter-frame parallelism, one must look within a frame for potential threading targets. Simultaneously computing the AI, character animation, collision detection, etc. would be ideal. However, most of today's engines were designed without considering parallel computation. As a result, dependencies exist among the subsystems that limit the effectiveness of threading (Figure 1). Here are just a few examples:
|Figure 1. This graph gives an example of the interaction of components in a game engine. Notice the dependencies that exist between components within the engine itself: each subsystem can receive information from any other subsystem to affect results both within a frame or in future frames.|
The extensive inter-frame and intra-frame dependencies described above limit the opportunities to exploit parallel computing. Inter-frame dependence means that multiple frames cannot be computed in parallel. Within a frame, dependencies between subsystems (e.g., AI and character animation) mean that one must go deeper into the code to find threading opportunities. This has two disadvantages, but it is necessary to define a couple of parallel computing concepts before discussing these disadvantages.
speedup is limited by the amount of serial computation. This simple
concept, known as Amdahl's Law, is illustrated in Figure 2. If 50% of
an application is inherently serial, parallelizing the other 50% yields
a maximum theoretical speedup of 1.3 on two processors. The frame
calculation shown in Figure 2 is an over-simplification but it
illustrates how Amdahl's Law can impact intra-frame parallelism. Even
if each subsystem can be parallelized effectively, the work of the
audio and graphics subsystems is still a serial bottleneck; the slower
the subsystem, the worse the bottleneck. The examples in Figure 2 also
assume perfect parallelism. The cost of thread creation,
synchronization, and other parallel overhead is ignored. Maximum
speedup will be lower if these factors are considered. Tools exist to
assist in determining bottlenecks in your application. One such example
is Intel's VTune Performance Analyzer, which can be used to identify
application bottlenecks and suggest additional performance
Percentage of time spent in serial code
Number of processors
Figure 2. According to Amdahl's Law, parallel speedup is limited by the amount of serial computation. The simple diagram at top left shows that if only 50% of the total runtime is spent in parallel execution, the maximum theoretical speedup is only 1.3 on two processors.
The example frame computation assumes that each subsystem is threaded but the audio and video computations are strictly serial. Even though only 30% of the total frame computation is serial, the maximum theoretical speedup is only 1.5. Also note that this example assumes perfect parallelism in each engine. Parallel overhead is ignored.
Granularity is another important concept in parallel computing. It is loosely defined as the amount of work per parallel task. Unlike Amdahl's Law, there is no equation to determine granularity. It is a relative and somewhat subjective metric. When dividing a problem into independent tasks, a fine-grained decomposition produces a large number of small tasks. A coarse-grained decomposition produces a smaller number of large tasks. For threading to be beneficial, the amount of work per thread must be larger than the parallel overhead associated with the thread. Therefore, coarse granularity usually yields better parallel performance.
Attempts to decompose a 3D game engine into independent tasks have, so far, produced successively finer granularity. Dividing the frames among threads gives the coarsest granularity, but the video sequence evolves over time in this type of game, and time is inherently sequential. Within a frame there are a number of tasks that could be computed in parallel except for the intricate pattern of dependencies (Figure 1). The only option remaining is to thread within the individual subsystems required to calculate each frame. This grain size may be too fine for threading to benefit some games.
As a practical example, consider that a typical 3D game engine outputs many frames per second (FPS). At 30 FPS, each frame takes only 0.0333 seconds to compute. Let's assume that the particle dynamics subsystem can be effectively parallelize and that it accounts for 5% of each frame computation (0.0017 seconds). A computation that takes so little time is unlikely to benefit from parallelism after considering the overhead of thread creation, scheduling, synchronization, and management.
Solving the Problem
Amdahl's Law and fine granularity will limit the benefit of threading in many 3D game engines. However, increasing the amount of work per frame can lessen their impact. For example, adding more visual effects, increasing the number of characters, and using a smarter but more compute-intensive AI not only improves game play, it can improve the ratio of time spent in parallel versus serial computation. It also increases the amount of work per thread, thus improving granularity.
This solution is relatively easy to implement, and by taking advantage of a second processor, it is possible to improve game play without sacrificing frame rates. However, this is hardly a perfect solution. It is not scalable because the effects of Amdahl's Law and fine granularity cannot be completely eliminated. The audio and graphics subsystems still constitute a serial bottleneck. Also, adding more computation may not be appropriate for games that target consoles with limited resources.
Redesigning game engines to take advantage of multiple processors is another alternative. An engine designed with parallelism in mind should scale better than a legacy engine retrofitted with threads. However, developing new games engines is time-consuming. Also, it is not possible to completely eliminate the dependencies between subsystems (Figure 1). However, it might be possible to perform these computations asynchronously, as shown in Figure 3.
In this model, each subsystem performs its computations continuously with the latest information available. This would permit each subsystem to be updated on different threads with frequencies independent of the render loop. Dependencies still exist. For example, character animation still requires input from the AI. Rather than sit idle waiting for AI to complete its current calculation, however, the animation subsystem grabs the latest AI update. A separate thread periodically sends the latest frame to be rendered. (For more details and an example implementation using a simple collision detection system, see the references in the Suggested Reading at the end of this article.)
This method is not without disadvantages. First, the threads will occasionally need to access shared data. Data sharing could limit the effectiveness of this method depending on the level of synchronization required. Second, an asynchronous engine is definitely more complex than existing sequential engines. Designing an asynchronous engine will require some experimentation and calibration. Some subsystems take longer to compute than others. Some subsystems must be updated more frequently than others. It could also be advantageous to map more than one subsystem to the same thread, especially if those subsystems are tightly coupled. Fortunately, the industry is developing tools to assist with writing parallel code. For example, the Intel® Thread Checker identifies race conditions, such as storage conflicts and deadlocks, and other thread-specific problems that affect program correctness before they happen. The errors reported by the Intel® Thread Checker often indicate sections of code that require synchronization. For example, two threads simultaneously writing to the same memory location can cause data corruption. This type of race condition, known as a storage conflict, is prevented by synchronizing access to the shared variable or by giving each thread a private copy of the variable.
After the threading problems are fixed and the code is working correctly, performance improvements should be validated. For this, a developer can use the internal performance profilers to validate that the new code is providing the benefits expected. If these internal tools do not exist or a developer wishes to further validate the changes, the developer can use the Intel® Thread Profiler, to identify performance problems such as lock contention, load imbalance, excessive system overhead, etc., that limit multithreaded performance. Once identified, there are several common techniques to address parallel performance problems. For example, making some variables or data structures private to each thread can eliminate the need for synchronization.
Figure 3 illustrates the importance and difficulty of calibrating the computation frequency of each subsystem. The collision detection (CD) subsystem is mapped to thread-3 (T3). When the second CD computation begins (indicated by the first dotted blue line), the character animation (CA) is still in progress. Therefore, the first CD computation is still valid. Performing the second CD computation wastes processor resources and provides no additional information. New CA information is available by the third CD computation (indicated by the second dotted blue line), even though AI is still in progress. Similarly, the first and second render computations proceed with updated CA information but the third render computation does not. Without updated CA information, is rendering a new frame necessary? These examples show how the computation frequency of each subsystem can impact parallel efficiency.
|Figure 3. In this model, the artificial intelligence (AI), character animation (CA), and collision detection (CD) subsystems are mapped to separate threads that run asynchronously. A fourth thread periodically gathers the latest frame information and sends it off to be rendered.|
Threading real-time, event-driven games to take advantage of multiple processors is a complicated task. Retrofitting a legacy game engine with threads is possible but parallel performance may vary widely. Redesigning game engines with parallelism in mind is a better long-term solution.