[Single-threaded game engines can still work, but they are becoming increasingly outclassed by multithreaded solutions that are more sophisticated, but also more complex to create and optimize. In this sponsored article, part of the Intel Visual Computing Microsite, Intel application engineer Jeff Andrews lays that process bare.]
With the advent of multiple cores within a processor, the need to create a parallel game
engine has become increasingly important. Although it is still possible to focus primarily
on the GPU and keep the game engine single-threaded, using all the processors on a
system, whether CPU or GPU, can give the user a much better experience.
For example, by using more CPU cores, a game can increase the number of rigid body
physics objects for greater effects on the screen. Optimized games might also yield a
smarter AI.
This white paper covers the basic methods for creating and optimizing a multi-core
game engine. It also describes the state manager and messaging mechanism that
keeps data in sync and offers several tips for optimizing parallel execution. Multiple
block diagrams are used to depict theoretical overviews for managing tasks and states,
interfacing with scenes, objects, and tasks, and then initializing, loading, and looping
within synchronized threads.
Introduction
The Parallel Game Engine Framework, referred to here as the engine, is a multi-threaded game engine designed to scale to all available
processors within a platform. To do this, the engine executes different functional blocks in parallel so that it can use
all available processors. However, this is easier said than done, because the many pieces of a game engine
interact with one another, which can cause threading errors. The engine takes these scenarios into account and has
mechanisms for keeping data properly synchronized without being bound by synchronization locks. The engine
also has a method for executing data synchronization in parallel to keep serial execution time to a minimum.
Parallel Execution State
The concept of a parallel execution state in an engine is crucial to
an efficient multi-threaded runtime. For a game engine to truly run
in parallel, with as little synchronization overhead as possible, each
system must operate within its own execution state and interact
minimally with anything else going on in the engine. Data still needs
to be shared, but instead of each system accessing a common
data location to get position or orientation data, for example, each
system has its own copy, removing the data dependency that exists
between different parts of the engine. If a system makes any changes
to the shared data, it sends notices to a state manager, which in turn
queues the changes; this process is called messaging. Once the different
systems have finished executing, they are notified of the state changes and
update their internal data structures, which is also part of messaging.
Using this mechanism greatly reduces synchronization overhead,
allowing systems to act more independently.
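The messaging flow described above can be sketched as follows. This is a minimal, single-threaded illustration rather than the engine's actual implementation: the names StateManager, System, post_change, distribute, and on_change are hypothetical, and real systems would run on separate threads.

```python
from collections import defaultdict


class StateManager:
    """Queues change notices during a step and delivers them afterwards."""

    def __init__(self):
        self._observers = defaultdict(list)  # key -> systems interested in it
        self._queue = []                     # pending (sender, key, value) notices

    def register(self, system, key):
        self._observers[key].append(system)

    def post_change(self, sender, key, value):
        # Called by a system while it executes; nothing is delivered yet.
        self._queue.append((sender, key, value))

    def distribute(self):
        # Called once every system has finished executing for this step.
        pending, self._queue = self._queue, []
        for sender, key, value in pending:
            for system in self._observers[key]:
                if system is not sender:
                    system.on_change(key, value)


class System:
    """Hypothetical engine system that keeps its own copy of shared data."""

    def __init__(self, name, manager):
        self.name = name
        self.manager = manager
        self.local = {}  # private copy; no other system reads it directly

    def on_change(self, key, value):
        self.local[key] = value

    def write(self, key, value):
        self.local[key] = value
        self.manager.post_change(self, key, value)


manager = StateManager()
physics = System("physics", manager)
graphics = System("graphics", manager)
manager.register(physics, "crate.position")
manager.register(graphics, "crate.position")

physics.write("crate.position", (1.0, 2.0, 0.0))
seen_before = graphics.local.get("crate.position")  # not delivered yet
manager.distribute()
seen_after = graphics.local["crate.position"]
```

Because graphics only ever reads its own copy, it never touches data that physics is in the middle of changing; the only shared point is the queue inside the state manager.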
Execution Modes
Execution state management works best when operations are
synchronized to a clock, which means the different systems execute
synchronously. One clock step may or may not correspond to one
frame time, and it does not need to. The clock does not even have
to run at a fixed frequency; it can instead be tied to frame count,
so that one clock step lasts however long it takes to complete
one frame, regardless of length. The implementation
one chooses to use for the execution state will determine
the clock time. Figure 1 shows the different systems operating in
the free step mode of execution, which means they don't have to
complete their execution on the same clock. There is also a lock
step mode of execution (Figure 2) in which all systems complete in
one clock. The main difference between the two modes is that free
step trades simplicity for flexibility, while lock step does
the reverse.
Figure 1. Execution state using the free step mode.
Free Step Mode
This execution mode allows a system to operate in the time it needs
to complete its calculations. "Free" can be misleading, because a
system is not free to complete whenever it wants, but is free to select
the number of clocks it needs to execute.
With this method, a simple notification of a state change to the state
manager is not enough. Data must also be passed along with the
state-change notification because a system that has modified shared
data may still be executing when another system that wants the data
is ready to do an update. This requires more memory and more
copies, so it may not be the ideal mode for all situations. Given
the extra memory operations, free step might be slower than lock
step, although this is not always the case.
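A small sketch of free step mode, under stated assumptions: each system declares how many clocks one update takes, and every notice carries its own copy of the data. The names FreeStepSystem, clocks_needed, and make_notice are hypothetical and chosen only for illustration.

```python
import copy


class FreeStepSystem:
    """Hypothetical system that needs a fixed number of clocks per update."""

    def __init__(self, name, clocks_needed):
        self.name = name
        self.clocks_needed = clocks_needed
        self.elapsed = 0
        self.updates_done = 0

    def tick(self):
        """Advance one clock; return True when an update completes."""
        self.elapsed += 1
        if self.elapsed == self.clocks_needed:
            self.elapsed = 0
            self.updates_done += 1
            return True
        return False


def make_notice(key, value):
    # Free step: the notice carries a copy of the data, because the sender
    # may keep modifying its working state before the receiver reads it.
    return {"key": key, "value": copy.deepcopy(value)}


physics = FreeStepSystem("physics", clocks_needed=1)
ai = FreeStepSystem("ai", clocks_needed=3)

position = [0.0, 0.0]  # physics' working state, mutated on every update
notices = []
for clock in range(6):
    if physics.tick():
        position[0] += 1.0
        notices.append(make_notice("position", position))
    ai.tick()

first_value = notices[0]["value"]
```

After six clocks physics has completed six updates and the AI two. The first notice still holds the position as it was when that notice was sent, even though physics has changed its working copy since; that is the extra memory and copying the text refers to.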
Lock Step Mode
With this mode all systems must complete their execution in a single
clock. Lock step mode is simpler to implement and does not require
data to be passed with the notification because systems that are
interested in a change made by another system can simply query that
system for the value (at the end of execution).
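For contrast, lock step notification might look like the sketch below, in which a notice names only the source system and the key, and observers query the source once the clock ends. LockStepSystem, query, and run_clock are hypothetical names, and the code runs single-threaded for clarity.

```python
class LockStepSystem:
    """Hypothetical system for the lock step sketch."""

    def __init__(self, name):
        self.name = name
        self.owned = {}   # data this system is responsible for
        self.mirror = {}  # local copies of data owned by other systems

    def query(self, key):
        return self.owned[key]


def run_clock(writes, observers):
    """Run one clock: apply all writes, then notify observers.

    writes: list of (system, key, value) changes made during the clock.
    observers: dict mapping a key to the systems interested in it.
    """
    notices = []
    for system, key, value in writes:          # execution phase
        system.owned[key] = value
        notices.append((system, key))          # no data travels with the notice
    for source, key in notices:                # end of the clock
        for observer in observers.get(key, []):
            observer.mirror[key] = source.query(key)
    return len(notices)


physics = LockStepSystem("physics")
graphics = LockStepSystem("graphics")
observers = {"position": [graphics]}

sent = run_clock(
    [(physics, "position", (1.0, 0.0)), (physics, "position", (2.0, 0.0))],
    observers,
)
```

Two notices are sent, but since every system has finished by the time they are processed, both queries return the final value and no copy of the data has to be stored in the queue.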
Figure 2. Execution state using the lock step mode.
The lock step mode can also implement a pseudo-free-step mode of
operation by staggering calculations across multiple steps. One use
for this might be an AI that calculates its initial "large view" goal
in the first clock and then, instead of simply repeating that calculation
in the next clock, derives a more focused goal based on the
initial goal.
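The staggering idea can be illustrated with a toy AI that alternates between a coarse pass and a refinement pass. The class name StaggeredAI, the world layout, and the threat and distance values are all invented for this sketch.

```python
class StaggeredAI:
    """Hypothetical AI that spreads goal selection across two clocks."""

    def __init__(self):
        self.clock = 0
        self.coarse_goal = None
        self.goal = None

    def step(self, world):
        if self.clock % 2 == 0:
            # Even clocks: "large view" pass, pick the most threatening region.
            self.coarse_goal = max(world, key=lambda region: world[region]["threat"])
            self.goal = self.coarse_goal
        else:
            # Odd clocks: refine the earlier result instead of recomputing it,
            # picking the nearest target inside the chosen region.
            targets = world[self.coarse_goal]["targets"]
            self.goal = (self.coarse_goal, min(targets, key=targets.get))
        self.clock += 1
        return self.goal


# Made-up world data: threat level per region, distance per target.
world = {
    "north": {"threat": 2, "targets": {"tower": 30.0, "gate": 12.5}},
    "south": {"threat": 5, "targets": {"bridge": 8.0, "camp": 20.0}},
}

ai = StaggeredAI()
first = ai.step(world)
second = ai.step(world)
```

Each call still completes within a single clock, so the system satisfies lock step, while the full decision takes two clocks to emerge.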
Data Synchronization
It is possible for multiple systems to make changes to the same
shared data. To handle this, the messaging needs some sort of mechanism
that can determine the correct value to use. Two such mechanisms
can be used:
Data values that are determined to be stale, via either mechanism, will
simply be overwritten or thrown out of the change notification queue.
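As an illustration of discarding stale values from the change notification queue, the sketch below assumes one possible rule: the most recent change to a key wins, with recency tracked by a sequence number. Both the rule and the names (resolve, seq) are assumptions made for this example, not a description of the engine's own mechanism.

```python
def resolve(notices):
    """Keep one value per key: the notice with the highest sequence number."""
    winners = {}
    for notice in notices:
        current = winners.get(notice["key"])
        if current is None or notice["seq"] > current["seq"]:
            winners[notice["key"]] = notice  # the older notice is stale
    return {key: notice["value"] for key, notice in winners.items()}


# Hypothetical queue contents: two systems changed the same position.
queue = [
    {"key": "position", "seq": 1, "source": "ai", "value": (0.0, 0.0)},
    {"key": "position", "seq": 3, "source": "physics", "value": (4.0, 1.0)},
    {"key": "health", "seq": 2, "source": "gameplay", "value": 75},
]

resolved = resolve(queue)
```

Only the surviving value for each key is delivered to the observing systems, so they never have to reconcile conflicting updates themselves.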
Using relative values for the shared data can be difficult because
some data may be order-dependent. To alleviate this problem, use
absolute data values so that when systems update their local values
they replace the old with the new. A combination of both absolute
and relative data is ideal, although its use depends on each specific
situation. For example, common data, such as position and orientation,
should be kept absolute because creating a transformation matrix
from them depends on the order in which the data are received. A
custom system that generates particles via the graphics system
and fully owns the particle information could send only relative
value updates.
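The difference between the two kinds of update can be shown with a short sketch. The helper names and the numbers are hypothetical; integer 2D points stand in for full transformation matrices.

```python
def apply_absolute(local, key, value):
    # Absolute update: the new value replaces the old one.
    local[key] = value


def apply_relative(local, key, delta):
    # Relative update: the change is added to whatever is stored locally.
    local[key] = local.get(key, 0) + delta


def rotate90(point):
    x, y = point
    return (-y, x)


def translate(point, offset=(1, 0)):
    return (point[0] + offset[0], point[1] + offset[1])


# Relative transforms are order-dependent: the same two changes applied
# in a different order leave the object in a different place.
start = (1, 0)
rotate_then_move = translate(rotate90(start))
move_then_rotate = rotate90(translate(start))

# An absolute position update gives the same result however many older
# values arrived before it.
local = {}
apply_absolute(local, "position", (9, 9))
apply_absolute(local, "position", (1, 1))

# Additive relative updates, such as a particle count owned by a single
# system, commute, so arrival order does not matter.
counts_a, counts_b = {}, {}
for delta in (5, -2):
    apply_relative(counts_a, "particles", delta)
for delta in (-2, 5):
    apply_relative(counts_b, "particles", delta)
```

This is why position and orientation are better sent as absolute values, while a quantity that one system fully owns and only increments or decrements can safely be sent as deltas.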
Reminds me of Erlang.
Can you post some test results?
Core i7 performance scaling will probably be decent (around 4x I bet) :p
But did you have the opportunity to test on a Xeon 7400 (6 cores) or on a dual Xeon (12 cores)?
Vincent