Knee Deep with Loop Parallelism
Thus far, the focus has been on optimizations that leave the
original code structure almost completely intact. The next level of
optimizations requires more significant localized code changes. In return for
these modifications, Intel TBB can provide considerably enhanced performance on
multi-core processors.
parallel_for
Like most computationally intensive programs, games make
heavy use of loops. Loops are a natural opportunity to optimize execution on a
multi-core system. Intel TBB provides a scheme for parallelizing loops with a
simple API. On multi-core hardware, a loop parallelized this way can run
substantially faster than the unmodified, serial loop. Even when the original
code has already been parallelized, Intel TBB's implementation can sometimes
provide additional performance benefits due to its efficient use of hardware
resources.
Sample 3
void doSerialStandardTest(double *aLoopTimes, const Kernel *pKernel)
{
    ...
    // don't use a thread at all
    pKernel->process(0, kiReps);
    ...
}

void doParallelForTBBTest(double *aLoopTimes, const Kernel *pKernel)
{
    ...
    TBBKernelWrapper tWrapper(pKernel);
    tbb::parallel_for(
        tbb::blocked_range<int>(0, kiReps),
        tWrapper
    );
    ...
}
Sample 3 shows how Intel TBB's parallel_for function can be
applied to a loop for significant performance benefits on multi-core CPUs. The
code using parallel_for shows a near-linear speedup of 3.98x
relative to the serial loop when run on the 4-core system.
Other parallel loop patterns
In addition to parallel_for, Intel TBB provides functions for
parallelizing other kinds of loops. The function parallel_reduce handles loops
that combine results from multiple iterations. The function parallel_do
handles iterator-based loops. There are also functions for sorting,
pipelined execution, and other loop-like operations.
All-in with Generalized Task Parallelism
The techniques demonstrated in the first two sections are
appropriate for developers looking to use Intel TBB piecemeal and to achieve
modest performance gains as a result. Even greater performance gains are
possible when Intel TBB is used as the foundation of a game's threading
architecture. Building on a single foundation ensures that any explicit
functional parallelism and the data parallelism supported by Intel TBB use the
same threads, which avoids oversubscription and maximizes scalability. The
techniques in the third section show more ambitious ways of using Intel TBB
that can help realize these performance gains.
These examples use a low-level API in Intel TBB, called the
task scheduler API. The high-level API in Sample 3 uses this low-level API
internally. The task scheduler API allows code to directly manipulate the work
trees that Intel TBB uses to represent parallel work. Manipulation of these
work trees is necessary when implementing explicit functional parallelism and
other techniques that go beyond simple data parallelism.
Figure A: A visual representation of an Intel TBB work tree
Figure A shows a visualization of a work tree being executed
by Intel TBB. Each tree has only one root, although Intel TBB can process
multiple trees simultaneously. Execution starts with a call that spawns the
root and then waits for the root to complete. The root of this
tree creates one child task. This task can call arbitrary pre-processing code
before optionally creating and executing more children and finally calling
post-processing code. When the child task completes, control passes back to the
root, which also completes, and the original wait is finally over. Diagrams of
this type will be used to illustrate the techniques in the following samples.
Intel TBB is utilized more heavily in the following
examples, but this does not imply that an existing threading API in a game must
change to reflect the paradigm of the task scheduler API. In most cases it is
possible to plumb the task scheduler API in underneath an existing API. The
following examples show how Intel TBB can support some threading paradigms commonly
used in games.