Games are some of the most performance-demanding
applications around. The scientist studying proteins or the animator working on
the next photorealistic computer-animated film can grudgingly wait for a
computation to finish; a game player cannot. Game developers have the
challenging task of squeezing as much performance as possible out of today's
This quest for performance has typically focused on graphics
tricks and optimizing low-level instructions. The increasing popularity of
multi-core CPUs in the consumer market has created an opportunity to make large
performance gains by optimizing for multi-threaded execution. Intel has created
a library called Intel Threading Building Blocks (Intel TBB) to help
achieve this goal.
This article demonstrates multiple paths to success for game
architectures that optimize with Intel TBB. The techniques described are
oriented primarily toward optimizing game architectures that already have some
threading, showing how Intel TBB can enhance the performance of these
architectures with relatively small amounts of coding effort. Even for a serial
architecture, these techniques demonstrate straightforward ways of introducing
This article is divided into three sections, ordered by
increasing coding commitment.
The first section shows techniques in which
Intel TBB provides optimization opportunities with minor coding effort and no
The second section details how Intel TBB's
efficient implementation of loop parallelism can provide performance
enhancements throughout a game architecture.
The final section demonstrates techniques for
using Intel TBB as the basis for the threading in a game architecture and shows
how to implement common threading paradigms using Intel TBB.
Applying these techniques helps a game architecture get the most
performance out of the multi-core computers on the market now and
lets it scale automatically as future hardware adds cores.
The samples presented are available in complete form as a
Microsoft Visual Studio project. Most of the samples can be ported to any
platform where Intel TBB is available.
This article refers to the performance characteristics of
these samples, as measured on a test system. Performance may vary on other
systems. The specification of the test system:
Toes in the Water with Efficient Work-Alikes
One of the easiest ways to optimize a game's architecture is
to swap in Intel TBB's high-performance implementations of standard containers
or memory allocators. Almost all game architectures use containers and allocate
memory dynamically, but the standard implementations of these facilities
can carry performance penalties when used from multiple threads. Using
Intel TBB to optimize these operations requires minimal code changes.
Intel TBB provides concurrent implementations of common
standard containers, including a vector, a queue, and a hash map
(tbb::concurrent_vector, tbb::concurrent_queue, and tbb::concurrent_hash_map).
These containers use per-element locking to reduce contention when multiple
threads access them simultaneously. When accessing standard containers from multiple threads, it is
necessary to protect write accesses with mutual exclusion. Depending upon the
exclusion mechanism and the amount of contention, this can slow down execution
Sample 1: Intel TBB containers don't require mutual exclusion
class Sample1StandardKernel: public Kernel
// access a standard container, but protect it first
s_tStandardVector[i] = i;
class Sample1TBBKernel: public Kernel
// TBB containers need no protection
Sample 1 is an example of how a game architecture might
access a standard container and an Intel TBB container. The syntax is similar,
but the standard container requires the addition of a mutual exclusion object.
The code using the Intel TBB container is faster than the code using the
standard container by a factor of 1.21 (21% faster) on the
4-core test system.
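The standard-container half of Sample 1 can be sketched with nothing but the C++ standard library. The following is a minimal illustration, not the article's kernel: the function names append_range and fill_from_threads are hypothetical, and it uses push_back instead of the indexed write shown in the sample, on the assumption that growing a std::vector from several threads is the clearest case in which the lock is required.

```cpp
#include <cstddef>
#include <functional>
#include <mutex>
#include <thread>
#include <vector>

// Hypothetical helper: appends the values [begin, end) to a shared std::vector.
// std::vector::push_back is not thread-safe, so each call holds the mutex.
void append_range(std::vector<unsigned int>& shared, std::mutex& guard,
                  unsigned int begin, unsigned int end) {
    for (unsigned int i = begin; i < end; ++i) {
        std::lock_guard<std::mutex> lock(guard);  // serializes every write
        shared.push_back(i);
    }
}

// Hypothetical driver: runs append_range on thread_count threads, each
// contributing per_thread consecutive values to the same vector.
std::vector<unsigned int> fill_from_threads(unsigned int thread_count,
                                            unsigned int per_thread) {
    std::vector<unsigned int> shared;
    std::mutex guard;
    std::vector<std::thread> workers;
    for (unsigned int t = 0; t < thread_count; ++t) {
        workers.emplace_back(append_range, std::ref(shared), std::ref(guard),
                             t * per_thread, (t + 1) * per_thread);
    }
    for (std::thread& worker : workers) {
        worker.join();
    }
    return shared;
}
```

Calling fill_from_threads(4, 1000) yields 4000 elements in an unspecified order. If the shared container were a tbb::concurrent_vector<unsigned int>, whose push_back may be called from several threads at once, the mutex and the lock_guard line could be dropped, which is the difference Sample 1 illustrates.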
Multi-threaded memory allocators
Any game that dynamically allocates memory from multiple
threads may be paying a hidden performance penalty. The standard
implementations of C-style and C++-style allocators use internal mutual
exclusion objects to allow multi-threaded access. Intel TBB provides a more
efficient multi-threaded memory allocator that maintains a heap per thread to
Sample 2: Intel TBB allocators have performance advantages
class Sample2StandardKernel: public Kernel
// allocate and deallocate some memory in the standard fashion
unsigned int *m_pBuffer = (unsigned int *) malloc(sizeof(unsigned int) * 1000);
class Sample2TBBKernel: public Kernel
// allocate and deallocate some memory in a TBB fashion
unsigned int *m_pBuffer = (unsigned int *) scalable_malloc(sizeof(unsigned int) *
Sample 2 is an example of memory allocation and deallocation
with both the standard C-style methods and with the multi-threaded Intel TBB
allocator. The only syntax difference is the name of the function. The code
using the Intel TBB allocator has a 1.17 speedup (17% faster)
relative to the standard code when run on the 4-core test system.
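Because only the function name changes, the allocation pattern can be written once and handed either allocator. The sketch below is an illustration under stated assumptions: run_alloc_kernel, AllocFn, and FreeFn are hypothetical names, and the block itself compiles against the standard library only. It relies on the fact that Intel TBB's scalable_malloc and scalable_free, declared in <tbb/scalable_allocator.h>, have the same signatures as malloc and free.

```cpp
#include <cstddef>
#include <cstdlib>

// Pointer types matching malloc/free. Intel TBB's scalable_malloc and
// scalable_free share these signatures, so they can be passed in unchanged.
using AllocFn = void* (*)(std::size_t);
using FreeFn  = void (*)(void*);

// Hypothetical kernel body: allocate a buffer, write to it, and release it.
// Returns the sum of the values written, or 0 if the allocation fails.
unsigned long long run_alloc_kernel(AllocFn alloc, FreeFn release,
                                    std::size_t count) {
    unsigned int* buffer =
        static_cast<unsigned int*>(alloc(sizeof(unsigned int) * count));
    if (buffer == nullptr) {
        return 0;
    }
    unsigned long long sum = 0;
    for (std::size_t i = 0; i < count; ++i) {
        buffer[i] = static_cast<unsigned int>(i);
        sum += buffer[i];
    }
    release(buffer);  // memory must be freed by the matching deallocator
    return sum;
}
```

The standard version is run_alloc_kernel(std::malloc, std::free, 1000). With Intel TBB installed, including <tbb/scalable_allocator.h>, linking the tbbmalloc library, and calling run_alloc_kernel(scalable_malloc, scalable_free, 1000) exercises the multi-threaded allocator instead. Memory obtained from scalable_malloc must be released with scalable_free, never with free.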