[In this sponsored technical article, part of Intel's Visual Computing section, Jerome Muffat-Meridol takes a look at Nulstein, his creation for low footprint in-game code task scheduling on multi-core processors.]
I attended my first demo party in 2008: Evoke in Germany. I was giving a talk about multi-core optimization in games and how to use Intel Threading Building Blocks (Intel TBB) to efficiently spread work over threads, when this question came up: Can I use this in 64K?
The rules for 64K demos are simple, "65536 bytes maximum, one self contained executable," and the results are often unbelievable. Intel TBB happens to be a really elegant and slim library but, at 200KB, it just won't do. But, I hate to say no.
Inevitably, I couldn't help but contemplate the idea of a sort of working scale model of Intel Threading Building Blocks. It would be a minimal task scheduler, something that would be easy to study, tear apart, and play with. I was on a mission!
Nulstein is the demo I created to address this need. It shows a simple but effective method for implementing task scheduling that can be adapted to most game platforms. Click this link to download the code to Nulstein.
If you are not familiar with task schedulers and why they are useful in games, the key lies in the difference between a thread and a task. A thread is a virtually infinite stream of operations which blocks when it needs to synchronize with another thread. A task, on the other hand, is a short stream of operations that executes a fraction of the work independently of other tasks and doesn't block.
These properties make it possible to execute as many tasks simultaneously as the processor can run physical threads, and the work of the task scheduler mainly comes down to finding a new task to start when one finishes. This becomes truly powerful when you add that a task can itself spawn new tasks, as part of its execution or as a continuation.
If the idea of splitting work in a collection of smaller tasks is straightforward, dealing with situations where a thread would normally block can be trickier. Most of the time a task can simply consume other tasks until the expected condition arises, and otherwise it is usually a simple matter of splitting the work in two tasks around the waiting point and letting the synchronization happen implicitly. But we'll come back to this later on.
Breaking work down into tasks and using a scheduler with task stealing is a convenient, powerful, and efficient way to make use of multi-core processors.
From a programming standpoint, on a system with n logical cores, Nulstein creates n-1 worker threads to assist the game's main thread with running the tasks. Each worker manages its own "pile of work," a list of tasks that are ready to run. Every time one task finishes the worker picks the next one from the top of its pile; similarly, when a task is created it is dropped directly on to the top of the pile.
This is much more efficient than having one global job queue, as each thread can work independently without any contention. But there is a catch: some piles might become empty much faster than others. In these cases, the scheduler steals the bottom half of the pile of a busy thread and gives it to a starving thread. This turns out to limit contention considerably because only two threads are impacted by the mutual exclusion necessary to carry out this operation.