|
[In this sponsored technical article, part of Intel's Visual Computing section, Jerome Muffat-Meridol takes a look at Nulstein, his creation for low footprint in-game code task scheduling on multi-core processors.]
I attended my first demo party in 2008: Evoke in Germany. I
was giving a talk about multi-core optimization in games and how to use Intel
Threading Building Blocks (Intel TBB) to efficiently spread work over threads,
when this question came up: Can I use this in 64K?
The rules for 64K demos are
simple, "65536 bytes maximum, one self contained executable," and the results
are often unbelievable. Intel TBB happens to be a really elegant and slim
library but, at 200KB, it just won't do. But, I hate to say no.
Inevitably, I
couldn't help but contemplate the idea of a sort of working scale model of
Intel Threading Building Blocks. It would be a minimal task scheduler,
something that would be easy to study, tear apart, and play with. I was on a
mission!
Nulstein is the demo I created to address this need. It
shows a simple but effective method for implementing task scheduling that can
be adapted to most game platforms. Click this link to download the
code to Nulstein.
Scheduling Tasks
If you are not familiar with task schedulers and why they
are useful in games, the key lies in the difference between a thread and a
task. A thread is a virtually infinite stream of operations which blocks when
it needs to synchronize with another thread. A task, on the other hand, is a
short stream of operations that executes a fraction of the work independently
of other tasks and doesn't block.
These properties make it possible to execute
as many tasks simultaneously as the processor can run physical threads, and the
work of the task scheduler mainly comes down to finding a new task to start
when one finishes. This becomes truly powerful when you add that a task can
itself spawn new tasks, as part of its execution or as a continuation.
If the
idea of splitting work in a collection of smaller tasks is straightforward,
dealing with situations where a thread would normally block can be trickier.
Most of the time a task can simply consume other tasks until the expected
condition arises, and otherwise it is usually a simple matter of splitting the
work in two tasks around the waiting point and letting the synchronization
happen implicitly. But we'll come
back to this later on.
Breaking work down into tasks and using a scheduler with
task stealing is a convenient, powerful, and efficient way to make use of multi-core
processors.
From a programming standpoint, on a system with n logical cores, Nulstein creates n-1 worker threads to assist the game's
main thread with running the tasks. Each worker manages its own "pile of work,"
a list of tasks that are ready to run. Every time one task finishes the worker
picks the next one from the top of its pile; similarly, when a task is created
it is dropped directly on to the top of the pile.
This is much more efficient
than having one global job queue, as each thread can work independently without
any contention. But there is a catch: some piles might become empty much faster
than others. In these cases, the scheduler steals the bottom half of the pile
of a busy thread and gives it to a starving thread. This turns out to limit
contention considerably because only two threads are impacted by the mutual
exclusion necessary to carry out this operation.
|
This provides a great stepping stone not only to learn how task based tech design works, but it also contains a sample for an incredibly efficient rendering system!
Thank you again, fantastic, coding gold!
It's exactly the spirit of this project: share a minimal amount of code that still can serve as a basis for an engine.
I am very interested in any feedback and also suggestions on where to take this next. I'm currently putting finishing touches to a DX11 version that solves the problem of the serial render phase : then you'll have an incredibly efficient rendering system, for good :)
One thing I also find myself fighting with is hardware resources. For things like hard drive access, seek times can kill your frame rate if you have multiple hard drive accesses occuring simultaneously. One thing I find helps is splitting tasks when a hard drive access is required and then calling the last part of the task once the hard drive access completes. Hard drive access tasks get high priority queue access(inserted to the front or to a seperate queue altogether), but only one hard drive access task can be completed at any time. This generally satisfies the audio system and the streaming level data and keeps both systems fed with data at a fairly constant rate. Overall great article, but I just thought I should mention that :)
The next difficulty is to deal with the possibility that this other task may be spawned at any (unexpected time), and this can lead to a lot of contention headaches, which usually lead to the knee-jerk reaction of starting to drop mutexes all over the place (creating more contention). I reckon that in the context of a game engine, I wouldn't even hope an IO finishes in the same frame, I would go for a model where an entity would fire the IO on one frame and check for completion on subsequent frames, in some form of benign active polling: regardless of the IO having finished or not, we have to update the game state.
I might go over this in a future article, as it also leads into the difficult problem of background work...
I'm curious about how much Nulstein differs from Jobswarm as both are "alternatives" to the full-fledged TBB.
Also, how does OpenMP enter the picture? Is it possible to use it instead of explicit threading?
In TBB (and nulstein), a task can be further subdivided (i.e. it can spawn more tasks). This makes it almost trivial to cut&dice your workload: tasks spawn more tasks and split further until we have workloads that don't benefit from being split further. There are two big consequences to this feature: you can't have one big centralized queue as it would cause too much contention, you need a queue per worker thread. This leads to the second issue, imbalance: some tasks take more time than others and this implies some queues empty faster than others. The solution is "work stealing" which, in effect, ends up load-balancing the system.
The mechanism inside OpenMP is very similar, I'm going to simplify and say it's just the programming interface that is different.
And that's really what it all comes down to: whichever of these systems you use, what you're trying to do is to _not thread explicity_ very much in the same way as when you use Pascal or C, it's to avoid using GOTOs/JMPs explicity :)
(my advice: when unsure, choose TBB)
pLastTask->m_Range.end = end;
I'm wondering how often this leads to load imbalance. In the case where you have a range of 7 (granularity 1) and 4 threads, you'll end up with 3 threads with 1 subtask each, and one thread with 4. There are various ways you could make this more intelligent to better spread uneven loads, I think.
One thing I wondered was why you don't just push the subtasks and let the worker threads wake up and steal them, rather than explicitly spreading the work?
Nice article, thanks.
In this case, we should divide the work in (2,2,2,1), not (1,1,1,4), what was I thinking about?
As of the impact, well, normally you would work with ranges much bigger than 7, and the larger the range the smaller the imbalance. Plus task stealing will come to your rescue. So probably not a very big impact.
But still... I'm fixing this for next version (it's only a matter of spreading the rest of the division correctly)
To clarify it a bit:
One thing I wondered was why you don't just push all the subtasks to the current thread and then let the worker threads wake up and steal them, rather than explicitly spreading the work? Is there overhead in this that you are avoiding by making it explicit? It seems like just letting work stealing distribute the work for you would make for simpler, more elegant code.
First, let's note that TBB does what you say: it splits work recursively as it steals it. Not only does it make sense intuitively, but it also becomes a necessity when you're dealing with multi-processor cases.
So why do I bother? I do in only one case: when workers are idle and we're just getting started. On a Core i7, I'd split the "for" range 8 ways and then start up. Actual work starts immediately on all threads and in the simple cases, they all finish more or less together. Nice and simple.
If I let stealing do this for me, there is a little quirk: the thread I decide to steal from is chosen at random, mutexing its way in a loop. As the core count increases, it's more and more likely that we don't pick the largest available range and end up splitting in more chunks than necessary. So we have the contention overhead of stealing (mutexes are never totally free) and we also have the problem of working on smaller ranges (ie potentially taking the workload to a different cache).
This last one is the windmill I was attacking when writing this: less splits means less contention, means small regular workloads work faster.
Later I added another method that deals with the opposite problems: irregular workloads (because things like the update loop are like that). This resulted in the "PartialPop" method, where a thread only consumes a granular part of the available range and leaves the rest available for stealing (yes, you could call this "stealing from yourself").
This last feature made the game loop balance a lot better, at the expense of putting more pressure on the mutex and making it important to choose the granularity value carefully to avoid spending more time in PartialPop than in the task itself.
So, where is the pre-spreading useful? Only in cases where you have a parallel_for that does very simple work (ex DXT compression) and is called directly from main thread (ie when workers are idle). It's not exactly bloat, but it's not a key optimisation either.
But I guess linking against OpenMP in order to get cross platform mutex and thread objects wasn't an option considering the 64k target size, right?
If I was going to start something cross-platform, I'd use TBB. Actually, even if I was going to start something purely Windows based, I'd choose TBB (and have done so even before joining Intel). It gives you task scheduling with stealing, plus the cross platform mutex and threads, plus parallel-friendly allocators and containers, plus more... I'm not too comfortable with OpenMP, mainly because of how much it hides behind #pragma's: as a former game developer, I feel safer with solutions I know I can tweak if/when I need to port to a new platform (TBB has been ported to XBox360 by Ukrainian studio Deep Shadows, for example)
Of course, for a 64K, I'd use nulstein ;)
-Just a note- :)
I find this observation interesting, since it implies that the multi-threaded programs not obeying the "update based on previous tick state" paradigm are analogues to asynchronous sequential logic. And the important fact is that asynchronous circuits are by many orders of magnitude more complicated to design, implement and test then the synchronous ones.
I can't agree more about optical media seek times... Having started game development when CDs were the new cool thing (wow a whole 600+MB for assets!), and spent half the time on consoles, I do know what you mean :) Somehow, I was focused on HDD/internet access. I seem to recollect, though, that there are tricks that can let you stream from several tracks at once, but I haven't looked at this in a very long time. In any case, the important point remains that you want to keep IOs out of the scope of the task scheduler and let it deal with computing tasks only.
Tomasz,
Glad you liked it, it's actually what I consider the most important part of this work ! I really am a software person and am not as confident with hardware, let alone electronics, so I can't tell you if the comparison is accurate or not (but now that I work for a hardware company, it might be an interesting topic to share with our architects). What I can tell is that the model presented here is working but is not covering all bases: it's all nice and good for cases where an entity expects what is happening to it, but there is no provision for an entity telling others something. Examples:
- civilian: bullet says you're dead
- bottles on bar shelf: bullet says you're broken (activate and do the broken-bottle dance)
What's missing is a message passing system. As it turns out, it is quite simple to implement with the following two rules:
- entities can read messages only during pre-update
- entities can send messages only during update
Messages are defined as (recipient, message_id, data) and can be queued per thread with no contention, queues get sorted by recipient before next update and entities can then bsearch this for whatever they're interested in (recipient could be an entity id, but also other values that could represent groups, I like the idea of 0 meaning everybody).
This is all meat for another article that would talk about relationship between entities and how they can all communicate efficiently in parallel, but also about the complicated problem of birth and death (which turns out to not be that complicated in this model)
But first, we need to parallelize the draw phase: figure 3 tells us that we're updating the frame plenty fast, but drawing it is a chunky serial bit of work...
(and, yes, it will involve DX11 deferred contexts. Enough said ;) )
See here:
https://www.cmpevents.com/GDCE10/a.asp?option=C&V=11&SessID=11257