Our Properties: Gamasutra GameCareerGuide IndieGames Indie Royale GDC IGF Game Developer Magazine GAO
My Message close
Contents
Sponsored Feature: Do-it-yourself Game Task Scheduling
 
 
Printer-Friendly VersionPrinter-Friendly Version
 


Part of:



Latest News
spacer View All spacer
 
February 13, 2012
 
Road to the IGF: Bennett Foddy's GIRP
 
Release This: Twisted Metal, Uncharted: Golden Abyss premiere in North America
 
For one Kickstarter employee, Double Fine's project is literally a dream come true [37]
spacer
Latest Jobs
spacer View All     Post a Job     RSS spacer
 
February 13, 2012
 
CCP - China
Technical Director
 
Spark Unlimited, Inc.
Lead Designer
 
Spark Unlimited, Inc.
Lead Environment Artist
 
Spark Unlimited, Inc.
Lead Animator
 
Harmonix Music Systems
Production Department Manager
 
Infinity Ward / Activision
QA Manager
spacer
Latest Features
spacer View All spacer
 
February 13, 2012
 
arrow Virtual Goods - An Excerpt from Social Game Design: Monetization Methods and Mechanics [5]
 
arrow Principles of an Indie Game Bottom Feeder [29]
 
arrow Postmortem: CyberConnect 2's Solatorobo: Red the Hunter [1]
 
arrow Jerked Around by the Magic Circle - Clearing the Air Ten Years Later [48]
 
arrow Building the World of Reckoning [4]
 
arrow SPONSORED FEATURE: TwitchTV - How to Build Community Around Your Game in 2012 [13]
 
arrow Happy Action, Happy Developer: Tim Schafer on Reimagining Double Fine [9]
 
arrow Building an iOS Hit: Phase 1 [11]
spacer
Latest Blogs
spacer View All     Post     RSS spacer
 
February 13, 2012
 
Audio Passes: Success Through Layering [1]
 
The Parable of Feudal Japan [7]
 
What the current RPG can learn from Diablo 1 [8]
 
Double Fine's Kickstarter Windfall: Will Patronage Supplant Traditional Game Publishing? [11]
 
The Principles of Game Monetization [2]
spacer
About
spacer Editor-In-Chief/News Director:
Kris Graft
Features Director:
Christian Nutt
Senior Contributing Editor:
Brandon Sheffield
News Editors:
Frank Cifaldi, Tom Curtis, Mike Rose, Eric Caoili, Kris Graft
Editors-At-Large:
Leigh Alexander, Chris Morris
Advertising:
Jennifer Sulik
Recruitment:
Gina Gross
 
Feature Submissions
 
Comment Guidelines
Sponsor
Features
  Sponsored Feature: Do-it-yourself Game Task Scheduling
by Jérôme Muffat-Méridol [Programming, Visual Computing]
21 comments Share on Twitter Share on Facebook RSS
 
 
February 24, 2010 Article Start Page 1 of 3 Next
 

[In this sponsored technical article, part of Intel's Visual Computing section, Jerome Muffat-Meridol takes a look at Nulstein, his creation for low footprint in-game code task scheduling on multi-core processors.]

I attended my first demo party in 2008: Evoke in Germany. I was giving a talk about multi-core optimization in games and how to use Intel Threading Building Blocks (Intel TBB) to efficiently spread work over threads, when this question came up: Can I use this in 64K?

The rules for 64K demos are simple, "65536 bytes maximum, one self contained executable," and the results are often unbelievable. Intel TBB happens to be a really elegant and slim library but, at 200KB, it just won't do. But, I hate to say no.

Inevitably, I couldn't help but contemplate the idea of a sort of working scale model of Intel Threading Building Blocks. It would be a minimal task scheduler, something that would be easy to study, tear apart, and play with. I was on a mission!

Nulstein is the demo I created to address this need. It shows a simple but effective method for implementing task scheduling that can be adapted to most game platforms. Click this link to download the code to Nulstein.

Scheduling Tasks

If you are not familiar with task schedulers and why they are useful in games, the key lies in the difference between a thread and a task. A thread is a virtually infinite stream of operations which blocks when it needs to synchronize with another thread. A task, on the other hand, is a short stream of operations that executes a fraction of the work independently of other tasks and doesn't block.

These properties make it possible to execute as many tasks simultaneously as the processor can run physical threads, and the work of the task scheduler mainly comes down to finding a new task to start when one finishes. This becomes truly powerful when you add that a task can itself spawn new tasks, as part of its execution or as a continuation.

If the idea of splitting work in a collection of smaller tasks is straightforward, dealing with situations where a thread would normally block can be trickier. Most of the time a task can simply consume other tasks until the expected condition arises, and otherwise it is usually a simple matter of splitting the work in two tasks around the waiting point and letting the synchronization happen implicitly. But we'll come back to this later on.

Breaking work down into tasks and using a scheduler with task stealing is a convenient, powerful, and efficient way to make use of multi-core processors.

From a programming standpoint, on a system with n logical cores, Nulstein creates n-1 worker threads to assist the game's main thread with running the tasks. Each worker manages its own "pile of work," a list of tasks that are ready to run. Every time one task finishes the worker picks the next one from the top of its pile; similarly, when a task is created it is dropped directly on to the top of the pile.

This is much more efficient than having one global job queue, as each thread can work independently without any contention. But there is a catch: some piles might become empty much faster than others. In these cases, the scheduler steals the bottom half of the pile of a busy thread and gives it to a starving thread. This turns out to limit contention considerably because only two threads are impacted by the mutual exclusion necessary to carry out this operation.

 
Article Start Page 1 of 3 Next
 
Comments

Samuel Batista
profile image
Thank you so much for this. I've been meaning to learn how to structure a game engine to deal with parallelism, but I don't have much patience to read large amounts of documents, I'd rather delve into code and learn by coding.

This provides a great stepping stone not only to learn how task based tech design works, but it also contains a sample for an incredibly efficient rendering system!

Thank you again, fantastic, coding gold!

Jerome Muffat-Meridol
profile image
Glad you like it !

It's exactly the spirit of this project: share a minimal amount of code that still can serve as a basis for an engine.

I am very interested in any feedback and also suggestions on where to take this next. I'm currently putting finishing touches to a DX11 version that solves the problem of the serial render phase : then you'll have an incredibly efficient rendering system, for good :)

Brian B
profile image
The downloadable project Nulstein was flagged by my virus scanner, it contains a trojan.

Jerome Muffat-Meridol
profile image
The downloadable project doesn't actually contain a trojan, but a compiled version of the code which was compressed down to 16K using a lovely tool by demoscene group Farbrausch (kkrunchy). I presume some virus scanners will treat this as a threat because they don't know how to decompress it. If anybody would rather get the source without the executable, feel free to send me an email (jerome dot muffat dash meridol at intel dot com)

Ken Noland
profile image
Good article!

One thing I also find myself fighting with is hardware resources. For things like hard drive access, seek times can kill your frame rate if you have multiple hard drive accesses occuring simultaneously. One thing I find helps is splitting tasks when a hard drive access is required and then calling the last part of the task once the hard drive access completes. Hard drive access tasks get high priority queue access(inserted to the front or to a seperate queue altogether), but only one hard drive access task can be completed at any time. This generally satisfies the audio system and the streaming level data and keeps both systems fed with data at a fairly constant rate. Overall great article, but I just thought I should mention that :)

Jerome Muffat-Meridol
profile image
This is a very good point, Ken: a task scheduler like this one (or the one in Threading Building Blocks) doesn't work well with tasks that need waiting. On I/O, on network, on other processes or otherwise. Tasks want to be pure CPU work. The classic solution is what you mention, break in two tasks at waiting point and have another thread push the second task when appropriate. (BTW, it does work if you schedule multiple IOs at once, my photo browser works like this).

The next difficulty is to deal with the possibility that this other task may be spawned at any (unexpected time), and this can lead to a lot of contention headaches, which usually lead to the knee-jerk reaction of starting to drop mutexes all over the place (creating more contention). I reckon that in the context of a game engine, I wouldn't even hope an IO finishes in the same frame, I would go for a model where an entity would fire the IO on one frame and check for completion on subsequent frames, in some form of benign active polling: regardless of the IO having finished or not, we have to update the game state.

I might go over this in a future article, as it also leads into the difficult problem of background work...

Gregory Pakosz
profile image
Hello Jerome,

I'm curious about how much Nulstein differs from Jobswarm as both are "alternatives" to the full-fledged TBB.

Also, how does OpenMP enter the picture? Is it possible to use it instead of explicit threading?

Jerome Muffat-Meridol
profile image
JobSwarm (http://code.google.com/p/jobswarm/) is the simplest approach possible: there is one circular buffer that serves as a job queue, and worker threads pump from it as they need. What happens with this sort of configuration is that the queue soon becomes the main contention point. As the size of the jobs gets smaller, the overhead of accessing it increases. The impact is minimal in JobSwarm because of another aspect of it: only the main thread can submit work. If this is enough for your application, then this is pretty much as simple as it can get.

In TBB (and nulstein), a task can be further subdivided (i.e. it can spawn more tasks). This makes it almost trivial to cut&dice your workload: tasks spawn more tasks and split further until we have workloads that don't benefit from being split further. There are two big consequences to this feature: you can't have one big centralized queue as it would cause too much contention, you need a queue per worker thread. This leads to the second issue, imbalance: some tasks take more time than others and this implies some queues empty faster than others. The solution is "work stealing" which, in effect, ends up load-balancing the system.

The mechanism inside OpenMP is very similar, I'm going to simplify and say it's just the programming interface that is different.

And that's really what it all comes down to: whichever of these systems you use, what you're trying to do is to _not thread explicity_ very much in the same way as when you use Pascal or C, it's to avoid using GOTOs/JMPs explicity :)

(my advice: when unsure, choose TBB)

mark harris
profile image
Jerome, in CInternalForTask::Spread() you divide the task range by the number of worker threads, and then assign the remainder to the last task:

pLastTask->m_Range.end = end;

I'm wondering how often this leads to load imbalance. In the case where you have a range of 7 (granularity 1) and 4 threads, you'll end up with 3 threads with 1 subtask each, and one thread with 4. There are various ways you could make this more intelligent to better spread uneven loads, I think.

One thing I wondered was why you don't just push the subtasks and let the worker threads wake up and steal them, rather than explicitly spreading the work?

Nice article, thanks.

mark harris
profile image
-

Jerome Muffat-Meridol
profile image
Mark: Well spotted ! I feel stupid now...
In this case, we should divide the work in (2,2,2,1), not (1,1,1,4), what was I thinking about?

As of the impact, well, normally you would work with ranges much bigger than 7, and the larger the range the smaller the imbalance. Plus task stealing will come to your rescue. So probably not a very big impact.

But still... I'm fixing this for next version (it's only a matter of spreading the rest of the division correctly)

mark harris
profile image
Thanks Jerome. Can you comment on my other question?

To clarify it a bit:

One thing I wondered was why you don't just push all the subtasks to the current thread and then let the worker threads wake up and steal them, rather than explicitly spreading the work? Is there overhead in this that you are avoiding by making it explicit? It seems like just letting work stealing distribute the work for you would make for simpler, more elegant code.

Jerome Muffat-Meridol
profile image
I must say that the "simpler, more elegant code" argument is one dear to my ears...

First, let's note that TBB does what you say: it splits work recursively as it steals it. Not only does it make sense intuitively, but it also becomes a necessity when you're dealing with multi-processor cases.

So why do I bother? I do in only one case: when workers are idle and we're just getting started. On a Core i7, I'd split the "for" range 8 ways and then start up. Actual work starts immediately on all threads and in the simple cases, they all finish more or less together. Nice and simple.

If I let stealing do this for me, there is a little quirk: the thread I decide to steal from is chosen at random, mutexing its way in a loop. As the core count increases, it's more and more likely that we don't pick the largest available range and end up splitting in more chunks than necessary. So we have the contention overhead of stealing (mutexes are never totally free) and we also have the problem of working on smaller ranges (ie potentially taking the workload to a different cache).

This last one is the windmill I was attacking when writing this: less splits means less contention, means small regular workloads work faster.

Later I added another method that deals with the opposite problems: irregular workloads (because things like the update loop are like that). This resulted in the "PartialPop" method, where a thread only consumes a granular part of the available range and leaves the rest available for stealing (yes, you could call this "stealing from yourself").

This last feature made the game loop balance a lot better, at the expense of putting more pressure on the mutex and making it important to choose the granularity value carefully to avoid spending more time in PartialPop than in the task itself.

So, where is the pre-spreading useful? Only in cases where you have a parallel_for that does very simple work (ex DXT compression) and is called directly from main thread (ie when workers are idle). It's not exactly bloat, but it's not a key optimisation either.

Gregory Pakosz
profile image
Jerome, what I meant by "explicit threading" was avoiding having to port the code to something else than the windows api (I bet it would be pthread for Linux, Mac, or any POSIX system).

But I guess linking against OpenMP in order to get cross platform mutex and thread objects wasn't an option considering the 64k target size, right?

Jerome Muffat-Meridol
profile image
Hi Gregory:

If I was going to start something cross-platform, I'd use TBB. Actually, even if I was going to start something purely Windows based, I'd choose TBB (and have done so even before joining Intel). It gives you task scheduling with stealing, plus the cross platform mutex and threads, plus parallel-friendly allocators and containers, plus more... I'm not too comfortable with OpenMP, mainly because of how much it hides behind #pragma's: as a former game developer, I feel safer with solutions I know I can tweak if/when I need to port to a new platform (TBB has been ported to XBox360 by Ukrainian studio Deep Shadows, for example)

Of course, for a 64K, I'd use nulstein ;)


Ken Noland
profile image
By the way, just a clarification, when I said only one file access can be done at a time I was not talking about the OS/Hardware level, I was talking about building that into the seperate queue used to process file tasks as a limitation/feature. On systems with DVD/BluRay then seek times are horrible when doing multiple file reads so building that into your IO framework allows better loading times. If your platform doesn't have that contraint then you could distribute it and rely heavily on Quickpath Interconnect to handle the rest.

-Just a note- :)

Tomasz Mazurek
profile image
Aside from the scheduler implementation, I was absolutely delighted with the elegant idea of splitting the objects' ticks into "gather information about others" and "update your own state". It leads to the paradigm of "update only (or mostly) based on previous tick state (and inputs)". Having some education in digital circuit theory I can't help noticing how similar this is to how multiple IC's are coordinated in synchronous sequential logic - until the clock "ticks" the circuitry, only previous state can be accessed.

I find this observation interesting, since it implies that the multi-threaded programs not obeying the "update based on previous tick state" paradigm are analogues to asynchronous sequential logic. And the important fact is that asynchronous circuits are by many orders of magnitude more complicated to design, implement and test then the synchronous ones.

Jerome Muffat-Meridol
profile image
Ken,

I can't agree more about optical media seek times... Having started game development when CDs were the new cool thing (wow a whole 600+MB for assets!), and spent half the time on consoles, I do know what you mean :) Somehow, I was focused on HDD/internet access. I seem to recollect, though, that there are tricks that can let you stream from several tracks at once, but I haven't looked at this in a very long time. In any case, the important point remains that you want to keep IOs out of the scope of the task scheduler and let it deal with computing tasks only.

Tomasz,
Glad you liked it, it's actually what I consider the most important part of this work ! I really am a software person and am not as confident with hardware, let alone electronics, so I can't tell you if the comparison is accurate or not (but now that I work for a hardware company, it might be an interesting topic to share with our architects). What I can tell is that the model presented here is working but is not covering all bases: it's all nice and good for cases where an entity expects what is happening to it, but there is no provision for an entity telling others something. Examples:
- civilian: bullet says you're dead
- bottles on bar shelf: bullet says you're broken (activate and do the broken-bottle dance)

What's missing is a message passing system. As it turns out, it is quite simple to implement with the following two rules:
- entities can read messages only during pre-update
- entities can send messages only during update

Messages are defined as (recipient, message_id, data) and can be queued per thread with no contention, queues get sorted by recipient before next update and entities can then bsearch this for whatever they're interested in (recipient could be an entity id, but also other values that could represent groups, I like the idea of 0 meaning everybody).

This is all meat for another article that would talk about relationship between entities and how they can all communicate efficiently in parallel, but also about the complicated problem of birth and death (which turns out to not be that complicated in this model)

But first, we need to parallelize the draw phase: figure 3 tells us that we're updating the frame plenty fast, but drawing it is a chunky serial bit of work...

Tomasz Mazurek
profile image
If the data used by the render phase is the same that the one used by subsequent pre-update phase, then rendering could be run as a pre-render task. If that is difficult, then other method would be to use the DX11 capability of recording the command buffer by multiple threads in the draw stage, I am not sure how the sort stage would work then, but the render stage would only consist of running the accumulated command buffer, thus being pretty much independent of any data outside the command buffer and being able to run alongside both the pre-update and update phase.

Jerome Muffat-Meridol
profile image
I was only hinting at my next article looking into this topic.
(and, yes, it will involve DX11 deferred contexts. Enough said ;) )

Jerome Muffat-Meridol
profile image
The follow up to this article will be presented at GDC europe, in Cologne, on 18th august 3:20pm-4:10pm
See here:
https://www.cmpevents.com/GDCE10/a.asp?option=C&V=11&SessID=11257


none
 
Comment:
 




UBM Techweb
Game Network
Game Developers Conference | GDC Europe | GDC Online | GDC China | Gamasutra | Game Developer Magazine | Game Advertising Online
Game Career Guide | Independent Games Festival | Indie Royale | IndieGames

Other UBM TechWeb Networks
Business Technology | Business Technology Events | Telecommunications & Communications Providers

Privacy Policy | Terms of Service | Contact Us | Copyright © UBM TechWeb, All Rights Reserved.