This site is operated by a business or businesses owned by Informa PLC and all copyright resides with them. Informa PLC's registered office is 5 Howick Place, London SW1P 1WG. Registered in England and Wales. Number 8860726.
The task engine code is in TaskScheduler.h/.inl/.cpp (header, inlines, and code). CTaskPool is primarily a collection of CWorkerThread, where the bulk of the logic resides. CInternalTask is the abstract superclass for all tasks; you will use subclasses of this class in the implementation of ParallelFor and CSorter.
ParallelFor is the simplest form of parallel code: a loop where iterations can execute independently of each other. Given a range and a method to process a section, work is spread over available threads and ParallelFor returns once it has covered the full range.
CSorter implements a simple parallel merge sort, spawning new tasks for blocks bigger than a given threshold. Although the goal is to reduce code size, this is done as a C++ template to avoid the overhead of calling a function every time two items need to be compared.
There is a very convenient effect here: code using these functions can still be understood as serial code. Code around a ParallelFor executes before and after it, just as it reads.
For uses beyond simple looping and sorting, you will need to spawn your own tasks. This is quite simple too:
pTask = new CMyTask(&Flag,...);
Your specific task is implemented by CMyTask and you use Flag to track when it is done. (Note that pThread must be the current thread.) Once PushTask has been called, the task is eligible to be executed by the scheduler, or may be stolen by another thread. The current thread can continue to do other things, including pushing more tasks, until it calls WorkUntilDone. This last call will run tasks from the thread's pile, or attempt to steal from other threads, until the completion flag is set. Again, it looks as if your task had been executed serially as part of the call (and it might have, indeed).
pTask = new CMyTask(pThread->m_pCurrentCompletion,...);
In this alternative form, the task is created as a continuation, and you don't wait for it to complete as whatever waits for you will now also wait for this new task. When possible, this is a better approach as this is less synchronization work.
These basic blocks are enough to implement all sorts of parallel algorithms used in games, in a fashion that reads serially. You still have to worry about access to shared data, but you can continue to write code that works in a series of steps which remain easy to read.
Looking at what is happening inside, you see CTaskPool is the central object; it creates and holds the worker threads. Initially, these are blocked waiting on a semaphore, the scheduler is idle and consumes no CPU. As soon as the first task is submitted, it is split between all threads (if possible) and the semaphore is raised by worker_count in one step, waking all threads as close to immediately as possible.
The pool keeps track of the completion flag for this root task and workers keep running until it becomes set. Once done, all threads go idle again and a separate semaphore is used to make sure all threads are back to idle before accepting any new task. Conceptually, all workers are always in the same state: either all idle or all running.
The role of CWorkerThread is to handle tasks, which can be broken down into processing, queuing, and stealing them.
Processing - The threadproc handles the semaphores mentioned earlier and repeatedly calls DoWork(NULL) when active. This method pops tasks from the pile until there are no more, and then it tries to steal from other workers and returns if it can't find anything to steal. DoWork also can be called by WorkUntilDone if a task needs to wait for another to finish before it can continue; in this case the expected completion flag is passed as a parameter and DoWork returns as soon as it finds it set.
Queuing - Because of stealing, there is a risk of contention. Since operations on the queue require a lock, use a spinning mutex, because you need to protect only a few instructions. PushTask increments the task's completion flag and puts the task at the top of the queue. In the special case when the queue is full, run the task immediately as this produces the correct result.
It's also worth noting that if the queue is full, then other workers must be busy too or they'd be stealing from you. Tasks also get executed immediately in the special case of a single core system because there is no point in queuing work when there is nothing to steal it; the whole scheduler is bypassed and the overhead of the library disappears.
Stealing - StealTasks handles the stealing. It loops on all other workers checking if one wants to GiveUpSomeWork. If a worker has only one task queued, it will attempt to split it in two and transfer a "half-task" to the idle thread. If it didn't split or if there is more than one, it will return half the tasks (rounding up). The fact that workers are spinning on StealTasks when their queue is empty enables them to return to work as soon as a task becomes available. This is important in the context of a game where latency tends to be more important than throughput.
There isn't much more to the scheduler than that. The rest is implementation details best left to discover in the source code. But before you do that, you should know how the Nulstein demo uses the scheduler to take maximum advantage of implicit synchronization and to achieve most of the frame in parallel.