Sponsored Feature: Who Moved the Goal Posts? The Rapidly Changing World of CPUs


October 19, 2009

Threading Performance Considerations

One of the more important aspects of threading for game development is the interaction of the various threads with the cache. As previously mentioned, the cache configurations on the Intel Pentium 4 processor, the Intel Core processor family, and the Intel Core i7 processor differ (Figure 9).


Figure 9: Cache layouts on different microarchitectures

The Intel Pentium 4 processor has to communicate between threads over the front-side bus, incurring a delay of at least 400-500 cycles. The Intel Core processor family allows a pair of cores to communicate through their shared L2 cache with a delay of only 20 cycles, although on a quad-core design the two pairs of cores still communicate over the front-side bus. The shared L3 cache in the Intel Core i7 processor means that going across a bus to synchronize with another core is unnecessary unless a multiple-socket system is being used.

Because cache sizes are not increasing as rapidly as the number of cores, developers need to think about how best to use the available finite resources. Techniques such as cache blocking, which limits inner loops to smaller sets of data, and streaming stores can help avoid cache-line pollution, thereby reducing cache thrashing and minimizing the communication required over the last level of shared cache.
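
As an illustration, the short C++ sketch below shows one form of cache blocking applied to a matrix transpose. The matrix size, block size, and function name are assumptions chosen for the example rather than values from any particular title; the block size would be tuned to the target processor's cache.

// A minimal sketch of cache blocking (loop tiling); the 2048x2048 matrix and
// 64-element tile are illustrative assumptions, not tuned values.
#include <cstddef>

constexpr std::size_t N     = 2048;  // matrix dimension (assumed)
constexpr std::size_t BLOCK = 64;    // tile size, tuned to the target cache (assumed)

void transpose_blocked(const float* src, float* dst)
{
    for (std::size_t bi = 0; bi < N; bi += BLOCK)
        for (std::size_t bj = 0; bj < N; bj += BLOCK)
            // Process one BLOCK x BLOCK tile at a time so the source rows and
            // destination columns touched by the inner loops stay in cache.
            for (std::size_t i = bi; i < bi + BLOCK; ++i)
                for (std::size_t j = bj; j < bj + BLOCK; ++j)
                    dst[j * N + i] = src[i * N + j];
}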

False sharing

One important potential cache pitfall is the false sharing that occurs when two threads access data that falls into the same cache line.


Figure 10. An example of false sharing.

Figure 10 shows an example of false sharing, in which the developer created a simple global object to store debug data. All four integers in the structure easily fit into a single cache line, but multiple threads access different portions of the structure. When any thread updates its own variable, such as the main render thread submitting a new draw call, it invalidates the entire cache line for every other thread accessing or updating information in that object. The cache line must be resynchronized across all the caches, which can decrease performance.

If those two threads communicate via the front-side bus of an Intel Core processor or Intel Pentium 4 processor, that simple debug update can cost as much as 400-500 cycles. Care should be taken when laying out variables so that the layout accounts for typical usage patterns. Tools such as the Intel VTune Performance Analyzer can be used to locate cache issues, including false sharing.
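
The sketch below illustrates the pattern: a hypothetical debug-statistics structure whose four counters share one cache line, followed by one possible fix that gives each counter its own line. The structure, field names, and the 64-byte line size are assumptions made for the example.

// A minimal sketch of false sharing and one way to avoid it. The structure
// and field names are hypothetical; 64 bytes is assumed as the cache line size.
#include <atomic>

// Problematic layout: all four counters fit in one cache line, so a thread
// updating its own counter invalidates the line for every other thread.
struct DebugStats {
    std::atomic<int> drawCalls;     // written by the render thread
    std::atomic<int> physicsSteps;  // written by the physics thread
    std::atomic<int> aiUpdates;     // written by the AI thread
    std::atomic<int> audioEvents;   // written by the audio thread
};

// Padded layout: each counter starts on its own 64-byte boundary, so updates
// from different threads no longer invalidate each other's cache lines.
struct DebugStatsPadded {
    alignas(64) std::atomic<int> drawCalls;
    alignas(64) std::atomic<int> physicsSteps;
    alignas(64) std::atomic<int> aiUpdates;
    alignas(64) std::atomic<int> audioEvents;
};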

Basic task queues

As an example of how cache design can significantly affect the performance of an algorithm, consider a basic task manager that distributes a collection of tasks to worker threads as they become available after completing their previous jobs. This approach is a common way of adding scalable threading to an application: the workload is broken into discrete tasks that can be executed in parallel. However, as Figure 11 shows, several issues can arise. First, every worker thread must access the same data structure as it pulls information off the queue.



Figure 11a. Simple task queue. Figure 11b. Tasks assigned to threads

Depending on how the program synchronizes the data, this delay can range from a few cycles with atomic operations on a shared cache, to a few hundred cycles over the front-side bus, to as much as a few thousand cycles when using a mutex or any other operation that requires a context switch. The overhead of pulling small tasks off the basic task queue can be quite significant and can even outweigh the benefits gained from parallelizing them.
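
For illustration, the sketch below shows one way such a queue might claim tasks with a single atomic increment rather than a mutex. The TaskQueue type and worker_loop function are hypothetical names, and the task list is assumed to be filled before the worker threads start.

// A minimal sketch of a basic task queue; TaskQueue and worker_loop are
// illustrative names, and the task list is assumed to be populated before
// any worker thread begins running.
#include <atomic>
#include <cstddef>
#include <functional>
#include <vector>

struct TaskQueue {
    std::vector<std::function<void()>> tasks;  // populated up front
    std::atomic<std::size_t> next{0};          // shared cursor over the task list

    // Each worker claims the next task with one atomic increment, which costs
    // only a few cycles when the cursor lives in a cache shared by the cores,
    // but considerably more if the update must travel over the front-side bus.
    void worker_loop()
    {
        for (;;) {
            const std::size_t i = next.fetch_add(1, std::memory_order_relaxed);
            if (i >= tasks.size())
                return;       // queue drained
            tasks[i]();       // execute the claimed task
        }
    }
};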



Figure 11c. Procedural task added to queue. Figure 11d. New task executed on different core

Another problem with this queue design is that there is no correlation between the threads and the data processed on them. Tasks that require the same data can be executed on different threads, increasing memory bandwidth and potentially causing sharing issues similar to those outlined above. Problems can also occur when developers put procedurally generated tasks (Figures 11b and 11c) back into the queue without correlating the thread the new task executes on with the thread that ran its parent job.

If the new task happens to land on the same core and thread, the developer is fortunate: much of the data it requires may still reside in the cache from the previous task. Otherwise, that data must be fetched again, with the delays mentioned above (Figure 11d). And as the number of available threads increases, the complexity of such a system grows dramatically and the chances of "lucky" cache hits decrease.
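
One way to improve the odds is to keep child tasks on the thread that produced their parent. The sketch below shows a per-thread queue arrangement along those lines; the WorkerQueues type and its functions are hypothetical, and a real task system would also need stealing or rebalancing when a queue runs dry.

// A minimal sketch of per-thread task queues that keep procedurally generated
// child tasks on the thread that ran their parent; names are illustrative.
#include <cstddef>
#include <deque>
#include <functional>
#include <vector>

struct WorkerQueues {
    std::vector<std::deque<std::function<void()>>> perThread;

    explicit WorkerQueues(std::size_t threadCount) : perThread(threadCount) {}

    // A task spawned by a job on thread `parent` goes onto that thread's own
    // queue, so it will normally run where its input data is still cached.
    void spawn_child(std::size_t parent, std::function<void()> task)
    {
        perThread[parent].push_back(std::move(task));
    }

    // Workers drain their own queue first; a production system would steal
    // from other queues when this one is empty (not shown here).
    bool run_one(std::size_t self)
    {
        if (perThread[self].empty())
            return false;
        auto task = std::move(perThread[self].front());
        perThread[self].pop_front();
        task();
        return true;
    }
};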

