Gamasutra: The Art & Business of Making Games
Sponsored Feature: Who Moved the Goal Posts? The Rapidly Changing World of CPUs
October 19, 2009 (Page 7 of 7)

Processor Topology Pitfalls

One seemingly straightforward way to program and plan for these different cache hierarchies and Intel HT Technology scenarios is to find the topology of the specific processor on the user's system and then tailor the algorithms to specifically target the exact hardware. The problem lies in exactly defining a processor topology even with knowledge of the CPUID (derived from CPU IDentification) specifications.

The use of CPUID has changed over time as additional leaves have been added. A common but incorrect way to calculate the number of available processors in the past was to use CPUID leaf 4. This accidentally worked on previous architectures because leaf 4 reports the maximum number of addressable IDs in the physical package, and on earlier hardware this was always the same as the actual number of processors. However, that assumption doesn't hold true on the Intel Core i7 processor.

Figure 16. Testing for exact processor topology is very complex.

There are ways to make CPUID work to your advantage; however, the detection is a long, multistep process (Figure 16). Several good white papers with concrete examples of CPUID detection algorithms can be found on Intel's Web site. Once in place, the CPUID information can tell you not only how many cores the user's processor has but also which cores share which caches. This opens up the possibility of tailoring algorithms to specific architectures, though the performance and reliability of the application can easily suffer if corners are cut during this process.

Thread affinity

Even more complications arise when using thread affinitization in a game engine. Masks generated by topology enumeration can vary with the OS, the OS version (x64 versus x86), and even between service packs, which makes setting thread affinity a risky thing to do in a program. The order in which cores appear in a CPUID enumeration on the Intel Pentium 4 processor is completely different from how they appear on the Intel Core i7 processor, and the Intel Core i7 processor can show up in reverse order between OSs. Unless these differences are accounted for through careful enumeration and algorithm design, affinitizing the code will likely cause more headaches than it prevents.

Developers must also consider the entire software environment of the game and the OS; middleware within an application can create its own threads and may not provide an API that allows them to be scheduled well with the user's code. Graphics drivers also spawn their own threads that are assigned to processor cores by the OS, and thus the work that went into properly coding the affinity for the game threads can be polluted. Setting the affinity in the application is basically betting on short-term gains against a world of potential headaches down the development road.

If hard-coded thread affinity is combined with incorrect CPUID information, application performance will in fact get hit twice. In the worst-case scenario, applications can fail to run on newer hardware because of assumptions made on previous architectures. For gaming, the advice is to avoid thread affinity: it provides a short-term solution that can cause long-term headaches if things go wrong, far outweighing any gains. Instead, use processor hints such as SetIdealProcessor rather than a hard binding.

If a developer decides to go down this road, a safety feature should be built in allowing the user to disable affinitization if future processor architectures and hardware greatly impact the performance or stability, or better still allow an opt-in approach and ensure the application performs as well as possible without affinitization in its default configuration. If the software requires affinitization to run, there is probably a nasty bug waiting to happen on some combination of the hardware and software environments.

OS scheduling

Each version of the Microsoft Windows* OS handles thread scheduling slightly differently; even Windows 7 has improvements over how Windows Vista* handles it. The scheduler is a priority-based, round-robin system designed to be fair to all tasks currently running on the system. The time allotment Windows gives a thread to execute before it is pushed out is actually quite large, around 20 to 30 milliseconds, unless an event triggers a voluntary switch. For a game this is a very large span of time, potentially lasting more than a frame.

Given the coarse-grained OS scheduling, it's important for a game to correctly schedule its own threads to ensure tasks execute for an appropriate amount of time. Incorrect thread synchronization can mean that minor background or low-priority threads run far longer than expected, significantly affecting the game's critical threads. Unfortunately, too much synchronization can also be a bad thing, because switching between threads costs thousands of cycles.

Too many context switches between threads sharing data can lower performance. To avoid the performance penalty of context switching, it's beneficial to execute a short spin-lock on resources that are likely to be used again very soon. Of course, performance degradation will occur if the waits are too long, so monitoring thread locks closely is important.

Figure 17. Self monitoring the threads allows easy location of performance bottlenecks.

Figure 17 shows a way to enumerate the threads created by the application and by any middleware it uses. Vista introduced a few new API additions that, once the threads are enumerated, can tell the developer how much time a particular thread has spent on a particular core, how long it has been running, and so on. Taking this into consideration, it would be easy to build an on-screen diagnostic breakdown of threading performance for development or even for end users.

What Lies Ahead

Two interesting technologies come to mind when thinking about the future of multi-threading, microarchitectures, and gaming. The next "tock," the Sandy Bridge architecture, will introduce the Intel Advanced Vector Extensions (Intel AVX). Expanding on the 128-bit instructions in Intel SSE4 today, Intel AVX will offer 256-bit instructions that are drop-in replacements for current 128-bit SSE instructions. In much the same way that current processors supporting 64-bit registers can run 32-bit code by using half the register space, AVX-capable processors will support Intel SSE instructions by using half of the registers.

Vectors will run eight-wide rather than the four-wide that architectures can handle today, thus increasing the potential for low-level data-level parallelism and providing improvements for gaming performance. Sandy Bridge will also provide higher core counts and better memory performance allowing for more threading at a fine-grained task-based level.

Figure 18. Larrabee will bring processor-type architectures to the graphics community.

The multi-core Larrabee architecture (Figure 18) is the other upcoming technology of interest. Though positioned as a GPU, the design essentially takes the current multi-core processor to the next level. Everything mentioned in regard to threading performance, programming models, and affinity will likely apply to this architecture, though scaled to a much higher core count. Developers planning for the future should consider how to scale beyond the Intel Core i7 processor and even Sandy Bridge, and look at Larrabee in terms of cache and communication between threads.

Closing Thoughts

The desktop processor is evolving at a rapid rate, and although the performance enhancements are offering developers a huge opportunity to improve realism and fidelity, they have also created new problems requiring different programming models. Developers can no longer make the assumption that the code they write today will automatically run better on next year's hardware. Instead they need to plan for upcoming hardware architectures and write code that easily adapts to the changing processor environments, especially in regard to the number of available threads.

One side effect of the evolving hardware is the increasing importance of performance testing. Using the correct tools and developing a structured performance testing suite with defined workloads is critical to making sure an application performs its best on as wide a range of hardware as possible.

Testing early in a game's design cycle on a variety of hardware is the best possible way to ensure design decisions offer the maximum long-term performance. Many of the issues outlined in this article can be identified using tools such as the Intel VTune Performance Analyzer, which can monitor cache behavior and instruction throughput in a particular thread; in addition, tools such as the Intel Thread Profiler are designed to measure thread concurrency.

Optimization guidelines for the processors mentioned in this article, as well as other Intel processors, can be found on Intel's Web site.
