When hardware engineers presented us software developers with cache memory for the first time, they essentially gave us a black box. "Don't worry about what's inside of it," they assured us, "just trust us that without changing your programming habits, your programs are going to run much faster!" That was the goal: free software performance enhancement. It worked fine for a while, but over the years the rift between processor speed and memory access time grew wider, and it became obvious that by opening the black box of cache memory and exploring its contents, we'd find solutions to the widening performance gap.
The first steps toward demystifying cache memory were modest, and the tips that resulted from these explorations were vague. For instance, some of the first tips about making the best use of cache memory were non-specific suggestions such as "avoid random memory access" and "keep your data local." With the appearance of separate data and code caches, new suggestions arose, like "keep your data and code strictly separated." The "allocate on write" technique came with its own set of advice. These days, the manual prefetch is in vogue, taking advantage of dedicated instructions introduced with the latest generation of Intel, AMD, PowerPC, and MIPS processors. Unfortunately, these basic tips aren't really satisfactory if you are serious about optimizing your game. As a result, the major new code optimization challenge facing game developers today is cache memory optimization.
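To make the manual prefetch concrete, here is a minimal sketch in C. It assumes a GCC- or Clang-compatible compiler, whose `__builtin_prefetch` intrinsic maps onto the dedicated prefetch instruction of the target processor (for example, `PREFETCHT0` on Intel and AMD, or `dcbt` on PowerPC). The function name and the prefetch distance are illustrative choices, not part of any standard.

```c
#include <stddef.h>

/* How many elements ahead to prefetch; a hypothetical value that
   would need tuning against the cache line size and loop body cost. */
#define PREFETCH_AHEAD 16

long sum_array(const long *data, size_t n)
{
    long sum = 0;
    for (size_t i = 0; i < n; i++) {
        /* Hint the cache to start loading data we'll need soon.
           Arguments: address, 0 = read access, 1 = moderate locality. */
        if (i + PREFETCH_AHEAD < n)
            __builtin_prefetch(&data[i + PREFETCH_AHEAD], 0, 1);
        sum += data[i];
    }
    return sum;
}
```

The prefetch is only a hint: if the compiler or processor ignores it, the loop still computes the same result, just without the head start on memory traffic.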
Blindly applying vague suggestions like these to your game, without knowing their basis of operation, will not give you fully optimized code. In fact, it's easy to find instances where strict code optimization and cache memory optimization conflict. Let's look at an example.
Assume that you are computing a true-color image. You have the choice of representing your image using either a 24-bit RGB data structure or a 32-bit structure using an idle byte. Your first reflex would probably be to select the second solution for one simple reason: all your memory access will be 4-byte aligned, avoiding the three (at least) processor cycles needed to handle misaligned access. Usually, that’s the right choice, but if you’re using cache memory, it’s not always the way to go.
The technique you should use is implementation dependent. If you're only going to look at the data once, a 24-bit structure may be more efficient than the byte-aligned 32-bit structure, but the opposite is true if the data is being accessed multiple times. Why? Because on the first pass, the image data is absent from the cache (assuming that we're working on a processor without a prefetch instruction). Accessing the data from RAM and fetching it into the cache are expensive enough operations to make the misalignment penalty insignificant. What we're worried about during the first pass is not the shape of the data per se, but the raw amount of data that has to be moved. So to reduce the amount of data we need to fetch, we should use a 24-bit data structure as opposed to the 32-bit alternative.
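The trade-off can be sketched in a few lines of C. The two pixel layouts below are hypothetical illustrations of the choice described above: a packed 24-bit RGB structure versus a 32-bit structure with an idle pad byte. The helper functions simply measure how many bytes a first pass over the image would have to pull through the cache.

```c
#include <stdint.h>
#include <stddef.h>

/* Packed 24-bit RGB: smallest memory footprint, but pixel
   addresses are not 4-byte aligned. */
typedef struct { uint8_t r, g, b; } Pixel24;

/* 32-bit RGB with an idle pad byte: every pixel is 4-byte
   aligned, at the cost of 33% more data to move. */
typedef struct { uint8_t r, g, b, pad; } Pixel32;

/* Bytes fetched from RAM on a first (uncached) pass over a
   w-by-h true-color image, for each layout. */
static size_t image_bytes_24(size_t w, size_t h)
{
    return w * h * sizeof(Pixel24);
}

static size_t image_bytes_32(size_t w, size_t h)
{
    return w * h * sizeof(Pixel32);
}
```

For a 640x480 image, the 24-bit layout moves 921,600 bytes through the cache against 1,228,800 for the 32-bit layout, a 25 percent saving on the first pass; on subsequent, cache-resident passes, the aligned 32-bit layout wins back the advantage.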
At this point let me clarify something about optimizing your game for cache memory. The main handicap of cache memory is its lack of determinism and the difficulty we encounter when trying to tune its performance. Of course, Pentium event counters provide some useful help in this regard, but not enough to strictly optimize your code. Therefore the only real solution is, as usual, hard work. To that end I recommend purchasing a good book on the subject, such as The Cache Memory Book by Jim Handy (Academic Press, 1998).
This hard work will be rewarded, however. To convince you that the results of properly handling cache issues justify the added programming complications, let's take a look at two detailed, real-life examples. The first example will convince you of one simple thing: accessing uncached data is expensive. The second (and more interesting) example will show you that sometimes changing a very small aspect of your game can solve a huge performance problem, provided you look deeply enough into the cache structure. There isn't much actual code in these examples; they are primarily meant to illustrate that understanding and optimizing your data structures for the cache can result in significant performance improvements.