Prefetching Data and Cache Instructions
The most appealing applications for the home PC market handle growing amounts of data, whether integer or floating-point. (Just think about the amount of texture and vertex data your next title will use.) Unfortunately, most of that data is not in the caches when it's needed. Loading and storing the data to and from the caches slows down the application while it waits for the data to become available. In some cases, the data address is known ahead of time, and the data could have been fetched in advance, reducing these waiting cycles. There are ways to do this with reads today, but the methods could clearly be improved.
To address this problem, the Streaming SIMD Extensions contain new instructions dedicated to memory streaming: the prefetches and the streaming stores. Some multimedia data types, such as the 3D display list, are referenced once and aren't used again immediately. A programmer wouldn't want a game's cached code and data to be overwritten by this non-temporal data. The movntq/movntps (or streaming store) instructions let data be written directly to memory, thereby minimizing cache pollution.
For data that you know you'll use soon and often, there are the prefetch instructions. Each one lets you prefetch 32 bytes of data (a cache line on the Pentium III processor) before it's actually used. The prefetch instructions can bring data into the L1 cache, all cache levels, or all levels except L1. Table 1 shows the different uses of data prefetching.
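As a rough illustration of how prefetching and streaming stores combine, here is a minimal sketch using the compiler intrinsics from `<xmmintrin.h>` rather than raw assembly. The function name `scale_stream`, the prefetch distance of 64 floats, and the alignment and size requirements are assumptions made for this example, not values from the article; the right prefetch distance depends on the workload and should be measured.

```cpp
#include <cstddef>
#include <xmmintrin.h>  // SSE intrinsics: _mm_prefetch, _mm_stream_ps, _mm_sfence

// Hypothetical helper: writes src[i] * factor to dst[i] for `count` floats.
// Assumes src and dst are 16-byte aligned and count is a multiple of 8.
void scale_stream(const float* src, float* dst, std::size_t count, float factor)
{
    const __m128 k = _mm_set1_ps(factor);
    const std::size_t ahead = 64;  // prefetch distance in floats (illustrative, untuned)

    // 8 floats = 32 bytes, one cache line on the Pentium III processor.
    for (std::size_t i = 0; i < count; i += 8) {
        // Hint that input we will need shortly should be fetched now.
        // The NTA hint fits data that is read once and not reused.
        if (i + ahead < count) {
            _mm_prefetch(reinterpret_cast<const char*>(src + i + ahead), _MM_HINT_NTA);
        }
        // Streaming stores (movntps) write the results without filling the caches.
        _mm_stream_ps(dst + i,     _mm_mul_ps(_mm_load_ps(src + i),     k));
        _mm_stream_ps(dst + i + 4, _mm_mul_ps(_mm_load_ps(src + i + 4), k));
    }
    _mm_sfence();  // make the streaming stores visible before dst is read elsewhere
}
```

The sketch issues one prefetch per 32-byte line instead of one per load, since repeated hints for the same line do no useful work.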
Table 1 – Data Use vs. Prefetch Type
While these instructions retire quickly, they serve merely as hints to the processor, and thus won't generate any exceptions or faults. When prefetching data, it's important to remember these simple rules:
Branching
As processor pipelines get deeper, branch mispredictions become more costly. There are a couple of things you can do to deal with this problem. First, try to follow the branch prediction rules for the processor. The Pentium® III processor branch prediction rules are the same as those of the Pentium® II processor (see Sidebar 1, "Understanding the Pentium® II Processor"). Second, you can simply remove the branch where appropriate. Take the following example, where we're using logical instructions to remove a branch. The C++ version performs only a single compare, yet it still contains a branch.
C++:
    a = (a < b) ? c : d;
Assembly (register assignment inferred from the code: xmm0 = a, xmm1 = b, xmm3 = c, xmm4 = d):
    cmpps  xmm0, xmm1, 1    ; xmm0 = mask, all ones where a < b
    movaps xmm2, xmm0       ; copy the mask
    andps  xmm0, xmm3       ; keep c where the mask is set
    andnps xmm2, xmm4       ; keep d where the mask is clear
    orps   xmm0, xmm2       ; merge the two halves into the result
Or, say you simply wanted to clamp an angle. You could use either the MINPS or MAXPS instruction to apply the clamp to four values at once. Here we're using MINPS to clamp a vector to a maximum of one (1.0f).
C++:
    a = (a > b) ? b : a;
Assembly:
    minps xmm0, xmm1
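The same two branch-removal patterns can be written with compiler intrinsics instead of hand-written assembly. The following is a sketch only: the helper names `select_lt` and `clamp_max` are made up for this example, and each intrinsic is noted with the instruction it usually compiles to.

```cpp
#include <xmmintrin.h>  // SSE intrinsics

// Hypothetical helper: per-lane result = (a < b) ? c : d, with no branch.
inline __m128 select_lt(__m128 a, __m128 b, __m128 c, __m128 d)
{
    const __m128 mask = _mm_cmplt_ps(a, b);     // cmpps, predicate 1 (less-than)
    return _mm_or_ps(_mm_and_ps(mask, c),       // andps: c where mask is set
                     _mm_andnot_ps(mask, d));   // andnps: d where mask is clear
}

// Hypothetical helper: per-lane result = (a > limit) ? limit : a.
inline __m128 clamp_max(__m128 a, __m128 limit)
{
    return _mm_min_ps(a, limit);                // minps
}
```

For example, `clamp_max(angles, _mm_set1_ps(1.0f))` clamps four values to 1.0f at once, and pairing it with `_mm_max_ps` would add a lower bound in the same way.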