Pentium III Prefetch Optimizations
The most appealing applications in today's home PC market handle growing amounts of data. Video decoders and encoders manipulate big frame buffers, 3D games handle large textures and data structures, and speech recognition applications demand a lot of memory. Unfortunately, most of the data these applications need is not in the cache when it is needed most. Transferring data to and from main memory slows the application down, forcing it to wait for the data to become available. In some cases, however, the data address is known ahead of time and the data could have been fetched in advance, reducing these waiting cycles. Until now, those who tried to prefetch data by simply reading it in advance discovered that this trick does not work well.

With the advent of the Streaming SIMD Extensions for the Pentium III processor, game developers can now use instructions that control the cache: prefetches and streaming stores. These instructions must be used carefully, however, because excessive use may not deliver the expected speed boost and may even slow down your game.

About the New Prefetch Instructions

The new prefetch instructions of the Pentium III provide hints to the processor: they suggest where in the memory hierarchy the data should be prefetched. However, there is no guarantee that the data will be fetched. The prefetches do not affect the functionality of your code, apart from moving data blocks along the memory hierarchy. There are dedicated prefetches for data with temporal locality (data that will be accessed again in the short term) and for data with non-temporal locality (data that is accessed only once). The four prefetch instructions are as follows:
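In the SSE instruction set these hints are the `prefetcht0`, `prefetcht1`, `prefetcht2`, and `prefetchnta` instructions. From C they are usually issued through the `_mm_prefetch` intrinsic in `<xmmintrin.h>`, with the hint constants `_MM_HINT_T0`, `_MM_HINT_T1`, `_MM_HINT_T2`, and `_MM_HINT_NTA`. The following is a minimal sketch, not a tuned implementation. The function name `sum_streamed`, the macro `PREFETCH_NTA`, and the prefetch distance `PREFETCH_AHEAD` are hypothetical choices made for illustration, and the right distance has to be found by measurement.

```c
#include <assert.h>
#include <stddef.h>

#if defined(__SSE__) || defined(_M_X64) || (defined(_M_IX86_FP) && _M_IX86_FP >= 1)
#include <xmmintrin.h>  /* _mm_prefetch and the _MM_HINT_* constants */
/* Non-temporal hint: the data is expected to be read only once. */
#define PREFETCH_NTA(p) _mm_prefetch((const char *)(p), _MM_HINT_NTA)
#else
/* Assumption: on targets without SSE the hint is simply dropped. */
#define PREFETCH_NTA(p) ((void)(p))
#endif

/* Hypothetical prefetch distance, in floats (16 floats = 64 bytes ahead). */
#define PREFETCH_AHEAD 16

/* 8 floats fill one 32-byte cache line, so one hint per 8 elements is enough. */
#define FLOATS_PER_LINE 8

/* Sums an array that is read exactly once, hinting at upcoming cache lines. */
float sum_streamed(const float *data, size_t n)
{
    assert(data != NULL || n == 0);

    float total = 0.0f;
    for (size_t i = 0; i < n; ++i) {
        /* Issue roughly one hint per cache line and stay inside the array. */
        if ((i % FLOATS_PER_LINE) == 0 && i + PREFETCH_AHEAD < n) {
            PREFETCH_NTA(&data[i + PREFETCH_AHEAD]);
        }
        total += data[i];
    }
    return total;
}
```

Because the prefetch is only a hint, removing the `PREFETCH_NTA` line changes nothing about the result. Only the timing can differ, which is why the effect has to be measured rather than assumed.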
(The Intel Architecture Software Developer's Manual provides a complete architectural description of the prefetch instructions, beyond their implementation in the Pentium III processor.)

Unaligned accesses are supported, and a prefetch does not split cache lines. In other words, only one 32-byte cache line is fetched: the one that contains the prefetch address. If the data is already in the desired cache level, or closer to the processor, it is not moved.

You cannot prefetch data from uncacheable ranges such as Write Combining, AGP, the video frame buffer, and so on; such prefetches are treated as no-operations (NOPs). An attempt to prefetch an illegal address is ignored: no data is moved and no exception is raised. As a result, pointer bugs in the prefetch address (a wrong address to prefetch from) are difficult to track down, because their only symptom is that performance does not improve.

Using prefetches efficiently is more of an art than a science, but any developer can acquire the necessary skills. To take full advantage of the prefetches, you must follow several simple guidelines; following them can make the difference between great and merely acceptable prefetch optimizations. The micro-architectural reasons for these guidelines stem from the organization of the load buffer, the store buffer, and the fill buffer, as described in the Intel Architecture Software Optimization Reference Manual. The guidelines derived from the micro-architecture are:
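As an aside on the single-line behavior described earlier in this section: since a prefetch brings in only the one 32-byte line that contains its address, a buffer that straddles a line boundary needs more than one hint. The following is a minimal sketch of that arithmetic. The helper names `line_base` and `lines_touched` are hypothetical, and the 32-byte line size is the figure given here for the Pentium III, so it is an assumption for any other processor.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Cache line size stated for the Pentium III; must be a power of two. */
#define CACHE_LINE 32u

/* Start address of the cache line that contains addr. */
uintptr_t line_base(uintptr_t addr)
{
    return addr & ~(uintptr_t)(CACHE_LINE - 1u);
}

/* Number of cache lines, and therefore prefetch hints, needed to cover
   size bytes starting at addr. A zero-sized range is a caller error. */
size_t lines_touched(uintptr_t addr, size_t size)
{
    assert(size > 0);

    uintptr_t first = line_base(addr);
    uintptr_t last  = line_base(addr + (uintptr_t)(size - 1));
    return (size_t)((last - first) / CACHE_LINE) + 1u;
}
```

For example, a 16-byte structure at offset 24 spans the lines starting at 0 and 32, so `lines_touched(24, 16)` yields 2, while the same structure at offset 0 needs a single prefetch.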