The most
appealing applications today for the home PC market handle growing amounts
of data. Video decoders and encoders manipulate big frame buffers, 3D
games handle large textures and data structures, and speech recognition
applications demand a lot of memory. Unfortunately, most of the data that
these applications need is out of the cache when it's most needed.
Transferring
data to and from the main memory slows down the application, forcing it
to wait for the data to become available. However, in some cases, the
data address is known ahead of time and the data could have been fetched
in advance, reducing these waiting cycles. Until now, those who tried
to prefetch the data by simply reading it in advance discovered that this
trick does not work well.
With the
advent of the Streaming SIMD Extensions for the Pentium III processor,
game developers can now use cache-control instructions:
prefetches and streaming stores. However, these instructions must
be used carefully, because excessive use may not lead to the expected
speed boost, and may even slow down your game.
About the New Prefetch Instructions
The new
prefetch instructions of the Pentium III provide hints to the processor:
they suggest where in the memory hierarchy the data should be prefetched.
However, there's no guarantee that the data will be fetched. The prefetches
do not affect the functionality of your code, apart from moving data blocks
along the memory hierarchy.
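To illustrate that a prefetch is only a hint and does not change results, here is a minimal sketch in C. It assumes a compiler that provides the `_mm_prefetch` intrinsic in `<xmmintrin.h>`; the names `PREFETCH_NTA`, `sum_plain`, `sum_prefetched`, and the look-ahead distance of four cache lines are hypothetical choices made for this example, not values taken from the article. It demonstrates functional equivalence only, not tuned prefetch placement.

```c
#include <stddef.h>

#if defined(__SSE__) || defined(_M_X64) || defined(_M_IX86)
#include <xmmintrin.h>  /* _mm_prefetch, _MM_HINT_NTA */
#define PREFETCH_NTA(p) _mm_prefetch((const char *)(p), _MM_HINT_NTA)
#else
#define PREFETCH_NTA(p) ((void)(p))  /* no SSE: the hint is simply dropped */
#endif

/* Assumptions for illustration: 32-byte cache lines (8 floats per line)
   and a hypothetical look-ahead of 4 lines, to be tuned by measurement. */
#define FLOATS_PER_LINE 8
#define LINES_AHEAD 4

/* Reference loop without any hints. */
float sum_plain(const float *data, size_t n)
{
    float total = 0.0f;
    for (size_t i = 0; i < n; ++i)
        total += data[i];
    return total;
}

/* Same additions in the same order, with one hint per cache line. */
float sum_prefetched(const float *data, size_t n)
{
    const size_t ahead = (size_t)FLOATS_PER_LINE * LINES_AHEAD;
    float total = 0.0f;
    size_t i = 0;

    for (; i + FLOATS_PER_LINE <= n; i += FLOATS_PER_LINE) {
        /* Hint only at addresses that are still inside the array. */
        if (i + ahead < n)
            PREFETCH_NTA(data + i + ahead);
        for (size_t j = 0; j < FLOATS_PER_LINE; ++j)
            total += data[i + j];
    }
    for (; i < n; ++i)  /* leftover elements that do not fill a line */
        total += data[i];
    return total;
}
```

Both functions return the same value for the same input; whether the second one is faster depends on the data size and access pattern and has to be measured.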
There are
dedicated prefetches for data of temporal locality (data that will be
accessed again in the short term) and for data of non-temporal locality
(data that is accessed only once). The four prefetch instructions are
as follows:
- PrefetchNTA.
This instruction is for non-temporal data. It only fetches into the
first level cache (L1), without polluting the second level cache (L2).
- PrefetchT0.
This instruction is used for temporal data that fits into L1. It fetches
into the whole cache hierarchy of L1 and L2.
- PrefetchT1.
This instruction is for temporal data that fits into L2. It fetches into
L2 only, without polluting L1.
- PrefetchT2.
In the Pentium III processor, this instruction is implemented the same
way as PrefetchT1.
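From C, the four instructions are usually reached through the `_mm_prefetch` intrinsic and its `_MM_HINT_NTA`, `_MM_HINT_T0`, `_MM_HINT_T1`, and `_MM_HINT_T2` constants. The sketch below assumes a compiler that ships `<xmmintrin.h>`; the `PREFETCH` macro, the `locality` enum, and the `prefetch_hint` function are hypothetical names introduced only for illustration.

```c
#if defined(__SSE__) || defined(_M_X64) || defined(_M_IX86)
#include <xmmintrin.h>  /* _mm_prefetch and the _MM_HINT_* constants */
#define PREFETCH(p, hint) _mm_prefetch((const char *)(p), (hint))
#else
#define PREFETCH(p, hint) ((void)(p))  /* no SSE: hints compile to nothing */
#endif

/* Hypothetical labels for the four kinds of hint listed above. */
enum locality { LOC_NTA, LOC_T0, LOC_T1, LOC_T2 };

/* Issues the hint that matches `kind`.
   Returns 1 if `kind` is one of the four known values, 0 otherwise. */
int prefetch_hint(const void *p, enum locality kind)
{
    switch (kind) {
    case LOC_NTA: PREFETCH(p, _MM_HINT_NTA); return 1;  /* PrefetchNTA */
    case LOC_T0:  PREFETCH(p, _MM_HINT_T0);  return 1;  /* PrefetchT0  */
    case LOC_T1:  PREFETCH(p, _MM_HINT_T1);  return 1;  /* PrefetchT1  */
    case LOC_T2:  PREFETCH(p, _MM_HINT_T2);  return 1;  /* PrefetchT2  */
    }
    return 0;
}
```

In real code the hint is normally written directly at the call site, since the constant must be known at compile time; the switch here only makes the mapping explicit.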
(The Intel
Architecture Software Developer's Manual provides a complete architectural
description of the prefetch instructions, beyond their implementation
in the Pentium III processor.)
Unaligned
accesses are supported, and a prefetch never spans two cache lines. In
other words, only the single 32-byte cache line that contains
the prefetch address is fetched. If the data is already in the desired cache
level, or closer to the processor, then it is not moved.
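The arithmetic behind this can be sketched in a few lines of C. The helper names `line_base` and `lines_spanned` are hypothetical, and the 32-byte line size is the Pentium III value quoted above. The point is that an object straddling a line boundary needs two prefetches, because each prefetch covers only the line containing its address.

```c
#include <stddef.h>
#include <stdint.h>

#define CACHE_LINE 32u  /* Pentium III cache line size in bytes */

/* Address of the first byte of the cache line that contains `addr`. */
uintptr_t line_base(uintptr_t addr)
{
    return addr & ~(uintptr_t)(CACHE_LINE - 1);
}

/* Number of cache lines touched by the byte range [addr, addr + len). */
size_t lines_spanned(uintptr_t addr, size_t len)
{
    if (len == 0)
        return 0;
    uintptr_t first = line_base(addr);
    uintptr_t last = line_base(addr + len - 1);
    return (size_t)((last - first) / CACHE_LINE) + 1;
}
```

For example, a 4-byte value at address 0x101E occupies bytes 0x101E through 0x1021 and therefore touches two lines, starting at 0x1000 and 0x1020.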
You cannot
prefetch data from uncacheable ranges such as Write Combining memory, AGP memory, the video
frame buffer, and so on. Prefetches to these ranges are treated as no-operations
(NOPs).
An attempt
to prefetch an illegal address is ignored, meaning there is no data movement.
You do not get an exception. This makes pointer bugs in the prefetch address
(a wrong address to prefetch from) difficult to track: the only symptom is
that performance does not improve.
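Because the hardware stays silent, one way to catch such bugs is to check the address in software during development. The following is a minimal sketch, assuming the caller knows the bounds of the buffer it intends to prefetch from; `in_buffer`, `prefetch_checked`, and `PREFETCH_T0` are hypothetical names, and `_mm_prefetch` is assumed to come from `<xmmintrin.h>`.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#if defined(__SSE__) || defined(_M_X64) || defined(_M_IX86)
#include <xmmintrin.h>  /* _mm_prefetch, _MM_HINT_T0 */
#define PREFETCH_T0(p) _mm_prefetch((const char *)(p), _MM_HINT_T0)
#else
#define PREFETCH_T0(p) ((void)(p))  /* no SSE: the hint is dropped */
#endif

/* Returns 1 if `p` lies inside [base, base + size), 0 otherwise. */
int in_buffer(const void *p, const void *base, size_t size)
{
    uintptr_t a = (uintptr_t)p;
    uintptr_t b = (uintptr_t)base;
    return a >= b && a - b < size;
}

/* Debug builds: stop if the prefetch address is outside the buffer.
   Release builds (NDEBUG defined): the check vanishes, only the hint stays. */
void prefetch_checked(const void *p, const void *base, size_t size)
{
    assert(in_buffer(p, base, size) && "prefetch address outside buffer");
    PREFETCH_T0(p);
}
```

Code that deliberately prefetches a little past the end of a buffer would need a looser check than this one.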
Using Prefetches Efficiently
Using prefetches
efficiently is more of an art than a science, but any developer can acquire
the necessary skills. To take full advantage of the prefetches, you must
follow several simple guidelines. Following these guidelines can make
the difference between great and acceptable prefetch optimizations. The
micro-architectural reasons for such guidelines are based on the organization
of the load buffer, the store buffer and the fill buffer as described
in the Intel Architecture Software Optimization Reference Manual.
The guidelines derived from the micro-architecture are:
- Prefetch
only when the probability of a cache miss is high (use the VTune analyzer
as described in the section below titled "Analyzing
a Sample Application"). Redundant prefetches may carry some overhead,
so be thrifty with prefetch instructions.
- Avoid
prefetching the same cache line (32 bytes) more than once. Instead,
unroll your loops to handle full cache lines per iteration, making it
easier to arrange the prefetching per iteration.
- Change
your data structures so that each cache line holds as much useful data as possible. Use
the structure of arrays (SoA) format instead of the array of structures
(AoS) format to increase prefetching efficiency. Once a line has been fetched, its data is
available, so make use of it now instead of prefetching it again later.
- Spread
the prefetches among other computational instructions, spacing them out
where possible, and avoid placing them next to load instructions. This guideline
is closely related to the following one, and is based on the processor
resources that prefetches and loads share.
- Carefully
mix prefetches, streaming stores, loads and stores that miss the caches,
and stores to uncacheable memory ranges. All these instructions cause
data transactions to or from the main memory, and all of them share the
same valuable processor resource