By Ornit Gross
Gamasutra
July 30, 1999


Pentium III Prefetch Optimizations
Using the VTune Performance Analyzer

Contents

Introduction

Analyzing a Sample Application

Choose the Right Prefetch Parameters

Other Features and Tools

The most appealing applications today for the home PC market handle growing amounts of data. Video decoders and encoders manipulate big frame buffers, 3D games handle large textures and data structures, and speech recognition applications demand a lot of memory. Unfortunately, most of the data that these applications need is out of the cache when it's most needed.

Transferring data to and from the main memory slows down the application, forcing it to wait for the data to become available. However, in some cases, the data address is known ahead of time and the data could have been fetched in advance, reducing these waiting cycles. Until now, those who have tried to prefetch the data by simply reading it in advance discovered that this trick does not work well.

With the advent of the Streaming SIMD Extensions for the Pentium III processor, game developers can now use instructions that control the cache directly: prefetches and streaming stores. However, these instructions must be used carefully. Excessive use may not deliver the expected speed boost, and may even slow down your game.

About the New Prefetch Instructions

The new prefetch instructions of the Pentium III provide hints to the processor: they suggest the level of the memory hierarchy into which the data should be prefetched. However, there's no guarantee that the data will be fetched. The prefetches do not affect the functionality of your code, apart from moving data blocks along the memory hierarchy.

There are dedicated prefetches for data of temporal locality (data that will be accessed again in the short term) and for data of non-temporal locality (data that is accessed only once). The four prefetch instructions are as follows:

  1. PrefetchNTA. This instruction is for non-temporal data. It only fetches into the first level cache (L1), without polluting the second level cache (L2).
  2. PrefetchT0. This instruction is used for temporal data that fits into L1. It fetches into the whole cache hierarchy of L1 and L2.
  3. PrefetchT1. This instruction is for temporal data that fits into L2. It fetches into L2 only, without polluting L1.
  4. PrefetchT2. In the Pentium III processor, this instruction is implemented the same way as PrefetchT1.

(The Intel Architecture Software Developer's Manual provides a complete architectural description of the prefetch instructions, beyond their implementation in the Pentium III processor.)
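As a rough illustration of how the four hints are reached from C, the sketch below uses the `_mm_prefetch` intrinsic and the `_MM_HINT_*` constants from `<xmmintrin.h>`. The `PREFETCH_*` macros, the `CACHE_LINE` constant, and the `warm_table` function are hypothetical names introduced for this example, and the function assumes the table starts on a 32-byte boundary. On compilers without SSE support the macros fall back to no-ops, which is safe because a prefetch is only a hint.

```c
#include <assert.h>
#include <stddef.h>

#if defined(__SSE__) || defined(_M_X64) || (defined(_M_IX86_FP) && _M_IX86_FP >= 1)
#include <xmmintrin.h>   /* _mm_prefetch and the _MM_HINT_* constants */
#define PREFETCH_NTA(p) _mm_prefetch((const char *)(p), _MM_HINT_NTA)
#define PREFETCH_T0(p)  _mm_prefetch((const char *)(p), _MM_HINT_T0)
#define PREFETCH_T1(p)  _mm_prefetch((const char *)(p), _MM_HINT_T1)
#define PREFETCH_T2(p)  _mm_prefetch((const char *)(p), _MM_HINT_T2)
#else
/* No SSE available: a prefetch is only a hint, so a no-op keeps the code correct. */
#define PREFETCH_NTA(p) ((void)(p))
#define PREFETCH_T0(p)  ((void)(p))
#define PREFETCH_T1(p)  ((void)(p))
#define PREFETCH_T2(p)  ((void)(p))
#endif

#define CACHE_LINE 32   /* Pentium III cache line size in bytes */

/* Hints a small, frequently reused table into L1 and L2 (temporal data),
 * issuing one PrefetchT0 per cache line.
 * Assumption: `table` is 32-byte aligned, so stepping by CACHE_LINE
 * touches every line exactly once.
 * Returns the number of prefetches issued. */
size_t warm_table(const void *table, size_t bytes)
{
    const char *p = (const char *)table;
    size_t issued = 0;

    assert(table != NULL || bytes == 0);
    for (size_t off = 0; off < bytes; off += CACHE_LINE) {
        PREFETCH_T0(p + off);
        issued++;
    }
    return issued;
}
```

For data that is read only once, `PREFETCH_NTA` would take the place of `PREFETCH_T0`. Whether warming a table this way helps at all depends on how likely the table is to miss the cache in the first place.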

Unaligned accesses are supported, and the prefetch does not split cache lines. In other words, only one cache line of 32 bytes is fetched, which includes the address of the prefetch. If the data is already in the desired cache level, or closer to the processor, then it is not moved.
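The one-line-per-prefetch rule can be expressed as simple address arithmetic. The sketch below is only an illustration: `line_base` and `lines_spanned` are hypothetical helpers, and the 32-byte line size is the Pentium III value quoted above. It computes which cache line an address falls in and how many lines (and therefore how many prefetches) a block of memory covers.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define CACHE_LINE 32u   /* Pentium III cache line size in bytes */

/* Address of the first byte of the cache line that contains `addr`.
 * A prefetch of `addr` fetches exactly this one line. */
uintptr_t line_base(uintptr_t addr)
{
    /* The mask trick only works when the line size is a power of two. */
    assert((CACHE_LINE & (CACHE_LINE - 1u)) == 0u);
    return addr & ~(uintptr_t)(CACHE_LINE - 1u);
}

/* Number of distinct cache lines touched by the byte range
 * [addr, addr + bytes), i.e. how many prefetches cover it. */
size_t lines_spanned(uintptr_t addr, size_t bytes)
{
    if (bytes == 0)
        return 0;
    uintptr_t first = line_base(addr);
    uintptr_t last  = line_base(addr + bytes - 1);
    return (size_t)((last - first) / CACHE_LINE) + 1;
}
```

Note that a 32-byte structure that is not 32-byte aligned straddles two lines, so a single prefetch of its start address brings in only part of it.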

You cannot prefetch data from uncacheable ranges such as Write Combining, AGP, video frame buffer, and so on. These prefetches are treated as no operations (NOPs).

An attempt to prefetch an illegal address is ignored: no data is moved, and you do not get an exception. This makes pointer bugs in the prefetch address (prefetching from the wrong address) difficult to track down, because their only symptom is that performance does not improve.

Using Prefetches Efficiently

Using prefetches efficiently is more of an art than a science, but any developer can acquire the necessary skills. To take full advantage of the prefetches, you must follow several simple guidelines. Following them can make the difference between a great prefetch optimization and a merely acceptable one. The micro-architectural reasons for these guidelines lie in the organization of the load buffer, the store buffer, and the fill buffer, as described in the Intel Architecture Software Optimization Reference Manual. The guidelines derived from the micro-architecture are:

  1. Prefetch only when the probability of a cache miss is high (use the VTune analyzer as described in the section below titled, "Analyzing a Sample Application"). Redundant prefetches may carry some overhead, so be thrifty about using prefetch instructions.
  2. Avoid prefetching the same cache line (32 bytes) more than once. Instead, unroll your loops to handle full cache lines per iteration, making it easier to arrange the prefetching per iteration.
  3. Change your data structures so that each cache line holds as much useful data as possible. Use the structure of arrays (SoA) format instead of the array of structures (AoS) format to increase the prefetching efficiency. Once a line has been fetched, all of its data is available, so make use of it right away instead of prefetching it again later.
  4. Spread the prefetches among other computational instructions, space them out where possible, and avoid placing them next to load instructions. This guideline is tightly related to the following one, and is based on the processor resources that the prefetches and the loads share.
  5. Carefully mix prefetches, streaming stores, loads and stores that miss the caches, and stores to uncacheable memory ranges. All these instructions cause data transactions to or from the main memory, and all of them share the same valuable processor resources.
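Several of the guidelines above can be combined in one loop. The following is a minimal sketch under stated assumptions, not a tuned implementation: `sum_stream`, `PREFETCH_NTA`, `FLOATS_PER_LINE`, and `LINES_AHEAD` are hypothetical names, the array stands in for one member of an SoA layout, and the prefetch distance of four lines is a placeholder that would need to be measured. The loop is unrolled to consume one full 32-byte line (eight floats) per iteration and issues exactly one prefetch per iteration, so no line is prefetched twice.

```c
#include <assert.h>
#include <stddef.h>

#if defined(__SSE__) || defined(_M_X64) || (defined(_M_IX86_FP) && _M_IX86_FP >= 1)
#include <xmmintrin.h>   /* _mm_prefetch, _MM_HINT_NTA */
#define PREFETCH_NTA(p) _mm_prefetch((const char *)(p), _MM_HINT_NTA)
#else
#define PREFETCH_NTA(p) ((void)(p))   /* hint only, so a no-op is safe */
#endif

#define FLOATS_PER_LINE 8   /* 32-byte cache line / 4-byte float */
#define LINES_AHEAD     4   /* assumed prefetch distance; tune by measurement */

/* Sums an array that is read exactly once (non-temporal data).
 * Each iteration of the main loop handles one full cache line and
 * issues a single prefetch for a line further ahead. */
float sum_stream(const float *x, size_t n)
{
    const size_t ahead = (size_t)FLOATS_PER_LINE * LINES_AHEAD;
    float total = 0.0f;
    size_t i = 0;

    assert(x != NULL || n == 0);

    for (; i + FLOATS_PER_LINE <= n; i += FLOATS_PER_LINE) {
        /* Stay inside the array so the pointer arithmetic is valid C. */
        if (i + ahead < n)
            PREFETCH_NTA(&x[i + ahead]);

        /* Use every element of the line that is already in the cache. */
        total += x[i]     + x[i + 1] + x[i + 2] + x[i + 3]
               + x[i + 4] + x[i + 5] + x[i + 6] + x[i + 7];
    }

    /* Remainder: fewer than eight elements left. */
    for (; i < n; i++)
        total += x[i];

    return total;
}
```

The first few lines are never prefetched here, so they still miss the cache. Whether an extra warm-up prefetch before the loop pays off is something to check with a profiler rather than assume.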


Analyzing a Sample Application

