By Peter Baker and Kim Pallister
Gamasutra
March 26, 1999

Prefetching Data and Cache Instructions

The most appealing applications for the home PC market handle growing amounts of data, whether integer or floating-point. (Just think about the amount of texture and vertex data your next title will use.) Unfortunately, most of that data is not in the caches when it’s needed, so the application stalls while loads and stores move the data to and from the caches. In some cases, the data address is known ahead of time, and the data could have been fetched in advance, reducing these waiting cycles. There are ways to do this with reads today, but those methods leave clear room for improvement. To address this problem, the Streaming SIMD Extensions contain new instructions dedicated to memory streaming: the prefetches and the streaming stores.

Some multimedia data types, such as the 3D display list, are referenced once and aren’t used again immediately. A programmer wouldn’t want a game’s cached code and data to be overwritten by this non-temporal data. The movntq/movntps (or streaming store) instructions let data be written directly to memory, thereby minimizing cache pollution. For data that you know you’ll use soon and often, there are the prefetch instructions. Each one lets you prefetch 32 bytes of data (a cache line on the Pentium III processor) before it’s actually used. Depending on the variant, the data is prefetched into the L1 cache only, into all cache levels, or into all levels except L1. Table 1 shows the different uses of data prefetching.

| Data Use | Prefetch Type | Prefetch Instruction |
|---|---|---|
| Data will be used once | Prefetch into L1 only | PrefetchNTA |
| Data likely to be reused | Prefetch into all levels | PrefetchT0 |
| Data likely to be reused, but not immediately | Prefetch to all levels except L1 | PrefetchT1 / T2 |

Table 1 – Data Use vs. Prefetch Type
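
To make the streaming store side of this concrete, here is a minimal sketch in C++ using the SSE compiler intrinsics rather than hand-written assembly. The function name fill_streaming and the requirements that the buffer be 16-byte aligned and that the count be a multiple of four are assumptions made for this illustration. The intrinsic _mm_stream_ps is the C-level form of movntps.

```cpp
#include <xmmintrin.h>  // SSE intrinsics: _mm_set1_ps, _mm_stream_ps, _mm_sfence
#include <cstddef>

// Hypothetical helper: fill a float buffer with one value using streaming
// (non-temporal) stores so the written data does not displace cached data.
// Assumptions: dst is 16-byte aligned and count is a multiple of 4.
void fill_streaming(float* dst, std::size_t count, float value)
{
    const __m128 v = _mm_set1_ps(value);   // broadcast value to all four lanes
    for (std::size_t i = 0; i < count; i += 4) {
        _mm_stream_ps(dst + i, v);         // movntps: store with a non-temporal hint
    }
    _mm_sfence();  // order the streaming stores before any later stores
}
```

A buffer declared as alignas(16) float buf[16] could be filled with fill_streaming(buf, 16, 2.5f). Whether streaming stores actually help depends on the access pattern, so treat this as a starting point for measurement.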

These instructions retire quickly and act merely as hints to the processor, so they won’t generate any exceptions or faults. When prefetching data, it’s important to remember these simple rules:

  1. Choose the right type of prefetch
  2. Try to process a whole cache line (32 bytes) in one iteration
  3. Unroll the loops as necessary
  4. Make sure the CPU has some work to do while the data is being prefetched (i.e., don’t try to use the data right away)
  5. Treat the prefetch execution like a memory read when scheduling code.
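
The rules above can be sketched in C++ with the SSE intrinsics. This is an illustration under stated assumptions, not tuned code: the function name sum_with_prefetch is invented for the example, the array is assumed to be 16-byte aligned with a length that is a multiple of eight floats, and the prefetch distance kPrefetchAhead is a placeholder that would need to be measured on real hardware.

```cpp
#include <xmmintrin.h>  // _mm_prefetch, _mm_load_ps, _mm_add_ps, _mm_store_ps
#include <cstddef>

// Hypothetical example: sum an array that is read only once.
// Assumptions: data is 16-byte aligned and count is a multiple of 8
// (8 floats = 32 bytes = one cache line on the Pentium III processor).
float sum_with_prefetch(const float* data, std::size_t count)
{
    const std::size_t kPrefetchAhead = 64;  // floats ahead; untuned placeholder
    __m128 acc0 = _mm_setzero_ps();
    __m128 acc1 = _mm_setzero_ps();

    for (std::size_t i = 0; i < count; i += 8) {
        // Rule 1: the data is used once, so use the non-temporal hint.
        // Rule 4: request a line well ahead of the one being processed.
        // The guard only keeps the pointer arithmetic inside the array.
        if (i + kPrefetchAhead < count) {
            _mm_prefetch(reinterpret_cast<const char*>(data + i + kPrefetchAhead),
                         _MM_HINT_NTA);
        }
        // Rules 2 and 3: unrolled by two so each iteration consumes
        // one whole 32-byte line.
        acc0 = _mm_add_ps(acc0, _mm_load_ps(data + i));
        acc1 = _mm_add_ps(acc1, _mm_load_ps(data + i + 4));
    }

    alignas(16) float lanes[4];
    _mm_store_ps(lanes, _mm_add_ps(acc0, acc1));
    return lanes[0] + lanes[1] + lanes[2] + lanes[3];
}
```

For data that will be reused, the same call takes _MM_HINT_T0, _MM_HINT_T1, or _MM_HINT_T2 in place of _MM_HINT_NTA, matching the rows of Table 1.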

Branching

As processor pipelines get deeper and deeper, branch mispredictions become more and more costly. There are a couple of things you can do to deal with this problem. First, try to follow the branch prediction rules for the processor. The Pentium® III processor’s branch prediction rules are the same as those of the Pentium® II processor (see Sidebar 1, "Understanding the Pentium® II Processor"). Second, you can simply remove the branch where appropriate. Take the following example, where we’re using logical instructions to remove branches:

C++:

a = (a < b) ? c : d; // Only a single compare here, and there is a branch.

Assembly:

cmpps xmm0, xmm1, 1
;4 compares of "a" (xmm0) and "b" (xmm1) w/ one instruction.
;Predicate 1 is less-than, so xmm0 becomes a mask that is
;all ones wherever a < b. This is also the beginning of the
;branch removal.

movaps xmm2, xmm0
;Save a copy of the mask.

andps xmm0, xmm3
;xmm0 = and(mask, c), where c=xmm3

andnps xmm2, xmm4
;xmm2 = andnot(mask, d), where d=xmm4

orps xmm0, xmm2
;xmm0 = and(mask, c) | andnot(mask, d)
;Final result as in the above C++ statement, but 4X.
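
The same mask-and-merge sequence can be written with the SSE intrinsics, which may be easier to experiment with than inline assembly. The function name select_lt is an assumption made for this sketch. Each intrinsic corresponds to one instruction in the listing above, except that the compiler handles the register copy.

```cpp
#include <xmmintrin.h>  // _mm_cmplt_ps, _mm_and_ps, _mm_andnot_ps, _mm_or_ps

// Hypothetical helper: four-wide, branch-free form of
//     r = (a < b) ? c : d;
__m128 select_lt(__m128 a, __m128 b, __m128 c, __m128 d)
{
    // cmpps with predicate 1: each lane is all ones where a < b, else zero.
    const __m128 mask = _mm_cmplt_ps(a, b);
    // andps keeps c where the mask is set; andnps keeps d where it is
    // clear; orps merges the two halves.
    return _mm_or_ps(_mm_and_ps(mask, c), _mm_andnot_ps(mask, d));
}
```

Note that a comparison involving a NaN produces a zero mask in that lane, so such lanes take the value from d, just as the scalar expression would.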

Or, say you wanted to simply perform a clamp on an angle. You could use the MINPS instruction (for an upper bound) or the MAXPS instruction (for a lower bound) to apply the clamp to four values at once. Here we’re using MINPS to clamp a vector to one (1.0f).

C++:

a = (a > b) ? b : a; // Only doing ONE compare here AND there is a branch.

Assembly:

minps xmm0, xmm1
;Where xmm0 = a; xmm1 = b = 1.0f;
;4 compares ("a" and "b") w/ one instruction
;Final result as in the above C++ statement, but 4X.
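
For comparison, here is a small intrinsics sketch of the same clamp. The name clamp_to_max and the choice to pass the limit as a scalar are assumptions made for the example. The intrinsic _mm_min_ps is the C-level form of MINPS.

```cpp
#include <xmmintrin.h>  // _mm_min_ps, _mm_set1_ps

// Hypothetical helper: clamp four floats to an upper limit with no branch.
// Four-wide equivalent of: a = (a > limit) ? limit : a;
__m128 clamp_to_max(__m128 a, float limit)
{
    return _mm_min_ps(a, _mm_set1_ps(limit));  // minps against a broadcast limit
}
```

Calling clamp_to_max(v, 1.0f) clamps every lane of v to at most 1.0f. A lower bound works the same way with _mm_max_ps.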

