By Peter Baker and Kim Pallister
Gamasutra
March 26, 1999


New Data Types, New Instructions

Contents

Introduction

New Floating-Point Registers

New Data Types, New Instructions

Data Movement and Data Manipulation

Prefetching Data and Cache Instructions

Type Conversion

Background

Understanding the Pentium II Processor

SIMD Explained

Probably the most useful new processor feature for game developers is the Streaming SIMD Extensions (see Sidebar 2, "SIMD Explained"). MMX technology introduced SIMD operations on integer data types, and once it was unveiled it was clear that the instruction set architecture could be extended to serve algorithms that use single-precision floating-point data.

The Streaming SIMD Extensions were designed specifically to address the needs of algorithms that are:

  1. Computationally intensive
  2. Inherently parallel
  3. Dependent on efficient cache utilization
  4. Single-precision floating-point implementations

Generally, you want to optimize the code segments that are computationally expensive and account for most of the application's processing time. The new instructions help accelerate applications that rely heavily on floating-point operations, such as 3D geometry and lighting, video processing, and high-end audio mixing.

Before we delve into the instructions themselves, however, it makes sense to look at the type of data the instructions require. The principal data type of the Streaming SIMD Extensions is a new 128-bit data type that holds four single-precision floating-point values (see Figure 3). In most cases, this data type must be 16-byte aligned.


Figure 3 – the __m128 data type
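
To make the alignment requirement concrete, here is a minimal sketch in C. It assumes a compiler that exposes the Streaming SIMD Extensions through the intrinsics header `<xmmintrin.h>`; the helper names (`is_aligned16`, `add4_aligned`, `add4_unaligned`) are made up for this illustration and are not part of any Intel API.

```c
#include <assert.h>
#include <stdint.h>
#include <xmmintrin.h>  /* SSE intrinsics: __m128, _mm_load_ps, _mm_store_ps, ... */

/* Returns 1 if p sits on a 16-byte boundary, 0 otherwise. */
static int is_aligned16(const void *p)
{
    return ((uintptr_t)p & 15u) == 0;
}

/* Hypothetical helper: adds two 4-float vectors stored in plain arrays.
   _mm_load_ps and _mm_store_ps require 16-byte-aligned addresses
   (the underlying movaps instruction faults otherwise), so we check first. */
static void add4_aligned(const float *a, const float *b, float *out)
{
    assert(is_aligned16(a) && is_aligned16(b) && is_aligned16(out));
    __m128 va = _mm_load_ps(a);
    __m128 vb = _mm_load_ps(b);
    _mm_store_ps(out, _mm_add_ps(va, vb));
}

/* Same operation using the unaligned load/store forms (movups),
   which accept any address. */
static void add4_unaligned(const float *a, const float *b, float *out)
{
    __m128 va = _mm_loadu_ps(a);
    __m128 vb = _mm_loadu_ps(b);
    _mm_storeu_ps(out, _mm_add_ps(va, vb));
}
```

In practice you would arrange for vertex and matrix data to be 16-byte aligned up front, so that the aligned forms can be used throughout the inner loops.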

The new data type holds single-precision values in the format defined by IEEE Standard 754 for binary floating-point arithmetic. The x87 floating-point unit of previous Intel architectures also follows IEEE 754, but it can carry intermediate results in 80-bit extended precision, whereas the Streaming SIMD Extensions compute in single precision throughout. Because of differences like this, results from operations done with the Streaming SIMD Extensions and results obtained by the standard Intel architecture floating-point operations may not be bit exact.

The 70 new instructions can be broken down into six basic categories. We won’t list all of the instructions in this article (check out http://developer.intel.com for documentation). Instead, we’ll hit the highlights, giving an overview of each type of instruction and an example of how it might be used. The examples come from a subject near and dear to game developers: 3D graphics.

Not too long ago, 3D application performance was limited by poorly performing accelerator hardware (or worse, no accelerator at all). Fast rasterization hardware has quickly become mainstream on PC platforms. The processor now has the difficult task of performing calculations fast enough on the geometry and lighting side of the 3D pipeline to keep the accelerator fed.

The processor, as expected, has a number of instructions that perform arithmetic computations. These can be further sorted into two groups: full-precision instructions and approximate-precision instructions. The full-precision instructions are the floating-point operations you know and love (adds, subtracts, multiplies, divides, and so on), here operating on the new Pentium III registers.

There are also several approximate-precision instructions for doing reciprocals and reciprocal square roots. These instructions are extremely fast, but return only 11 bits of precision (rather than 23). They are useful for lighting, perspective projection, and all kinds of other 3D graphics tasks for which 11 bits of precision is sufficient. For applications where more precision is required, you can use the following code to perform a Newton-Raphson iteration on the result and get up to 22 bits of precision:

// Newton-Raphson refinement of the reciprocal estimate:
// x1 = 2 * x0 - tz * x0 * x0, where x0 = rcpps(tz) ~ 1/tz
// the initial value, tz, assumed to be in xmm0

rcpps xmm1, xmm0 // xmm1 = x0 ~ 1/tz
mulps xmm0, xmm1 // xmm0 = tz * x0
mulps xmm0, xmm1 // xmm0 = tz * x0 * x0
addps xmm1, xmm1 // xmm1 = 2 * x0
subps xmm1, xmm0 // xmm1 = 2 * x0 - tz * x0 * x0 (refined 1/tz)

This sequence can be completed in half the time it takes to do a full-precision divide, which means that you get four results in less time than it takes to get one on a Pentium II processor.
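
For readers working in C rather than assembly, the same refinement step can be sketched with compiler intrinsics. This is only an illustration under stated assumptions: it presumes a compiler that provides `<xmmintrin.h>`, and the function name `recip_refined` is hypothetical. It computes x1 = 2 * x0 - d * x0 * x0, where x0 is the fast reciprocal estimate of d.

```c
#include <xmmintrin.h>  /* SSE intrinsics */

/* Hypothetical helper: approximates 1/d for four floats at once.
   x0 is the fast, low-precision estimate produced by rcpps;
   one Newton-Raphson step, x1 = 2*x0 - d*x0*x0, roughly doubles
   the number of correct bits. */
static __m128 recip_refined(__m128 d)
{
    __m128 x0     = _mm_rcp_ps(d);                      /* rcpps: x0 ~ 1/d       */
    __m128 d_x0sq = _mm_mul_ps(_mm_mul_ps(d, x0), x0);  /* two mulps: d*x0*x0    */
    __m128 two_x0 = _mm_add_ps(x0, x0);                 /* addps: 2*x0           */
    return _mm_sub_ps(two_x0, d_x0sq);                  /* subps: 2*x0 - d*x0*x0 */
}
```

A typical use would be the perspective divide: compute `recip_refined` once for four w values, then multiply x, y, and z by the result instead of issuing a divide for each.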

Each of the computational instructions has both a packed version (denoted by a ps suffix; see Figure 4) and a scalar version (denoted by an ss suffix; see Figure 5). The packed versions perform four operations with one instruction, whereas the scalar versions operate only on the least-significant data element and leave the other three elements of the destination unchanged.


Figure 4 - The packed versions of the instructions will operate on four data elements at a time

 


Figure 5 - The scalar versions of the instructions will operate on the least-significant data element only.
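
The difference between the packed and scalar forms is easy to see in a small C sketch. As before, this assumes a compiler that provides `<xmmintrin.h>`; the wrapper names `add_packed` and `add_scalar` are invented for the illustration. `_mm_add_ps` corresponds to addps and `_mm_add_ss` to addss.

```c
#include <xmmintrin.h>  /* SSE intrinsics */

/* Packed add (addps): all four elements are added pairwise. */
static void add_packed(const float a[4], const float b[4], float out[4])
{
    _mm_storeu_ps(out, _mm_add_ps(_mm_loadu_ps(a), _mm_loadu_ps(b)));
}

/* Scalar add (addss): only the least-significant element (index 0) is added.
   The other three elements of the result are passed through unchanged
   from the first operand, which plays the role of the destination register. */
static void add_scalar(const float a[4], const float b[4], float out[4])
{
    _mm_storeu_ps(out, _mm_add_ss(_mm_loadu_ps(a), _mm_loadu_ps(b)));
}
```

With a = {1, 2, 3, 4} and b = {10, 20, 30, 40}, the packed version yields {11, 22, 33, 44}, while the scalar version yields {11, 2, 3, 4}.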


Copyright © 2003 CMP Media LLC