
By Peter Baker and Kim Pallister
Gamasutra
March 26, 1999


Data Movement and Data Manipulation


For everyday data movement, the Streaming SIMD Extensions provide move instructions. The movaps (move aligned packed single) and movups (move unaligned packed single) instructions transfer 128 bits of packed data from memory to one of the XMM registers and vice-versa, or between XMM registers. The faster movaps instruction can be used if the data is aligned on a 16-byte boundary.
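Whether the aligned form can be used comes down to a simple address test. The following is a minimal sketch in portable C, not taken from the instruction set documentation; the helper names is_aligned16 and align_up16 are hypothetical and exist only for illustration.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper: returns 1 if p sits on a 16-byte boundary,
   which is the condition for using movaps on that address. */
static int is_aligned16(const void *p)
{
    return ((uintptr_t)p % 16u) == 0;
}

/* Hypothetical helper: rounds p up to the next 16-byte boundary.
   The caller must over-allocate by at least 15 bytes. */
static void *align_up16(void *p)
{
    uintptr_t a = ((uintptr_t)p + 15u) & ~(uintptr_t)15u;
    assert(is_aligned16((void *)a));
    return (void *)a;
}
```

Data that fails the test can still be loaded with movups; rounding a buffer up as above is one way to make sure the aligned form applies.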

Remember that this is a four-way parallel instruction set, and we want to get as much parallelism out of the code as we can. If your data isn’t laid out in memory four in a row, some data manipulation may be required first. Since we’re using a packed data type, the data has to be in the correct format for optimal use by the instruction set. To that end, the instruction set provides instructions for data manipulations such as shuffles, 64-bit moves, packing and unpacking, and inserts and extracts.

For instance, say you want to perform simple dot products. In most 3D engines, data is laid out in a simple structure like this (where ‘w’=1):

struct vertice {
    float x, y, z, w;    // position (w = 1)
    float nx, ny, nz;    // normal
    float u, v;          // texture coordinates
};

Then the following code performs the dot products:

for (i=0;...)
{

FR3 = ((X*m00) + (Y*m01) + (Z*m02) + mat03);
.
.
.

}

This performs the operations shown in Figure 6.


Figure 6 – Non-optimal data layout

 

In Figure 6, you can see that we’re wasting 25% of our execution bandwidth in the multiply (we really only have to do three multiplies, assuming w=1), and we suffer from the additional overhead of three shuffles and three adds to get the final result.
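To make the waste concrete, here is a scalar C sketch of the per-vertex approach: one four-wide multiply of the position by a matrix row, then three adds to collapse the lanes. The function name dot_row_aos and the row layout {m00, m01, m02, m03} are assumptions made for this illustration.

```c
#include <assert.h>

/* Vertex layout from the article. */
struct vertice {
    float x, y, z, w;
    float nx, ny, nz;
    float u, v;
};

/* Scalar model of the non-optimal layout: prod[] stands in for the
   result of one packed multiply. row[] = {m00, m01, m02, m03} is an
   assumed layout for one matrix row. */
static float dot_row_aos(const struct vertice *p, const float row[4])
{
    float prod[4];
    assert(p && row);
    prod[0] = p->x * row[0];
    prod[1] = p->y * row[1];
    prod[2] = p->z * row[2];
    prod[3] = p->w * row[3];  /* with w = 1 this multiply does no useful work */
    /* Three adds to combine the four lanes into one result. */
    return ((prod[0] + prod[1]) + prod[2]) + prod[3];
}
```

One lane in four is spent multiplying by 1, which is the 25% mentioned above, and a real SIMD version also needs shuffles to bring the lanes together before each add.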

Optimally, the data should be set up in a parallel format, so that the four dot products could be done with three multiplies and three adds, as shown in Figure 7. These parallel calculations can be done with the Streaming SIMD Extensions in the same time it took to do the one dot product on the Pentium II processor.


Figure 7 – Optimal SIMD data layout
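The parallel layout can be sketched in scalar C as follows, with each loop iteration standing in for one SIMD lane. The struct vertex4 and the function dot_row_soa are hypothetical names introduced for this sketch, and the row layout {m00, m01, m02, m03} is an assumption.

```c
#include <assert.h>

#define LANES 4

/* Four vertices stored component-wise (xxxx, yyyy, zzzz). */
struct vertex4 {
    float x[LANES];
    float y[LANES];
    float z[LANES];
};

/* Scalar model: every lane does three multiplies and three adds.
   On the SIMD unit the loop collapses into three packed multiplies
   and three packed adds covering all four vertices. */
static void dot_row_soa(const struct vertex4 *v, const float row[4],
                        float out[LANES])
{
    int i;
    assert(v && row && out);
    for (i = 0; i < LANES; ++i) {
        out[i] = v->x[i] * row[0]
               + v->y[i] * row[1]
               + v->z[i] * row[2]
               + row[3];          /* w = 1, so m03 is simply added */
    }
}
```

No lane is wasted and no horizontal adds are needed, because the four results already sit in separate lanes.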

 

How do you go about reordering the data? One method is to use the 64-bit move movhps dest, src (see Figure 8) and the shuffle shufps dest, src, mask (see Figure 9). The 64-bit move instructions transfer 64 bits, representing two single-precision operands, between memory and either the upper (movhps) or lower (movlps) 64 bits of an XMM register.


Figure 8 – movhps instruction
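The behavior of movhps can be modeled in a few lines of scalar C. This is only a sketch: the xmm_t struct is a stand-in for an XMM register, and movhps_store and movhps_load are hypothetical names for the two directions of the instruction.

```c
#include <assert.h>
#include <string.h>

/* Scalar stand-in for an XMM register; f[0] is the lowest element. */
typedef struct { float f[4]; } xmm_t;

/* Models "movhps mem, reg": store the upper 64 bits (f[2], f[3]). */
static void movhps_store(float mem[2], const xmm_t *src)
{
    assert(mem && src);
    memcpy(mem, &src->f[2], 2 * sizeof(float));
}

/* Models "movhps reg, mem": load 64 bits into the upper half.
   The lower half of the register is left unchanged. */
static void movhps_load(xmm_t *dst, const float mem[2])
{
    assert(dst && mem);
    memcpy(&dst->f[2], mem, 2 * sizeof(float));
}
```

The point to notice is that the load form only overwrites half of the destination, so whatever was in the lower two elements stays there.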

 

The shuffle can be used to rotate, shift, swap and broadcast data between two registers or within one register (if both src and dest are the same), under the control of a mask. The mask contains eight bits: two bits for each data element in the result. Bits 0 and 1 of the immediate field select which of the four numbers in dest becomes the first number of the result; bits 2 and 3 select the second number, also from dest. Bits 4 and 5 and bits 6 and 7 select the third and fourth numbers in the same way, but from the four numbers in src.


Figure 9 – shufps instruction
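The mask decoding is easier to follow as code. Below is a scalar C model of shufps, written as a sketch rather than a reference implementation; xmm_t is a stand-in for an XMM register, and SHUF_MASK is a hypothetical helper macro for building the immediate from four 2-bit selectors.

```c
#include <assert.h>

/* Scalar stand-in for an XMM register; f[0] is the lowest element. */
typedef struct { float f[4]; } xmm_t;

/* Hypothetical helper: sel0 picks result element 0, sel3 element 3. */
#define SHUF_MASK(sel3, sel2, sel1, sel0) \
    (((sel3) << 6) | ((sel2) << 4) | ((sel1) << 2) | (sel0))

/* Models "shufps dest, src, mask" and returns the new dest.
   The two low result elements come from dest, the two high from src. */
static xmm_t shufps(xmm_t dest, xmm_t src, unsigned mask)
{
    xmm_t r;
    assert(mask <= 0xFFu);
    r.f[0] = dest.f[mask & 3u];
    r.f[1] = dest.f[(mask >> 2) & 3u];
    r.f[2] = src.f[(mask >> 4) & 3u];
    r.f[3] = src.f[(mask >> 6) & 3u];
    return r;
}
```

With the same register as both operands, a mask of 0x00 broadcasts the lowest element to all four positions, and SHUF_MASK(0, 1, 2, 3) reverses the element order.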

 

Now we’ll show an example of how these instructions can be used to reorganize vertex data. (The "-" symbol in the comments below denotes a "don’t care".)

// Where xmm7 = -z0y0x0; xmm2 = -z1y1x1;
// xmm4 = -z2y2x2; xmm3 = -z3y3x3
// Reorder the input vertices to be
// in xxxx,yyyy,zzzz format

movhps temp1, xmm7
// Use 64-bit moves to move the high 64-bits…

movhps temp2, xmm4
// Save the Z0, Z2 values from these vectors

shufps xmm7, xmm2, 0x44
// xmm7 = y1,x1,y0,x0

shufps xmm4, xmm3, 0x44
// xmm4 = y3,x3,y2,x2

movaps xmm5, xmm7
// save content of register to extract
// the X elements later

shufps xmm7, xmm4, 0xDD
// xmm7 = y3,y2,y1,y0

shufps xmm5, xmm4, 0x88
// xmm5 = x3,x2,x1,x0

movhps xmm6, temp1
// move the Z0 element from memory
// to the reg; xmm6 = -,z0,-,-

shufps xmm6, xmm2, 0x22
// xmm6 = -,z1,-,z0

movhps xmm2, temp2
// move the Z2 element from memory
// to the reg; xmm2 = -,z2,y1,x1

shufps xmm2, xmm3, 0x22
// xmm2 = -,z3,-,z2

shufps xmm6, xmm2, 0x88
// xmm6 = z3,z2,z1,z0
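To check the listing above step by step, the sequence can be replayed with scalar C stand-ins for the two instructions. This is a sketch under stated assumptions: xmm_t, shufps, movhps_store, movhps_load and reorder_xyz are hypothetical names, and xmm6 is zero-initialized here only so the model has defined "don’t care" values.

```c
#include <assert.h>

typedef struct { float f[4]; } xmm_t;   /* f[0] = lowest element */

/* Low two result elements from dest, high two from src. */
static xmm_t shufps(xmm_t dest, xmm_t src, unsigned mask)
{
    xmm_t r;
    r.f[0] = dest.f[mask & 3u];
    r.f[1] = dest.f[(mask >> 2) & 3u];
    r.f[2] = src.f[(mask >> 4) & 3u];
    r.f[3] = src.f[(mask >> 6) & 3u];
    return r;
}

static void movhps_store(float mem[2], xmm_t src)
{
    mem[0] = src.f[2];
    mem[1] = src.f[3];
}

static xmm_t movhps_load(xmm_t dst, const float mem[2])
{
    dst.f[2] = mem[0];
    dst.f[3] = mem[1];
    return dst;
}

/* in[k] = {xk, yk, zk, don't care}; outputs are xxxx, yyyy, zzzz. */
static void reorder_xyz(const xmm_t in[4], xmm_t *xs, xmm_t *ys, xmm_t *zs)
{
    xmm_t xmm7, xmm2, xmm4, xmm3, xmm5;
    xmm_t xmm6 = {{0.0f, 0.0f, 0.0f, 0.0f}};
    float temp1[2], temp2[2];

    assert(in && xs && ys && zs);
    xmm7 = in[0]; xmm2 = in[1]; xmm4 = in[2]; xmm3 = in[3];

    movhps_store(temp1, xmm7);          /* temp1 = z0, -         */
    movhps_store(temp2, xmm4);          /* temp2 = z2, -         */
    xmm7 = shufps(xmm7, xmm2, 0x44);    /* xmm7 = y1,x1,y0,x0    */
    xmm4 = shufps(xmm4, xmm3, 0x44);    /* xmm4 = y3,x3,y2,x2    */
    xmm5 = xmm7;                        /* keep a copy for the X's */
    xmm7 = shufps(xmm7, xmm4, 0xDD);    /* xmm7 = y3,y2,y1,y0    */
    xmm5 = shufps(xmm5, xmm4, 0x88);    /* xmm5 = x3,x2,x1,x0    */
    xmm6 = movhps_load(xmm6, temp1);    /* xmm6 = -,z0,-,-       */
    xmm6 = shufps(xmm6, xmm2, 0x22);    /* xmm6 = -,z1,-,z0      */
    xmm2 = movhps_load(xmm2, temp2);    /* xmm2 = -,z2,y1,x1     */
    xmm2 = shufps(xmm2, xmm3, 0x22);    /* xmm2 = -,z3,-,z2      */
    xmm6 = shufps(xmm6, xmm2, 0x88);    /* xmm6 = z3,z2,z1,z0    */

    *xs = xmm5;
    *ys = xmm7;
    *zs = xmm6;
}
```

Feeding in four vertices with distinct x, y and z values is a quick way to confirm that each output register holds one component from all four vertices, in vertex order from the lowest element up.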





Copyright © 2003 CMP Media LLC
