CMP Game Media Group Presents: Home
By Rob Wyatt
Gamasutra
May 28, 1999
Vol. 3: Issue 21

Features
Wyatt's World

Cracking Open The Pentium III

Contents

What is all the fuss about?

How do I detect the new instructions?

What operating system support is required for the Pentium III?

What are these new SIMD instructions?

How do I make use of the new instructions?

How do I debug code with the new instructions?

How do I read the new Pentium III serial number?

Is there any new performance/ profiling information?

How do I debug code with the new instructions?

Currently the only way of viewing the SIMD registers or disassembling SIMD code within the application environment is to use the Intel Register Viewing Tool -- examining SIMD code within the Visual C++ disassembler will reveal nothing. The Register Viewing Tool is a stand-alone tool and is not linked to the Visual C++ IDE in any way, apart from being accessible from the tool menu. The register view highlights changes to individual elements of a register, but the update is not instant. The tool only updates the registers every half second (which is fast enough while you work in the debugger), and you can always click on the ‘Refresh’ button for an instantaneous update. In addition to viewing the registers in floating-point format, they can be viewed in byte, word and dword formats.

Viewing SIMD code with Intel's Register Viewing Tool

The disassembly window displays code at the current address, which is marked with an asterisk ‘*’, and 40 bytes either side. The current address line moves down as you step through in the debugger, and it also displays the correct address when you hit a breakpoint. The only inconvenient aspect of this tool is that when a breakpoint is first hit, an int 3 instruction (debugger break) is shown in the disassembly window, and you have to enter the first byte of the actual instruction if you want an accurate disassembly. Usually for a SIMD instruction, the first byte will be 0x0F for packed instructions and 0xF3 for scalar instructions. This little problem aside, it is otherwise a very usable and essential tool if you are serious about writing SIMD code. Using the register viewing tool and some good old-fashioned debugging techniques, you will get by fine. Hopefully Microsoft will be quicker in implementing SIMD debug support than they were in implementing MMX support.

For low level debugging there is a new version of SoftICE (version 3.25) that supports the Pentium III registers and instructions. The latest version is available for download from the Numega web site at http://www.numega.com/drivercentral/components/si325.shtml (this is free to anyone who has a registered version of SoftICE 3.20 or higher).

Programming Considerations

As I said before, I am disappointed that there are no dot product (inner product) instructions. Having such an instruction could have made a huge difference for lighting and collision calculation performance. Fortunately, calculating a dot product with the new instructions can be done in just a few cycles -- significantly faster than using older x87 floating-point methods. The code below performs a simple dot product between two vectors and places the resulting value in the three low elements so that it’s ready to use. By carefully scheduling and interleaving the neighboring operations, this code could go significantly faster.

// Load the vectors
movaps xmm0, lv         // xmm0 = [-, lz, ly, lx]
movaps xmm1, nv         // xmm1 = [-, nz, ny, nx]

// Do the math
mulps  xmm0, xmm1       // xmm0 = [-, lz*nz, ly*ny, lx*nx]
movaps xmm2, xmm0       // xmm2 = [-, lz*nz, ly*ny, lx*nx]
shufps xmm0, xmm0, 9    // xmm0 = [-, lx*nx, lz*nz, ly*ny]
addps  xmm0, xmm2       // xmm0 = [-, lx*nx+lz*nz, lz*nz+ly*ny, lx*nx+ly*ny]
shufps xmm2, xmm2, 18   // xmm2 = [-, ly*ny, lx*nx, lz*nz]
addps  xmm0, xmm2       // xmm0 = [-, dp, dp, dp]
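For readers working from C rather than inline assembly, the same shuffle-and-add sequence can be sketched with the SSE intrinsics from `<xmmintrin.h>`. This is only an illustration: the function names `dot3_splat` and `dot3` are made up for this example, and the wrapper assumes plain three-float arrays as input.

```c
#include <xmmintrin.h>  /* SSE intrinsics: __m128, _mm_mul_ps, _mm_shuffle_ps, ... */

/* Dot product of two 3-component vectors held as [-, z, y, x].
   Mirrors the assembly listing: the result is replicated in lanes 0..2,
   and lane 3 carries no meaningful value. */
__m128 dot3_splat(__m128 l, __m128 n)
{
    __m128 p = _mm_mul_ps(l, n);          /* [-, lz*nz, ly*ny, lx*nx]     */
    __m128 r = _mm_shuffle_ps(p, p, 9);   /* [-, x, z, y] (of the products) */
    __m128 s = _mm_add_ps(r, p);          /* [-, x+z, z+y, y+x]           */
    __m128 t = _mm_shuffle_ps(p, p, 18);  /* [-, y, x, z]                 */
    return _mm_add_ps(s, t);              /* [-, dp, dp, dp]              */
}

/* Hypothetical convenience wrapper for plain float[3] inputs. */
float dot3(const float *l, const float *n)
{
    __m128 a = _mm_set_ps(0.0f, l[2], l[1], l[0]);
    __m128 b = _mm_set_ps(0.0f, n[2], n[1], n[0]);
    return _mm_cvtss_f32(dot3_splat(a, b));  /* read lane 0 */
}
```

Calling `dot3` on (1, 2, 3) and (4, 5, 6) should give 32, since 4 + 10 + 18 = 32.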

 

To get the most out of SIMD instructions, you must ensure that every register element performs a useful operation on every instruction. For example, if you place a single 3D vector into a SIMD register, at most you will get 75% of the maximum possible throughput. You can see in the above dot product example that only three useful operations are performed by each instruction. Not using all of the elements within a register means that the unused elements could contain unknown values. These unknown values generally cause no harm, but be careful when issuing divide and square root instructions -- especially if exceptions are enabled.

A modification to the above code can be used to perform vector normalization:

// Load the vector
movaps xmm0, v          // xmm0 = [-, z, y, x]

// Do the math
movaps xmm1, xmm0       // xmm1 = [-, z, y, x]
mulps  xmm0, xmm0       // xmm0 = [-, z*z, y*y, x*x]
movaps xmm2, xmm0       // xmm2 = [-, z*z, y*y, x*x]
shufps xmm0, xmm0, 9    // xmm0 = [-, x*x, z*z, y*y]
addps  xmm0, xmm2       // xmm0 = [-, x*x+z*z, z*z+y*y, x*x+y*y]
shufps xmm2, xmm2, 18   // xmm2 = [-, y*y, x*x, z*z]
addps  xmm0, xmm2       // xmm0 = [-, x*x+y*y+z*z, x*x+y*y+z*z, x*x+y*y+z*z]

sqrtps xmm0, xmm0       // xmm0 = [-, len, len, len]
divps  xmm1, xmm0       // xmm1 = [-, unit z, unit y, unit x]

This produces full-precision results and takes about 100 cycles -- not significantly faster than the same operation in x87 floating-point code. If the vector must be calculated using full precision, then a significant speedup can be gained by taking advantage of the fact that the square root and divide instructions (the last two instructions above) both work on vectors containing the same value in each element. The scalar square root and divide instructions are faster than the packed ones, and replacing the last two instructions in the above example with the scalar sequence below will save around 40 cycles, making this code about twice as fast as the equivalent x87 floating-point code.

sqrtss xmm0, xmm0       // xmm0 = [-, -, -, len]
movss  xmm2, one        // xmm2 = [0, 0, 0, 1.0]  (one is assumed to be a float in memory holding 1.0f)
divss  xmm2, xmm0       // xmm2 = [-, -, -, 1/len]
shufps xmm2, xmm2, 0    // xmm2 = [1/len, 1/len, 1/len, 1/len]
mulps  xmm1, xmm2       // xmm1 = [-, unit z, unit y, unit x]
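The scalar square root and divide variant can also be sketched with intrinsics. The sketch below is an illustration under stated assumptions, not the article's own code: the name `normalize3_scalar` is invented here, and the reciprocal is formed by dividing a constant 1.0 by the length, which is what the 1/len value in the listing requires.

```c
#include <xmmintrin.h>

/* Normalize a 3-component vector held as [-, z, y, x] at full precision,
   using one scalar square root and one scalar divide instead of the
   packed versions. Lane 3 of the result is not meaningful. */
__m128 normalize3_scalar(__m128 v)
{
    __m128 p = _mm_mul_ps(v, v);                        /* [-, z*z, y*y, x*x]    */
    __m128 s = _mm_add_ps(_mm_shuffle_ps(p, p, 9), p);  /* partial sums          */
    s = _mm_add_ps(s, _mm_shuffle_ps(p, p, 18));        /* [-, len2, len2, len2] */

    __m128 len = _mm_sqrt_ss(s);                        /* lane 0 = len          */
    __m128 inv = _mm_div_ss(_mm_set_ss(1.0f), len);     /* lane 0 = 1/len        */
    inv = _mm_shuffle_ps(inv, inv, 0);                  /* broadcast lane 0      */
    return _mm_mul_ps(v, inv);                          /* [-, uz, uy, ux]       */
}
```

Only lane 0 goes through the slow square root and divide; the broadcast and the final packed multiply spread the reciprocal across the vector.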

It is unlikely that a vector normalization would require full precision, so take advantage of the approximate reciprocal instructions to speed things up. Replacing the square root and divide instructions at the end of the original code with the two instructions below will reduce the overall time to around 16 cycles, which is much faster than anything in x87 floating-point code. It is even faster than using a lookup table, because this method does not thrash the cache.

rsqrtps xmm0, xmm0      // xmm0 = [-, 1/len, 1/len, 1/len]
mulps   xmm1, xmm0      // xmm1 = [-, unit z, unit y, unit x]
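As a rough intrinsics counterpart of the approximate version, the sketch below uses `_mm_rsqrt_ps`. The function name `normalize3_fast` is hypothetical. Intel documents the relative error of this approximation as at most 1.5 × 2^-12, so the components are good to roughly three decimal digits.

```c
#include <xmmintrin.h>

/* Approximate normalization of a vector held as [-, z, y, x].
   _mm_rsqrt_ps returns an approximation of 1/sqrt(x) per lane,
   so the result is close to, but not exactly, unit length.
   Lane 3 of the result is not meaningful. */
__m128 normalize3_fast(__m128 v)
{
    __m128 p = _mm_mul_ps(v, v);                        /* [-, z*z, y*y, x*x]    */
    __m128 s = _mm_add_ps(_mm_shuffle_ps(p, p, 9), p);  /* partial sums          */
    s = _mm_add_ps(s, _mm_shuffle_ps(p, p, 18));        /* [-, len2, len2, len2] */
    __m128 inv = _mm_rsqrt_ps(s);                       /* approx [-, 1/len x3]  */
    return _mm_mul_ps(v, inv);                          /* approx unit vector    */
}
```

If the result feeds lighting math, this accuracy is usually enough; code that accumulates many normalized vectors may want the full-precision path instead.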

With these vector normalization routines, you need to be careful of the unknown value in the unused fourth element if exceptions are enabled. If exceptions are disabled (masked), the SIMD unit provides reasonable values when an exception occurs.
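One way to guard against the unknown fourth element is to give it a known value when the vector is loaded and to check whether the relevant exceptions are masked. The sketch below assumes the `_MM_GET_EXCEPTION_MASK` helper and the `_MM_MASK_*` constants from `<xmmintrin.h>`; the function names are invented for illustration.

```c
#include <xmmintrin.h>

/* Nonzero when the divide-by-zero and invalid-operation SIMD
   exceptions are both masked (disabled) in the MXCSR register. */
int simd_div_sqrt_exceptions_masked(void)
{
    unsigned int wanted = _MM_MASK_DIV_ZERO | _MM_MASK_INVALID;
    return (_MM_GET_EXCEPTION_MASK() & wanted) == wanted;
}

/* Build [pad, z, y, x] from three floats so that the unused lane
   holds a known value instead of whatever was in memory. */
__m128 load3_padded(const float *v, float pad)
{
    return _mm_set_ps(pad, v[2], v[1], v[0]);
}
```

With a pad of 1.0f, the fourth lane of the sum of squares in the normalization listings is at least 1, so the square root and divide in that lane stay well defined.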

In the examples so far, we placed a whole vector in a single SIMD register, which is known as "horizontal data processing", or the AoS (Array of Structures) method. As the name implies, if you process a set of vectors with AoS, then each vector is a structure and you have an array of them (you probably process 3D geometry this way frequently). The C code below shows a typical AoS layout for 1024 vectors.

struct Vector3
{
    float X;
    float Y;
    float Z;
};

struct Vector3 AoS_Data[1024];

Another problem with the above data layout is alignment. Each vector is only 12 bytes, but the SIMD movaps instruction must fetch data from a memory address that is 16-byte aligned. To fix this problem you could use four-element vectors, but if the fourth element is not otherwise needed it may not be worth the additional 33% of storage. Alternatively, you can use the movups instruction, which can read unaligned data, but accessing data at an unaligned address incurs a performance penalty. The alignment restrictions also apply to any SIMD instruction that directly references memory, such as addps xmm1,[eax]. If the alignment restrictions are not satisfied, a general protection fault will be generated.
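The alignment rule can be illustrated in C with the aligned and unaligned load intrinsics (`_mm_load_ps` and `_mm_loadu_ps`, which correspond to movaps and movups). The helper names and the `Vector4` struct below are assumptions made for this sketch.

```c
#include <stdint.h>
#include <xmmintrin.h>

struct Vector3 { float X, Y, Z; };     /* 12 bytes: elements of an array drift off 16-byte boundaries */
struct Vector4 { float X, Y, Z, W; };  /* 16 bytes: one third more storage, but array elements stay
                                          aligned when the array itself is 16-byte aligned */

/* Nonzero if p is 16-byte aligned, i.e. safe for movaps / _mm_load_ps. */
int is_aligned16(const void *p)
{
    return ((uintptr_t)p & 15u) == 0;
}

/* Load four consecutive floats, using the aligned load when possible
   and falling back to the slower unaligned load otherwise. */
__m128 load4(const float *p)
{
    return is_aligned16(p) ? _mm_load_ps(p) : _mm_loadu_ps(p);
}
```

In real code the choice is normally made once, by aligning the data, rather than per load; the run-time test here only makes the two cases visible.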

It is common knowledge in digital signal processing and SIMD programming that using AoS is not the most efficient method of representing data such as vertices. Vertical programming, also known as the SoA (Structure of Arrays) method, is significantly faster. In this method, each element of the vectors is stored in an array, so in our example we would have an array of X components, another of Y components and an array of Z components, created like this:

struct SoA_Data
{
    float x[1024];
    float y[1024];
    float z[1024];
};

Now consider the unoptimized and unscheduled code below to normalize a set of vectors using the SoA method:

// Load the data for 4 vectors
movaps xmm0, X          // xmm0 = [x3, x2, x1, x0]
movaps xmm1, Y          // xmm1 = [y3, y2, y1, y0]
movaps xmm2, Z          // xmm2 = [z3, z2, z1, z0]

// Keep a copy
movaps xmm3, xmm0       // xmm3 = [x3, x2, x1, x0]
movaps xmm4, xmm1       // xmm4 = [y3, y2, y1, y0]
movaps xmm5, xmm2       // xmm5 = [z3, z2, z1, z0]

// Do the math
mulps   xmm0, xmm0      // xmm0 = [x3*x3, x2*x2, x1*x1, x0*x0]
mulps   xmm1, xmm1      // xmm1 = [y3*y3, y2*y2, y1*y1, y0*y0]
mulps   xmm2, xmm2      // xmm2 = [z3*z3, z2*z2, z1*z1, z0*z0]
addps   xmm0, xmm1      // xmm0 = [x3*x3+y3*y3, x2*x2+y2*y2, x1*x1+y1*y1, x0*x0+y0*y0]
addps   xmm0, xmm2      // xmm0 = [x3*x3+y3*y3+z3*z3, x2*x2+y2*y2+z2*z2, x1*x1+y1*y1+z1*z1, x0*x0+y0*y0+z0*z0]
rsqrtps xmm0, xmm0      // xmm0 = [1/len3, 1/len2, 1/len1, 1/len0]
mulps   xmm3, xmm0      // xmm3 = [unit x3, unit x2, unit x1, unit x0]
mulps   xmm4, xmm0      // xmm4 = [unit y3, unit y2, unit y1, unit y0]
mulps   xmm5, xmm0      // xmm5 = [unit z3, unit z2, unit z1, unit z0]

Even in its unoptimized form, the gains are huge. The above code takes less than 30 cycles to perform four vector normalization operations. The primary reason for the huge speed gain is that in every instruction, every element of each register performs a useful operation (which is almost impossible if you use three-element vectors in the AoS format). And the advantages don’t stop there. Most of the alignment issues are avoided as well, since only the individual arrays have to be aligned to 16-byte boundaries. Another advantage is that the number of elements in the vector is independent of the register size, so code using this format is easy to convert to using AMD’s 3DNow! instructions.
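Extended to a whole array, the SoA routine might look like the following intrinsics sketch. It assumes that the three arrays are 16-byte aligned and that the vector count is a multiple of four; the function name `normalize_soa` is made up for this example, and the results are approximate because of `_mm_rsqrt_ps`.

```c
#include <stddef.h>
#include <xmmintrin.h>

/* Normalize n vectors stored as separate x, y and z arrays, in place.
   Assumptions: x, y and z are 16-byte aligned and n is a multiple of 4.
   Zero-length vectors are not handled. */
void normalize_soa(float *x, float *y, float *z, size_t n)
{
    for (size_t i = 0; i < n; i += 4) {
        __m128 vx = _mm_load_ps(x + i);            /* [x3, x2, x1, x0] */
        __m128 vy = _mm_load_ps(y + i);            /* [y3, y2, y1, y0] */
        __m128 vz = _mm_load_ps(z + i);            /* [z3, z2, z1, z0] */

        /* Squared length of four vectors at once. */
        __m128 len2 = _mm_add_ps(_mm_add_ps(_mm_mul_ps(vx, vx),
                                            _mm_mul_ps(vy, vy)),
                                 _mm_mul_ps(vz, vz));
        __m128 inv = _mm_rsqrt_ps(len2);           /* approx 1/len per vector */

        _mm_store_ps(x + i, _mm_mul_ps(vx, inv));
        _mm_store_ps(y + i, _mm_mul_ps(vy, inv));
        _mm_store_ps(z + i, _mm_mul_ps(vz, inv));
    }
}
```

Every lane of every instruction does useful work here, and no shuffles are needed at all.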

Laying data out in the SoA format does not help in all cases, however. It has disadvantages, too. The disadvantages are usually caused by human errors, since this method requires a different way of thinking. However, once you understand it, it’s not really much different from the AoS format. The SoA format is really only useful for processing arrays of vectors -- it’s inefficient for single operations. But with a little thought on your part, it’s not difficult to convert from the AoS format to SoA.
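A straightforward conversion between the two layouts can be sketched in plain C. The function names `aos_to_soa` and `soa_to_aos` are hypothetical, and the loops are deliberately scalar; they only show the index mapping, not a tuned implementation.

```c
#include <stddef.h>

struct Vector3 { float X, Y, Z; };

/* Split an array of n structures into three component arrays. */
void aos_to_soa(const struct Vector3 *in, float *x, float *y, float *z, size_t n)
{
    for (size_t i = 0; i < n; ++i) {
        x[i] = in[i].X;
        y[i] = in[i].Y;
        z[i] = in[i].Z;
    }
}

/* Gather three component arrays back into an array of n structures. */
void soa_to_aos(const float *x, const float *y, const float *z,
                struct Vector3 *out, size_t n)
{
    for (size_t i = 0; i < n; ++i) {
        out[i].X = x[i];
        out[i].Y = y[i];
        out[i].Z = z[i];
    }
}
```

If the data is processed many times between conversions, the cost of these copies is likely to be small compared with the gain from the SoA inner loops.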


How do I read the new Pentium III serial number?
 




Copyright © 2000 CMP Media Inc. All rights reserved.