| |
|
|
||||
![]() |
||||||
| |
|
|||||
|
New Data Types, New Instructions
Probably the most useful new processor feature to game developers is the new Streaming SIMD Extensions (see Sidebar 2, "SIMD Explained"). After the unveiling of MMX technology with its SIMD operations on integer data types, it was clear that the instruction set architecture could be enriched to be more flexible and more adaptable to algorithms that used single-precision floating-point data. The Streaming SIMD Extensions were designed specifically to address the needs of algorithms that are:
Generally, you want to try to optimize code segments that are computationally expensive and that take most of the overall application processing time. The new instructions help accelerate applications that rely heavily on floating-point operations, such as 3D geometry and lighting, video processing, and high-end audio mixing. Before we delve into the instructions themselves, however, it makes sense to look at the type of data the instructions require. The principle data type of the Streaming SIMD Extensions is a new 128-bit data type (see figure 3). In most cases, this data type must be 16-byte aligned.
The new data types operate in the IEEE Standard 754 for binary floating-point arithmetic. This is a slight deviation from previous generations of Intel architectures, which used IEEE Standard 758 for representing floating-point numbers. Results from operations done with the Streaming SIMD Extensions and results obtained by the standard Intel architecture floating-point operations may not be bit exact. The 70 new instructions can be broken down into six basic categories. While we won’t list all of the instructions in this article (check out http://developer.intel.com for documentation) we’ll hit the highlights, and present an overview of each type of instruction and give an example of how each might be used, using examples in a subject near and dear to game developers: 3D graphics. Not too long ago, 3D application performance was limited by poorly performing accelerator hardware (or worse, no accelerator at all). Fast rasterization hardware has quickly become mainstream on PC platforms. The processor now has the difficult task of performing calculations fast enough on the geometry and lighting side of the 3D pipeline to keep the accelerator fed. The processor, as expected, has a number of instructions that perform arithmetic computations. These can be further sorted into two groups: full precision instructions and approximate precision instructions. Full precision instructions consist of all of those floating point operations you know and love for doing adds, subtracts, multiplies, and divides, and so on, which operate on the new Pentium III registers. There are also several approximate precision instructions for doing reciprocals and reciprocal square roots. The approximate precision instructions are extremely fast, but only return 11 bits of precision (rather than 23). These are useful for doing lighting, perspective projection and all kinds of other 3D graphics tasks for which 11 bits of precision is sufficient. For applications where more precision is required, you can use the following code to perform Newton Raphson iterations on the results, and get up to 22 bits of precision: // Newton
Raphson approximation for rcpps xmm1,
xmm0 // 1/tz This can be accomplished in half the time it takes to do a full-precision divide, which means that you get four results in less time than it takes to do one on a Pentium II processor. Each of the computational instructions has both a packed (denoted by a ps suffix -- see Figure 4) and a scalar (denoted by an ss suffix – see Figure 5) version. The difference between these two versions is that packed operations complete four operations with one instruction, whereas the scalar versions only operate on the least significant data element and leave the other three elements of the destination unchanged.
|
|
|