By Rob Wyatt
Gamasutra
May 28, 1999
Vol. 3: Issue 21

Features
Wyatt's World

Cracking Open The Pentium III

Contents

What is all the fuss about?

How do I detect the new instructions?

What operating system support is required for the Pentium III?

What are these new SIMD instructions?

How do I make use of the new instructions?

How do I debug code with the new instructions?

How do I read the new Pentium III serial number?

Is there any new performance/profiling information?

How do I make use of the new instructions?

The best way to write code for the Pentium III is to use version 4.0 of the Intel C/C++ compiler. This compiler, which comes with Intel’s VTune, is a replacement for the Microsoft C/C++ compiler that Visual C++ uses. The advantage of using the Intel compiler is that you can still use the IDE, debugger, linker and tools that you are familiar with, so there is no learning curve. If you prefer C++, the Intel compiler is a much better implementation of the language than Microsoft’s version.

The Intel compiler supports the new Pentium III instructions via its vectorizing code generator, which can emit SIMD instructions from ordinary C code. The new instructions are also supported within the inline assembler. If you don’t want to code in assembly, there is a new Intel compiler-specific SIMD data type called __m128 and a set of intrinsic C functions, so optimized code can be developed without assembly language. For every SIMD instruction, there is a compiler function to do the same thing. For example,

__m128 _mm_add_ps(__m128, __m128)

adds the two specified SIMD data types together and can compile to a single instruction. The __m128 type can be made into a union with an array of floats if access to the individual floats is required. The only caveat is that this operation requires the __m128 to be in memory, because there are no instructions to move floating point data between the SIMD and x87 registers. Intel also tries to maintain code portability through a C++ wrapper class called F32vec4, which has inline member functions with the same names as the intrinsic C functions. When used, this class generates exactly the same code as that generated by the intrinsic functions, so an application’s performance should not be affected. The benefit of using the class is that it can be re-implemented using x87 (or even AMD’s 3DNow!) without having to change your source code. For full documentation of the intrinsic types, the C++ classes and the vectorizing compiler options, look at appendix C in volume 2 of the Intel Architecture Reference Manual, or the Intel optimizations reference manual, both of which are available from Intel’s developer web site or on the VTune CD-ROM.
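As a rough illustration of the intrinsic and the union trick described above, here is a minimal sketch. The names Vec4 and vec4_add are my own inventions for this example, and the xmmintrin.h header and the unaligned-load intrinsic _mm_loadu_ps are assumptions about the build environment; check the compiler documentation for what your version provides.

```c
#include <xmmintrin.h>  /* assumed header for __m128 and the _mm_* intrinsics */

/* Hypothetical union giving access to the individual floats of a SIMD value.
   Reading f[] forces the value through memory, as noted in the text. */
typedef union {
    __m128 v;
    float  f[4];
} Vec4;

/* Adds two arrays of four floats using one packed add.
   Unaligned loads are used so the caller's arrays need no special alignment. */
static Vec4 vec4_add(const float a[4], const float b[4])
{
    Vec4 r;
    r.v = _mm_add_ps(_mm_loadu_ps(a), _mm_loadu_ps(b));
    return r;
}
```

Calling vec4_add on {1, 2, 3, 4} and {10, 20, 30, 40} and then reading r.f[0] through r.f[3] gives the four sums. Whether the add really compiles to a single ADDPS depends on the compiler and its options.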

I assume most professional game programmers use Visual C++ and may not want to change compilers just to access the new instructions. If you are in this position, all is not lost. Here are a couple of options for using the new instructions within Visual C++.

First, the Intel compiler produces object files that are binary compatible with Visual C++, including C++ name mangling. With this in mind, you can move the individual C functions or C++ class members that require Pentium III optimizations into separate source files, and then compile those files with the Intel compiler. To make switching compilers even easier, Intel implemented a #pragma that lets you select the compiler on a source-file basis.

Alternatively, Microsoft has updated MASM via a patch to include these new instructions. The latest MASM version is 6.14, and the patch will update versions 6.11, 6.11a, 6.11d, 6.12 and 6.13. The ML614.exe patch is available from the Microsoft web site at http://support.microsoft.com/download/support/mslfiles/ml614.exe. Intel also provides an include file for MASM, called IAXMM.INC, that defines macros for the Pentium III instructions and works with all versions of MASM; it is available from Intel’s developer web site. If your build environment includes MASM, one of these options may be the way to go.

The MASM include file inspired me to build a set of macros that emit the opcode bytes directly into the code stream, thereby allowing any compiler to use the Pentium III instructions. This turned out to be a little more difficult than I anticipated, mainly because of inline assembly code restrictions. The instruction macros I provide here are not ideal, but they do the job. For small sections of inline assembly code, they are perfectly adequate and can make a huge difference.

To create these macros, I first defined the register names and their respective values. The standard register names are reserved words, so they cannot be used. Further, the SIMD register names will be reserved words in a future version of Visual C++ (they are already reserved words in the Intel compiler), so it’s better not to use them, either. In the end, I decided to call the SIMD registers _XMM0 to _XMM7, and I called the MMX registers _MM0 to _MM7. The integer registers have two forms, depending on whether they are used as address pointers or not. The pointer versions of the standard registers are called EAX_PTR, EBX_PTR, and so on, and the register versions are called EAX_REG, EBX_REG, and so on.

Opcode 0f 58 /r is the addps instruction, where "/r" means that the source operand is either a register or a memory pointer. Fortunately, only registers can be destinations within most SIMD instructions, so there are only two forms of each instruction. Looking at the encoding of the "/r" component of the instructions, you’ll notice that it’s a standard "mod/rm-sib-offset" sequence (for example, [eax*2+ebx+offset]), just like any other instruction. With this in mind, the register-to-register version of the instruction (addps xmm,xmm) becomes trivial to encode because the one-byte "/r" component (the mod/rm byte) is laid out as follows:

Bit 7    Bit 6    Bits 5-3            Bits 2-0
1        1        Dst XMM register    Src XMM register

The following macro assembles the instruction:

#define ADDPS_REG(dst,src) \
{ \
_asm _emit 0x0f \
_asm _emit 0x58 \
_asm _emit 0xc0 | ((dst)<<3) | (src) \
}

This would simply be used as ADDPS_REG(_XMM0,_XMM1) from either inside or outside of an assembly code block.
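Because _emit is specific to Microsoft-style inline assembly, it can be handy to check the byte arithmetic in plain C first. The following is only a sketch of the register-to-register encoding described above; the function name encode_addps_reg is hypothetical, and it takes plain register numbers 0 to 7 rather than any named constants.

```c
#include <stdint.h>

/* Writes the three bytes of "addps xmm<dst>, xmm<src>" into out[0..2].
   dst and src are plain XMM register numbers in the range 0-7. */
static void encode_addps_reg(unsigned dst, unsigned src, uint8_t out[3])
{
    out[0] = 0x0f;                              /* two-byte opcode escape  */
    out[1] = 0x58;                              /* addps                   */
    out[2] = (uint8_t)(0xc0                     /* bits 7-6 = 11: register */
                       | ((dst & 7u) << 3)      /* bits 5-3: destination   */
                       | (src & 7u));           /* bits 2-0: source        */
}
```

For dst = 0 and src = 1 this produces the bytes 0f 58 c1, which is what the ADDPS_REG macro is intended to emit for the same pair of registers.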

The register-to-register form of the instructions is of no use unless we can also use the memory form of the instructions to load data. If we look at the same instruction with a 32-bit integer register pointing to the data, the "/r" component of the instruction remains a single byte. It is laid out as:

Bit 7    Bit 6    Bits 5-3            Bits 2-0
0        0        Dst XMM register    Src integer register ptr

Like before, we can define a macro to assemble this instruction:

#define ADDPS_MEM(dst,src) \
{ \
_asm _emit 0x0f \
_asm _emit 0x58 \
_asm _emit 0x00 | ((dst)<<3) | (src) \
}

This would be used as ADDPS_MEM(_XMM1,EAX_PTR) and would add the 128-bit value pointed to by EAX to the contents of the XMM1 register.
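The memory form can be checked the same way. This is a sketch under my own assumptions: encode_addps_mem is a hypothetical name, and it takes plain register numbers using the processor's numbering (EAX = 0, ECX = 1, EDX = 2, EBX = 3, ESI = 6, EDI = 7). One property of the IA-32 encoding is worth knowing here: when bits 7-6 are zero, register numbers 4 and 5 do not mean [esp] and [ebp]. Number 4 signals that a SIB byte follows, and number 5 signals a bare 32-bit displacement, so the sketch refuses both rather than emit a one-byte form that means something else.

```c
#include <stdint.h>

/* Writes the three bytes of "addps xmm<dst>, [base]" into out[0..2].
   Returns 1 on success, or 0 if the operands cannot be encoded with a
   single mod/rm byte (base 4 needs a SIB byte, base 5 means disp32). */
static int encode_addps_mem(unsigned dst, unsigned base, uint8_t out[3])
{
    if (dst > 7u || base > 7u || base == 4u || base == 5u)
        return 0;
    out[0] = 0x0f;                            /* two-byte opcode escape     */
    out[1] = 0x58;                            /* addps                      */
    out[2] = (uint8_t)((dst << 3) | base);    /* bits 7-6 = 00: [register]  */
    return 1;
}
```

With dst = 1 and base = 0 (EAX) the result is 0f 58 08, the encoding of addps xmm1, [eax].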

It would be nice if both of these macro forms could be combined into a single macro, so that you could easily switch from a register to memory pointer. If you define the registers as shown in the table below, the following macro will successfully assemble both forms of the instruction.

Register  Value    Register  Value    Register  Value    Register  Value
_XMM0     0xC0     _MM0      0xC0     EAX_PTR   0x00     EAX_REG   0xC0
_XMM1     0xC1     _MM1      0xC1     EBX_PTR   0x03     EBX_REG   0xC3
_XMM2     0xC2     _MM2      0xC2     ECX_PTR   0x01     ECX_REG   0xC1
_XMM3     0xC3     _MM3      0xC3     EDX_PTR   0x02     EDX_REG   0xC2
_XMM4     0xC4     _MM4      0xC4     ESI_PTR   0x06     ESI_REG   0xC6
_XMM5     0xC5     _MM5      0xC5     EDI_PTR   0x07     EDI_REG   0xC7
_XMM6     0xC6     _MM6      0xC6     ESP_PTR   0x04     ESP_REG   0xC4
_XMM7     0xC7     _MM7      0xC7     EBP_PTR   0x05     EBP_REG   0xC5

#define ADDPS(dst,src) \
{ \
_asm _emit 0x0f \
_asm _emit 0x58 \
_asm _emit (((dst) & 0x3f)<<3) | (src) \
}

This macro is simply used as ADDPS(_XMM0, _XMM1) for the register version, or ADDPS(_XMM0, EAX_PTR) for the memory version. In the KNI.h header file, similar macros are provided for all the new Pentium III instructions.
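To see why one expression can serve both forms, it helps to work through the arithmetic with the table values. The sketch below is plain C, not the contents of KNI.h; the function name modrm_combined is hypothetical, and only a few of the table's constants are reproduced. The idea is that the source operand's value already carries the top two bits (11 for a register, 00 for a pointer), while the destination is masked so that only its register number reaches bits 5-3.

```c
#include <stdint.h>

/* A few operand values copied from the register table above. */
#define _XMM0   0xC0
#define _XMM1   0xC1
#define _XMM7   0xC7
#define EAX_PTR 0x00
#define EBX_PTR 0x03

/* Computes the mod/rm byte for an instruction of the form "op dst, src".
   Masking dst with 0x3f strips its top two bits; the top two bits of the
   result therefore come only from src. */
static uint8_t modrm_combined(unsigned dst, unsigned src)
{
    return (uint8_t)(((dst & 0x3fu) << 3) | src);
}
```

Under these definitions, modrm_combined(_XMM0, _XMM1) is 0xC1 (the register form) and modrm_combined(_XMM1, EAX_PTR) is 0x08 (the memory form), matching the two separate layouts shown earlier.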

Single-register indirect addressing is the only addressing mode that the macros support, which can be restrictive compared to the functionality of a proper assembler. Any other addressing mode can still be achieved by using an LEA instruction to calculate the address into a register and then passing that register to the macro. While this method takes two instructions, it’s usually not too difficult to schedule the LEA between some other instructions where the processor would otherwise have been stalled.

The opcode determines the type of registers used within a given instruction. (The possibilities are shown in the instruction tables above.) However, because the macros cannot perform any error checking, it is possible to assemble what appear to be illegal instructions. For example, the instruction ADDPS(EAX_REG,EBX_REG) is invalid, but it actually assembles to the valid ADDPS xmm0, xmm3 instruction. With this in mind, you have to be very careful when using the macros, because simple typos can lead to bizarre side effects.
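The aliasing is easy to demonstrate in plain C, and the same sketch hints at how a checked encoder could catch it. Everything here is hypothetical and not part of the macros: the Operand type, the OPK_ kind tags and both function names are invented for illustration. The values come from the register table.

```c
#include <stdint.h>

/* Hypothetical operand classes; the raw macros have no such notion. */
typedef enum { OPK_XMM, OPK_INT_REG, OPK_INT_PTR } OperandKind;

typedef struct {
    OperandKind kind;
    uint8_t     value;   /* value from the register table */
} Operand;

static const Operand XMM0_OP    = { OPK_XMM,     0xC0 };
static const Operand XMM3_OP    = { OPK_XMM,     0xC3 };
static const Operand EAX_REG_OP = { OPK_INT_REG, 0xC0 };
static const Operand EBX_REG_OP = { OPK_INT_REG, 0xC3 };
static const Operand EAX_PTR_OP = { OPK_INT_PTR, 0x00 };

/* What the unchecked macro computes: only the numeric values matter. */
static uint8_t raw_modrm(uint8_t dst, uint8_t src)
{
    return (uint8_t)(((dst & 0x3f) << 3) | src);
}

/* Checked variant for ADDPS: the destination must be an XMM register and
   the source an XMM register or a pointer. Returns 1 and stores the
   mod/rm byte on success, 0 if the operand kinds are not legal. */
static int addps_modrm_checked(Operand dst, Operand src, uint8_t *out)
{
    if (dst.kind != OPK_XMM)
        return 0;
    if (src.kind != OPK_XMM && src.kind != OPK_INT_PTR)
        return 0;
    *out = raw_modrm(dst.value, src.value);
    return 1;
}
```

Here raw_modrm gives 0xC3 for both the EAX_REG/EBX_REG pair and the _XMM0/_XMM3 pair, which is exactly the confusion described above, while the checked variant rejects the integer-register pair. A check like this needs type information that a preprocessor macro feeding _emit does not have, so treat it as an idea for an offline tool or test harness rather than a drop-in replacement.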

The only SIMD instructions that can take a memory operand as the destination are the various move instructions, such as MOVAPS or MOVUPS. These move instructions have different opcodes for storing and therefore require a different macro. To keep things simple, the storing version of an instruction has a suffix of _ST. For example, the instruction

MOVAPS [eax], xmm0

becomes

MOVAPS_ST(EAX_PTR,_XMM0)

when using the macros. The KNI.H header file contains macros for all the SIMD instructions and constants for registers.
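For reference, the load form of MOVAPS is opcode 0f 28 /r and the store form is 0f 29 /r. The sketch below shows how the two encodings differ; it is my own plain C illustration and not the contents of the header, and the function names are hypothetical. Note that in the store form the XMM register still occupies bits 5-3 of the mod/rm byte even though it is now the source, so the two macro arguments swap roles in the arithmetic.

```c
#include <stdint.h>

/* A few operand values copied from the register table. */
#define _XMM0   0xC0
#define _XMM1   0xC1
#define EAX_PTR 0x00
#define EBX_PTR 0x03

/* Load form: "movaps xmm_dst, src" where src is a register or pointer. */
static void encode_movaps(unsigned dst, unsigned src, uint8_t out[3])
{
    out[0] = 0x0f;
    out[1] = 0x28;
    out[2] = (uint8_t)(((dst & 0x3fu) << 3) | src);
}

/* Store form: "movaps dst, xmm_src" where dst is a register or pointer.
   The XMM source goes in bits 5-3; the destination supplies the rest. */
static void encode_movaps_st(unsigned dst, unsigned src, uint8_t out[3])
{
    out[0] = 0x0f;
    out[1] = 0x29;
    out[2] = (uint8_t)(((src & 0x3fu) << 3) | dst);
}
```

With these definitions, encode_movaps_st(EAX_PTR, _XMM0, out) yields 0f 29 00, the encoding of MOVAPS [eax], xmm0, and encode_movaps(_XMM0, EAX_PTR, out) yields 0f 28 00 for the corresponding load.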

Note that both Visual C++ and the Intel compiler know which assembly instructions modify which registers. Using this information, the compilers save and restore only those working registers that are actually used within an assembly block, which results in better code. If instructions are emitted directly into the code stream with the _emit operator, the compiler does not know which registers are used and makes no attempt to guess. As a result, you may corrupt a register that the compiler is using and did not save.






Copyright © 2000 CMP Media Inc. All rights reserved.