CMP Game Media Group Presents:
Cracking Open The Pentium III
How do I make use of the new instructions?

The best way to write code for the Pentium III is to use version 4.0 of the Intel C/C++ compiler. This compiler, which comes with Intel’s VTune, is a replacement for the Microsoft C/C++ compiler that Visual C++ uses. The advantage of using the Intel compiler is that you can still use the IDE, debugger, linker and tools that you are familiar with, so there is no learning curve. If you prefer C++, the Intel compiler is also a much better implementation of the language than Microsoft’s version.

The Intel compiler supports the new Pentium III instructions via its vectorizing code generator, which generates SIMD instructions from ordinary source code. The new instructions are also supported within the inline assembler. If you don’t want to code in assembly, there is a new Intel compiler-specific SIMD data type called __m128 and a set of intrinsic C functions, so optimized code can be developed without assembly language. For every SIMD instruction, there is an intrinsic function that does the same thing. For example, __m128 _mm_add_ps(__m128, __m128) adds the two specified SIMD values together and can compile to a single instruction.

The __m128 type can be made into a union with an array of floats if access to the individual floats is required. The only caveat is that this access requires the __m128 to be in memory, because there are no instructions to move floating-point data directly between the SIMD and x87 registers.

Intel also tries to maintain code portability via a C++ abstraction class called F32vec4. This class has inline member functions with the same names as the intrinsic C functions. When used, it generates exactly the same code as the intrinsic functions, so an application’s performance should not be affected. The benefit of using the class is that it can be re-implemented using x87 (or even AMD’s 3DNow!) without having to change your source code.
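As a small illustration of the intrinsic approach and the union trick described above, here is a minimal sketch. It assumes a compiler that ships the `<xmmintrin.h>` header (where the add intrinsic is spelled `_mm_add_ps`); the `Vec4` type and `vec4_add` function are hypothetical names chosen for this example, not part of Intel's headers.

```c
#include <xmmintrin.h>  /* SSE intrinsics: __m128, _mm_add_ps, _mm_set_ps */

/* A __m128 overlaid with four floats. Reading v.f[i] goes through memory,
   which is the caveat mentioned in the text. "Vec4" is a hypothetical name. */
typedef union {
    __m128 v;
    float  f[4];
} Vec4;

/* Adds two 4-float vectors; the compiler can emit a single ADDPS for this. */
static Vec4 vec4_add(Vec4 a, Vec4 b)
{
    Vec4 r;
    r.v = _mm_add_ps(a.v, b.v);
    return r;
}
```

Note that `_mm_set_ps(e3, e2, e1, e0)` takes its arguments from the highest element down, so the last argument lands in `f[0]`.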
For full documentation of the intrinsic types, the C++ classes and the vectorizing compiler options, look at appendix C in volume 2 of the Intel Architecture Reference Manual, or the Intel optimizations reference manual, both of which are available from Intel’s developer web site or on the VTune CD-ROM.

I assume most professional game programmers use Visual C++, and many of them may not want to change compilers just to access the new instructions. If you are in this position, all is not lost. Here are a couple of options for using the new instructions within Visual C++.

First, the Intel compiler produces object files that are binary compatible with Visual C++, including C++ name mangling. With this in mind, you can move the individual C functions or C++ class members that require Pentium III optimizations into separate source files, and then compile only those files with the Intel compiler. To make switching compilers even easier, Intel implemented a #pragma that lets you select the compiler on a source-file basis.

Alternatively, Microsoft has updated MASM via a patch to include these new instructions. The latest MASM version is 6.14, and the patch will update versions 6.11, 6.11a, 6.11d, 6.12 and 6.13. The ML614.exe patch is available from the Microsoft web site at http://support.microsoft.com/download/support/mslfiles/ml614.exe. Intel also provides an include file for MASM that defines macros for the Pentium III instructions, and this include file works with all versions of MASM. The file is called IAXMM.INC and is available from Intel’s developer web site. If your build environment includes MASM, one of these options may be the way to go.

The MASM include file inspired me to build a set of macros that emit the opcode bytes directly into the code stream, thereby allowing any compiler to use the Pentium III instructions. This turned out to be a little more difficult than I anticipated, mainly because of inline assembly code restrictions.
The instruction macros I provide here are not ideal, but they do the job. For small sections of inline assembly code, they are perfectly adequate and can make a huge difference.

To create these macros, I first defined the register names and their respective values. The standard register names are reserved words, so they cannot be used. Further, the SIMD register names will be reserved words in a future version of Visual C++ (they are already reserved words in the Intel compiler), so it’s better not to use them, either. In the end, I decided to call the SIMD registers _XMM0 to _XMM7, and the MMX registers _MM0 to _MM7. The integer registers have two forms, depending on whether they are used as address pointers or not. The pointer versions of the standard registers are called EAX_PTR, EBX_PTR, and so on, and the register versions are called EAX_REG, EBX_REG, and so on.

Opcode 0F 58 /r is the addps instruction, where "/r" means either a register or a memory pointer for the source operand. Fortunately, only registers can be destinations within most SIMD instructions, so there are only two forms of the instructions. Looking at the encoding of the "/r" component, you’ll notice that it’s a standard "mod/rm-sib-offset" (for example, [eax*2+ebx+offset]), just like any other instruction. With this in mind, the register-to-register version of the instruction (addps xmm,xmm) becomes trivial to encode because the one-byte "/r" component (the mod/rm byte) is laid out as follows: bits 7-6 (mod) are 11, bits 5-3 hold the destination register number, and bits 2-0 hold the source register number.
The following macro assembles the instruction:
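As a rough sketch of what such a macro might look like, the comment below shows one possible form for Visual C++'s 32-bit inline assembler using the `_emit` pseudoinstruction. It is an assumption for illustration, not the article's original listing. The portable C function after it computes the same three bytes (0F 58 followed by a mod/rm byte with mod = 11), so the encoding can be checked with any compiler. The name `encode_addps_reg` is hypothetical.

```c
#include <assert.h>

/* One possible _emit-based macro (Visual C++ 32-bit inline assembler only;
   an assumed form, shown here as a comment and joined into one logical
   line with line continuations in real code):

     #define ADDPS_REG(dst, src)
         __asm _emit 0x0F
         __asm _emit 0x58
         __asm _emit (0xC0 | ((dst) << 3) | (src))
*/

/* Writes the three bytes of "addps xmm<dst>, xmm<src>" into out.
   dst and src are register numbers 0..7. */
static void encode_addps_reg(unsigned dst, unsigned src, unsigned char out[3])
{
    assert(dst < 8 && src < 8);
    out[0] = 0x0F;                                      /* two-byte opcode escape */
    out[1] = 0x58;                                      /* ADDPS */
    out[2] = (unsigned char)(0xC0 | (dst << 3) | src);  /* mod=11, reg=dst, r/m=src */
}
```

For example, `addps xmm0, xmm1` comes out as the byte sequence 0F 58 C1.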
This would simply be used as ADDPS_REG(_XMM0,_XMM1) from either inside or outside of an assembly code block. The register-to-register form of the instructions is of no use, however, unless we can also use the memory form of the instructions to load data. If we look at the same instruction with a 32-bit integer register pointing to the data, the "/r" component of the instruction remains a single byte. It is laid out as follows: bits 7-6 (mod) are 00, bits 5-3 hold the destination SIMD register number, and bits 2-0 hold the number of the integer register that contains the address.
Like before, we can define a macro to assemble this instruction:
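A comparable sketch for the memory form follows, again as a portable C function that only computes the bytes (the name `encode_addps_mem` is hypothetical). It assumes the standard IA-32 register numbering (EAX=0, ECX=1, EDX=2, EBX=3, ESP=4, EBP=5, ESI=6, EDI=7). One detail worth hedging: with mod = 00, an r/m value of 4 selects a following SIB byte and 5 selects a 32-bit displacement, so ESP and EBP cannot be used as plain pointers with this one-byte form.

```c
#include <assert.h>

/* Writes the three bytes of "addps xmm<dst>, [<base>]" into out.
   dst is an XMM register number 0..7; base is an integer register number. */
static void encode_addps_mem(unsigned dst, unsigned base, unsigned char out[3])
{
    assert(dst < 8 && base < 8);
    /* r/m = 4 (ESP) means "SIB byte follows" and r/m = 5 (EBP) means
       "disp32 follows" when mod = 00, so neither fits this simple form. */
    assert(base != 4 && base != 5);
    out[0] = 0x0F;                               /* two-byte opcode escape */
    out[1] = 0x58;                               /* ADDPS */
    out[2] = (unsigned char)((dst << 3) | base); /* mod=00, reg=dst, r/m=base */
}
```

For example, `addps xmm1, [eax]` comes out as 0F 58 08.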
This would be used as ADDPS_MEM(_XMM1,EAX_PTR) and would add the 128-bit value pointed to by EAX to the contents of the XMM1 register. It would be nice if both of these macro forms could be combined into a single macro, so that you could easily switch from a register to a memory pointer. If you define the registers as shown in the table below, the following macro will successfully assemble both forms of the instruction.
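One way such a combined scheme could work, offered as an assumption rather than the article's actual table, is to bake the mod bits into the operand constants: register operands carry 0xC0 in their top two bits, pointer operands carry 0x00. The encoder then masks the destination down to its three-bit number and ORs in the source unchanged. All `OP_*` names and `encode_addps` below are hypothetical.

```c
#include <assert.h>

/* Hypothetical operand constants: the top two bits already hold the mod field. */
enum {
    OP_XMM0 = 0xC0, OP_XMM1, OP_XMM2, OP_XMM3,
    OP_XMM4, OP_XMM5, OP_XMM6, OP_XMM7,
    OP_EAX_REG = 0xC0, OP_ECX_REG, OP_EDX_REG, OP_EBX_REG,  /* register use */
    OP_EAX_PTR = 0x00, OP_ECX_PTR, OP_EDX_PTR, OP_EBX_PTR,  /* pointer use  */
    OP_ESI_PTR = 0x06, OP_EDI_PTR
};

/* Writes the three bytes of ADDPS for either operand form into out.
   src supplies the mod and r/m fields; dst supplies only the reg field. */
static void encode_addps(unsigned dst, unsigned src, unsigned char out[3])
{
    assert((dst & 0xC0) == 0xC0);   /* the destination must be a register */
    out[0] = 0x0F;                  /* two-byte opcode escape */
    out[1] = 0x58;                  /* ADDPS */
    out[2] = (unsigned char)(((dst & 7) << 3) | src);
}
```

With these definitions, `encode_addps(OP_XMM0, OP_XMM1, out)` gives a mod/rm byte of C1 and `encode_addps(OP_XMM1, OP_EAX_PTR, out)` gives 08. Because the constants carry no type information, `encode_addps(OP_EAX_REG, OP_EBX_REG, out)` yields C3, the same byte as `addps xmm0, xmm3`.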
This macro is simply used as ADDPS(_XMM0, _XMM1) for the register version, or ADDPS(_XMM0, EAX_PTR) for the memory version. In the KNI.h header file, similar macros are provided for all the new Pentium III instructions, along with constants for the registers.

Single register indirect addressing is the only addressing mode that the macros support, which can be restrictive compared to the functionality of a proper assembler. However, any addressing mode can still be achieved by using an LEA instruction to calculate the address into a register and then using that register in the macro. While this method takes two instructions, it’s usually not too difficult to schedule the LEA between other instructions where the processor would otherwise have been stalled.

The opcode determines the type of registers used within a given instruction. (The possibilities are shown in the instruction tables above.) However, because the macros cannot perform any error checking, it is possible to assemble what appear to be illegal instructions. For example, the instruction ADDPS(EAX_REG,EBX_REG) is invalid, but it actually assembles to the valid ADDPS xmm0, xmm3 instruction, because EAX and EBX have the same register numbers (0 and 3) as XMM0 and XMM3. With this in mind, you have to be very careful when using the macros, because simple typos can lead to bizarre side effects.

The only SIMD instructions that can take a memory operand as the destination are the various move instructions, such as MOVAPS or MOVUPS. These move instructions have different opcodes for storing and therefore require a different macro. To keep things simple, the storing version of an instruction has a suffix of _ST. For example, the instruction MOVAPS [eax], xmm0 becomes MOVAPS_ST(EAX_PTR,_XMM0) when using the macros.

Note that both Visual C++ and the Intel compiler know which registers each assembly instruction modifies. Using this information, the compilers save and restore only those working registers that are actually used within an assembly block, resulting in more efficient code. If instructions are emitted directly into the code stream with the _emit pseudoinstruction, however, the compiler does not know which registers are used and makes no attempt to guess. As a result, you may corrupt a register that the compiler is using and did not save.
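To make the load and store distinction for the _ST macros more concrete, here is a minimal sketch of how the two MOVAPS encodings differ. It assumes the standard opcodes 0F 28 (load) and 0F 29 (store) and the single-register-indirect form discussed earlier; the function names are hypothetical. In both directions the XMM register number sits in the reg field and the pointer register in the r/m field, so only the opcode byte changes.

```c
#include <assert.h>

/* Shared helper: opcode byte plus a mod=00 mod/rm byte.
   xmm is an XMM register number; base is an integer register number
   other than ESP (4) or EBP (5), which need longer encodings. */
static void encode_movaps(unsigned char opcode, unsigned xmm, unsigned base,
                          unsigned char out[3])
{
    assert(xmm < 8 && base < 8 && base != 4 && base != 5);
    out[0] = 0x0F;                                /* two-byte opcode escape */
    out[1] = opcode;                              /* 0x28 load, 0x29 store */
    out[2] = (unsigned char)((xmm << 3) | base);  /* mod=00, reg=xmm, r/m=base */
}

/* movaps xmm<xmm>, [<base>] */
static void encode_movaps_load(unsigned xmm, unsigned base, unsigned char out[3])
{
    encode_movaps(0x28, xmm, base, out);
}

/* movaps [<base>], xmm<xmm> : the form an _ST macro would need to emit */
static void encode_movaps_store(unsigned base, unsigned xmm, unsigned char out[3])
{
    encode_movaps(0x29, xmm, base, out);
}
```

For example, `movaps [eax], xmm0` encodes as 0F 29 00, while `movaps xmm0, [eax]` encodes as 0F 28 00.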
Copyright
© 2000 CMP Media Inc. All rights reserved.