By Rob Wyatt
Gamasutra
May 28, 1999
Vol. 3: Issue 21

Features
Wyatt's World

Cracking Open The Pentium III

Contents

What is all the fuss about?

How do I detect the new instructions?

What operating system support is required for the Pentium III?

What are these new SIMD instructions?

How do I make use of the new instructions?

How do I debug code with the new instructions?

How do I read the new Pentium III serial number?

Is there any new performance/ profiling information?

What are these new SIMD instructions?

The tables below cover all the new Streaming SIMD Instructions for floating-point and integer operations. The integer streaming SIMD instructions are actually extensions to MMX: they work the same way as the existing MMX instructions and use the same registers. Each floating-point operation comes in two forms: a packed form, with mnemonics ending in "PS" (packed single-precision), and a scalar form, with mnemonics ending in "SS" (scalar single-precision). The PS instructions operate on each of the four floating-point elements within an XMM register (Figure 1), whereas the SS instructions operate only on the bottom float, leaving the others untouched (Figure 2). The data is drawn within XMM registers in right-to-left order, so the value on the righthand side is the least significant 32 bits. Note that this can be confusing if you store a vector in memory as [x,y,z,w], because it appears in the register diagrams as [w,z,y,x], with x in the least significant position.

XMM0     8.0    6.0    4.0    2.0
          *      *      *      *
XMM1     3.0    5.0    7.0    9.0
          =      =      =      =
XMM0    24.0   30.0   28.0   18.0

Figure 1. Example of the MULPS xmm0,xmm1 instruction
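As a rough sketch, Figure 1 can be reproduced from C, assuming a compiler that ships the SSE intrinsics header xmmintrin.h. The function name mul_packed is made up for this illustration. Note that the arrays are written lowest element first, so the register drawn as [8.0 6.0 4.0 2.0] is stored in memory as {2.0, 4.0, 6.0, 8.0}.

```c
#include <xmmintrin.h>  /* SSE intrinsics; assumes an SSE-capable compiler and CPU */

/* Hypothetical helper: multiply four floats element by element, as MULPS does. */
void mul_packed(const float a[4], const float b[4], float out[4])
{
    __m128 va = _mm_loadu_ps(a);              /* MOVUPS: unaligned load of 4 floats */
    __m128 vb = _mm_loadu_ps(b);
    _mm_storeu_ps(out, _mm_mul_ps(va, vb));   /* MULPS, then unaligned store */
}
```

With a = {2.0, 4.0, 6.0, 8.0} and b = {9.0, 7.0, 5.0, 3.0}, out holds {18.0, 28.0, 30.0, 24.0}, which is the bottom row of Figure 1 read from right to left.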

 

XMM0     8.0    6.0    4.0    2.0
                               *
XMM1     3.0    5.0    7.0    9.0
          =      =      =      =
XMM0     8.0    6.0    4.0   18.0

Figure 2. Example of the MULSS xmm0,xmm1 instruction
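The scalar form in Figure 2 can be sketched the same way, again assuming xmmintrin.h is available; mul_scalar is a made-up name. Only the lowest element is multiplied, and the upper three elements of the first operand pass through unchanged.

```c
#include <xmmintrin.h>  /* SSE intrinsics; assumes an SSE-capable compiler and CPU */

/* Hypothetical helper: multiply only the lowest float, as MULSS does.
   Elements 1..3 of the result are copied from a. */
void mul_scalar(const float a[4], const float b[4], float out[4])
{
    __m128 va = _mm_loadu_ps(a);
    __m128 vb = _mm_loadu_ps(b);
    _mm_storeu_ps(out, _mm_mul_ss(va, vb));   /* MULSS */
}
```

With the same inputs as Figure 1, a = {2.0, 4.0, 6.0, 8.0} and b = {9.0, 7.0, 5.0, 3.0}, out holds {18.0, 4.0, 6.0, 8.0}.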

 

The next tables show the various Streaming SIMD operations. The two rightmost columns give the issue (throughput) and latency times for each instruction, in clock cycles. For example, ADDPS can be issued every two cycles, and each instruction has a latency of four cycles. Unfortunately, there is a little more to scheduling than these simple timings, because execution port and resource usage must also be taken into account. These numbers give you a rough idea, though. For more information on decode scheduling, see the latest Intel optimization reference manual, available at http://developer.intel.com.

The "Src" and "Dst" columns in the following tables show the possible locations for the source and destination operands of each instruction. The following symbols are used:
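To make the difference between issue and latency concrete, here is a deliberately simplistic cycle estimate. It is only a back-of-the-envelope model under my own assumptions: it ignores execution ports, decode limits, and memory stalls, and the function name estimate_cycles is invented for the example.

```c
/* Hypothetical rough model: cycles for n back-to-back instructions of one kind.
   dependent != 0 means each instruction needs the previous one's result. */
unsigned estimate_cycles(unsigned n, unsigned issue, unsigned latency, int dependent)
{
    if (n == 0)
        return 0;
    if (dependent)
        return n * latency;               /* each one waits for the full latency */
    return (n - 1) * issue + latency;     /* independent: overlap in the pipeline */
}
```

Using the ADDPS figures (issue 2, latency 4), four independent additions come out at (4 - 1) * 2 + 4 = 10 cycles, while a chain of four dependent additions comes out at 4 * 4 = 16 cycles. That gap is why it pays to interleave independent work.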

Xmm (Floating point SIMD Multimedia register)

Mmx (Integer MMX Multimedia register)

Mem (Memory address/Indirect address)

Reg (x86 integer register)

Mathematical operations

Instruction  Description                           Dst  Src      Issue  Latency
ADDPS        Add packed scalar                     Xmm  Xmm/Mem  2      4
ADDSS        Add single scalar                     Xmm  Xmm/Mem  1      3
SUBPS        Subtract packed scalar                Xmm  Xmm/Mem  2      4
SUBSS        Subtract single scalar                Xmm  Xmm/Mem  1      3
MULPS        Multiply packed scalar                Xmm  Xmm/Mem  2      5
MULSS        Multiply single scalar                Xmm  Xmm/Mem  1      4
DIVPS        Divide packed scalar                  Xmm  Xmm/Mem  38     38
DIVSS        Divide single scalar                  Xmm  Xmm/Mem  18     18
SQRTPS       Square root packed scalar             Xmm  Xmm/Mem  58     58
SQRTSS       Square root single scalar             Xmm  Xmm/Mem  30     30
RCPPS        Reciprocal packed scalar              Xmm  Xmm/Mem  2      2
RCPSS        Reciprocal single scalar              Xmm  Xmm/Mem  2      2
RSQRTSS      Reciprocal square root single scalar  Xmm  Xmm/Mem  2      2
RSQRTPS      Reciprocal square root packed scalar  Xmm  Xmm/Mem  2      2
MAXPS        Maximum packed scalar                 Xmm  Xmm/Mem  2      4
MAXSS        Maximum single scalar                 Xmm  Xmm/Mem  1      4
MINPS        Minimum packed scalar                 Xmm  Xmm/Mem  2      4
MINSS        Minimum single scalar                 Xmm  Xmm/Mem  1      3
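The timings above make RCPPS look far more attractive than DIVPS, but RCPPS only returns an approximation of 1/x (Intel's documentation quotes roughly 12 bits of precision). A common way to get both speed and accuracy, sketched here as an assumption on my part rather than anything prescribed by the tables, is to refine the estimate with one Newton-Raphson step. The name reciprocal4 is made up.

```c
#include <xmmintrin.h>  /* SSE intrinsics; assumes an SSE-capable compiler and CPU */

/* Hypothetical helper: approximate 1/x for four floats without DIVPS. */
void reciprocal4(const float x[4], float out[4])
{
    __m128 vx  = _mm_loadu_ps(x);
    __m128 r   = _mm_rcp_ps(vx);          /* RCPPS: fast, low-precision estimate */
    __m128 two = _mm_set1_ps(2.0f);

    /* One Newton-Raphson refinement: r' = r * (2 - x * r) */
    r = _mm_mul_ps(r, _mm_sub_ps(two, _mm_mul_ps(vx, r)));

    _mm_storeu_ps(out, r);
}
```

The refinement costs two MULPS and one SUBPS, which by the table is still much cheaper than a single DIVPS. Whether the resulting precision is good enough depends on your application, and inputs of zero or infinity need separate handling.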

 

Conversion operations

Instruction  Description                                            Dst  Src      Issue  Latency
CVTPI2PS     Convert packed integer to packed scalar                Xmm  Mmx/Mem  1      3
CVTSI2SS     Convert single integer to single scalar                Xmm  Reg/Mem  2      4
CVTPS2PI     Convert packed scalar to packed integer                Mmx  Xmm/Mem  1      3
CVTSS2SI     Convert single scalar to single integer                Reg  Xmm/Mem  1      3
CVTTPS2PI    Convert packed scalar to packed integer, with truncate Mmx  Xmm/Mem  1      3
CVTTSS2SI    Convert single scalar to single integer, with truncate Reg  Xmm/Mem  1      3
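The difference between the plain and the "with truncate" conversions is easiest to see in a small sketch. This assumes the SSE intrinsics header and the default rounding mode (round to nearest); the two function names are made up.

```c
#include <xmmintrin.h>  /* SSE intrinsics; assumes an SSE-capable compiler and CPU */

/* Hypothetical helper: CVTSS2SI rounds according to the current MXCSR mode. */
int round_to_int(float x)
{
    return _mm_cvtss_si32(_mm_set_ss(x));
}

/* Hypothetical helper: CVTTSS2SI always truncates toward zero. */
int truncate_to_int(float x)
{
    return _mm_cvttss_si32(_mm_set_ss(x));
}
```

With the default rounding mode, round_to_int(2.7f) gives 3 while truncate_to_int(2.7f) gives 2. The truncating form matches what a C cast from float to int does, which makes it handy for replacing slow float-to-int casts.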

 

Move operations

Instruction     Description                                         Dst  Src      Issue  Latency
MOVAPS (load)   Move from aligned memory to XMM register            Xmm  Mem      2      4
MOVAPS (reg)    Move XMM register to XMM register                   Xmm  Xmm      1      1
MOVAPS (store)  Store from XMM register to aligned memory           Mem  Xmm      2      4
MOVUPS (load)   Load from unaligned memory to XMM register          Xmm  Mem      2      4
MOVUPS (store)  Store from XMM register to unaligned memory         Mem  Xmm      3      5
MOVSS (load)    Load single scalar                                  Xmm  Mem      1      1
MOVSS (reg)     Move single scalar                                  Xmm  Xmm      1      1
MOVSS (store)   Store single scalar                                 Mem  Xmm      1      1
MOVMSKPS        Move MSB of packed scalars to integer register      Reg  Xmm      1      1
MOVLHPS         Move low 2 packed scalars to high position          Xmm  Xmm      1      3
MOVHLPS         Move high 2 packed scalars to low position          Xmm  Xmm      1      3
MOVLPS (load)   Load 2 packed scalars to low position               Xmm  Mem      1      3
MOVLPS (reg)    Move 2 packed scalars in low position               Xmm  Xmm      1      1
MOVLPS (store)  Store 2 packed scalars in low position to memory    Mem  Xmm      1      3
MOVHPS (load)   Load 2 packed scalars to high position              Xmm  Mem      1      3
MOVHPS (reg)    Move 2 packed scalars in high position              Xmm  Xmm      1      1
MOVHPS (store)  Store 2 packed scalars in high position to memory   Mem  Xmm      1      3
MOVNTPS         Store XMM register to aligned memory, non temporal  Mem  Xmm      2      4
SHUFPS          Shuffle single scalar within packed                 Xmm  Xmm/Mem  2      2
UNPCKLPS        Unpack low                                          Xmm  Xmm/Mem  2      3
UNPCKHPS        Unpack high                                         Xmm  Xmm/Mem  2      3
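SHUFPS is the workhorse for rearranging elements, so here is a small sketch that reverses a four-float vector, turning [x,y,z,w] into [w,z,y,x]. It assumes the SSE intrinsics header and its _MM_SHUFFLE macro; reverse4 is a made-up name.

```c
#include <xmmintrin.h>  /* SSE intrinsics; assumes an SSE-capable compiler and CPU */

/* Hypothetical helper: reverse the order of four floats with one SHUFPS. */
void reverse4(const float in[4], float out[4])
{
    __m128 v = _mm_loadu_ps(in);                 /* MOVUPS load */

    /* _MM_SHUFFLE lists source indices for result elements 3,2,1,0.
       Here result[0] = v[3], result[1] = v[2], result[2] = v[1], result[3] = v[0]. */
    __m128 r = _mm_shuffle_ps(v, v, _MM_SHUFFLE(0, 1, 2, 3));

    _mm_storeu_ps(out, r);                       /* MOVUPS store */
}
```

The unaligned loads and stores keep the sketch safe for any pointer. If you can guarantee 16-byte alignment, the table suggests MOVAPS (the _mm_load_ps and _mm_store_ps intrinsics) is the cheaper choice for stores.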

 

Compare operations

Instruction  Description                                     Dst  Src      Issue  Latency
CMPPS        Compare packed scalar                           Xmm  Xmm/Mem  2      4
CMPSS        Compare single scalar                           Xmm  Xmm/Mem  1      3
COMISS       Compare single scalar and set EFLAGS            --   Xmm/Mem  1      1
UCOMISS      Unordered compare single scalar and set EFLAGS  --   Xmm/Mem  1      1
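The two styles of comparison behave quite differently: CMPPS writes an all-ones or all-zeros mask into each element of the destination, while COMISS sets EFLAGS so that you can branch on the result. A minimal sketch, assuming the SSE intrinsics header (both function names are invented):

```c
#include <xmmintrin.h>  /* SSE intrinsics; assumes an SSE-capable compiler and CPU */

/* Hypothetical helper: bit i of the result is set when a[i] < b[i]. */
int less_than_mask(const float a[4], const float b[4])
{
    /* CMPPS with the "less than" predicate yields a per-element mask. */
    __m128 mask = _mm_cmplt_ps(_mm_loadu_ps(a), _mm_loadu_ps(b));

    /* MOVMSKPS gathers the top bit of each element into an integer. */
    return _mm_movemask_ps(mask);
}

/* Hypothetical helper: returns 1 when a[0] < b[0], using COMISS. */
int first_less(const float a[4], const float b[4])
{
    return _mm_comilt_ss(_mm_load_ss(a), _mm_load_ss(b));
}
```

For a = {1, 5, 3, 7} and b = {2, 2, 4, 4}, elements 0 and 2 compare less, so less_than_mask returns binary 0101, which is 5.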

 

Logical operations

Instruction  Description                 Dst  Src      Issue  Latency
ANDNPS       And Not packed scalar       Xmm  Xmm/Mem  2      2
ANDPS        And packed scalar           Xmm  Xmm/Mem  2      2
ORPS         Or packed scalar            Xmm  Xmm/Mem  2      2
XORPS        Exclusive or packed scalar  Xmm  Xmm/Mem  2      2
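Bitwise operations on floats may look odd, but they are useful for manipulating the sign bit directly. As one illustration (my own choice of example, assuming the SSE intrinsics header), ANDNPS can clear the sign bit of all four elements to compute absolute values. The name abs4 is made up.

```c
#include <xmmintrin.h>  /* SSE intrinsics; assumes an SSE-capable compiler and CPU */

/* Hypothetical helper: absolute value of four floats without branching. */
void abs4(const float x[4], float out[4])
{
    /* -0.0f has only the sign bit (bit 31) set in each element. */
    __m128 sign = _mm_set1_ps(-0.0f);

    /* ANDNPS computes (~sign) & x, which clears the sign bit. */
    _mm_storeu_ps(out, _mm_andnot_ps(sign, _mm_loadu_ps(x)));
}
```

The same sign mask with XORPS would negate every element, and with ANDPS it would extract the signs.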

 

Memory operations

Instruction  Description                             Dst  Src  Issue  Latency
PREFETCHT0   Prefetch using T0 hint                  --   Mem  1      2
PREFETCHT1   Prefetch using T1 hint                  --   Mem  1      2
PREFETCHT2   Prefetch using T2 hint                  --   Mem  1      2
PREFETCHNTA  Prefetch using NTA hint (Non temporal)  --   Mem  1      2
SFENCE       Store fence                             --   --   1      3
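These instructions are typically combined with the non-temporal store MOVNTPS from the move table. The sketch below copies a block of floats while prefetching ahead and bypassing the cache on the way out. It assumes the SSE intrinsics header; the function name stream_copy and the prefetch distance of 64 floats are arbitrary choices of mine and would need tuning on real hardware.

```c
#include <stddef.h>
#include <xmmintrin.h>  /* SSE intrinsics; assumes an SSE-capable compiler and CPU */

#define PREFETCH_AHEAD 64  /* hypothetical distance in floats, not a tuned value */

/* Hypothetical helper: copy count floats with non-temporal stores.
   dst must be 16-byte aligned and count must be a multiple of 4. */
void stream_copy(float *dst, const float *src, size_t count)
{
    for (size_t i = 0; i < count; i += 4) {
        /* PREFETCHNTA: hint that data further along will be needed soon. */
        if (i + PREFETCH_AHEAD < count)
            _mm_prefetch((const char *)(src + i + PREFETCH_AHEAD), _MM_HINT_NTA);

        /* MOVNTPS: store 4 floats without polluting the cache. */
        _mm_stream_ps(dst + i, _mm_loadu_ps(src + i));
    }

    /* SFENCE: make the streamed stores visible before anything that follows. */
    _mm_sfence();
}
```

The fence matters because non-temporal stores are weakly ordered. Without it, another agent reading the destination could see stale data.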

 

Integer/MMX operations

Instruction  Description                         Dst  Src      Issue  Latency
PSHUFW       Packed shuffle word                 Mmx  Mmx/Mem  1      1
PEXTRW       Extract word                        Reg  Mmx      2      2
PINSRW       Insert word                         Mmx  Reg/Mem  1      4
PMINUB       Packed minimum unsigned byte        Mmx  Mmx/Mem  ½      1
PMINSW       Packed minimum signed word          Mmx  Mmx/Mem  ½      1
PMAXUB       Packed maximum unsigned byte        Mmx  Mmx/Mem  ½      1
PMAXSW       Packed maximum signed word          Mmx  Mmx/Mem  ½      1
PMOVMSKB     Move byte mask to integer register  Reg  Mmx      1      1
PSADBW       Packed sum of absolute differences  Mmx  Mmx/Mem  2      5
PAVGW        Packed average word                 Mmx  Mmx/Mem  ½      1
PAVGB        Packed average byte                 Mmx  Mmx/Mem  ½      1
PMULHUW      Packed multiply high                Mmx  Mmx/Mem  1      3
MOVNTQ       Move QWORD non temporal             Mem  Mmx      1      3
MASKMOVQ     Byte mask write                     Mmx  Mmx      1      4
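Two of the MMX extensions are sketched below: PSADBW, which sums the absolute differences of eight byte pairs, and PAVGB, which averages eight byte pairs with rounding up. This assumes a compiler that exposes the MMX register type __m64 through xmmintrin.h (some 64-bit compilers do not), and the function names sad8 and avg8 are made up.

```c
#include <stdint.h>
#include <string.h>
#include <xmmintrin.h>  /* also pulls in the MMX intrinsics; assumes __m64 support */

/* Hypothetical helper: sum of |a[i] - b[i]| over eight bytes, via PSADBW. */
unsigned sad8(const uint8_t a[8], const uint8_t b[8])
{
    __m64 va, vb;
    memcpy(&va, a, 8);
    memcpy(&vb, b, 8);

    /* The 16-bit sum lands in the low word; the rest of the register is zero. */
    unsigned result = (unsigned)_mm_cvtsi64_si32(_mm_sad_pu8(va, vb));

    _mm_empty();  /* EMMS: leave MMX state so x87 code can run afterwards */
    return result;
}

/* Hypothetical helper: out[i] = (a[i] + b[i] + 1) >> 1, via PAVGB. */
void avg8(const uint8_t a[8], const uint8_t b[8], uint8_t out[8])
{
    __m64 va, vb, avg;
    memcpy(&va, a, 8);
    memcpy(&vb, b, 8);
    avg = _mm_avg_pu8(va, vb);
    memcpy(out, &avg, 8);
    _mm_empty();
}
```

Because these instructions use the MMX registers, the usual MMX rule still applies: execute EMMS before returning to x87 floating-point code.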

 

Control operations

Instruction  Description                             Dst  Src  Issue/Latency
FXSAVE       Store extended state (FP/MMX and SIMD)  Mem  --   m-code
FXRSTOR      Load extended state (FP/MMX and SIMD)   --   Mem  m-code
LDMXCSR      Load 32 bits of SIMD status/control     --   Mem  m-code
STMXCSR      Store 32 bits of SIMD status/control    Mem  --   m-code
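The SIMD status/control register (MXCSR) holds, among other things, the rounding mode used by the non-truncating conversions. A minimal sketch of reading and changing it, assuming the SSE intrinsics header and its _MM_ROUND_* constants (the function name set_rounding_mode is invented):

```c
#include <xmmintrin.h>  /* SSE intrinsics; assumes an SSE-capable compiler and CPU */

/* Hypothetical helper: switch the SIMD rounding mode and return the old one.
   mode is one of _MM_ROUND_NEAREST, _MM_ROUND_DOWN, _MM_ROUND_UP,
   or _MM_ROUND_TOWARD_ZERO. */
unsigned set_rounding_mode(unsigned mode)
{
    unsigned csr = _mm_getcsr();                          /* STMXCSR */
    unsigned old = csr & _MM_ROUND_MASK;

    /* Replace only the rounding-control bits, keep everything else. */
    _mm_setcsr((csr & ~(unsigned)_MM_ROUND_MASK) | mode); /* LDMXCSR */
    return old;
}
```

A typical use is to save the old mode, do the work that needs a specific rounding behavior, and then restore it, since the setting affects all subsequent SIMD floating-point code on that thread.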

What disappoints me about this instruction set is that there are no instructions that operate across the elements of a single register, which is what you need to calculate, for instance, a dot product. Although a dot product can be calculated by shuffling, a dedicated dot product instruction would have been very useful.
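For reference, here is one way the shuffle-based dot product might look. It is a sketch under the assumption that the SSE intrinsics header is available, not the only or necessarily the fastest arrangement, and dot4 is a made-up name.

```c
#include <xmmintrin.h>  /* SSE intrinsics; assumes an SSE-capable compiler and CPU */

/* Hypothetical helper: 4-element dot product using MULPS plus shuffles. */
float dot4(const float a[4], const float b[4])
{
    /* p = [p3 p2 p1 p0], the four element-wise products. */
    __m128 p = _mm_mul_ps(_mm_loadu_ps(a), _mm_loadu_ps(b));

    /* Swap neighbours: [p2 p3 p0 p1], then add to get pair sums. */
    __m128 swapped = _mm_shuffle_ps(p, p, _MM_SHUFFLE(2, 3, 0, 1));
    __m128 pairs   = _mm_add_ps(p, swapped);   /* element 0 = p0+p1, element 2 = p2+p3 */

    /* MOVHLPS brings the upper pair sum down to element 0. */
    __m128 upper = _mm_movehl_ps(pairs, pairs);

    /* ADDSS finishes the sum in the lowest element. */
    return _mm_cvtss_f32(_mm_add_ss(pairs, upper));
}
```

For a = {1, 2, 3, 4} and b = {5, 6, 7, 8} this returns 70. The shuffles and extra additions are exactly the overhead a dedicated instruction would have removed.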

There has been talk on the Internet that a thermal noise random number generator is present within the Pentium III. Although this would be very useful, I cannot find any trace of it. If you know anything about it, let me know.




Copyright © 2000 CMP Media Inc. All rights reserved.