




By Rob Wyatt
Gamasutra
May 28, 1999
Vol. 3: Issue 21

Wyatt's World

Cracking Open The Pentium III

What are these new SIMD instructions?
The tables below cover all the new Streaming SIMD instructions for
floating-point and integer operations. The integer Streaming SIMD
instructions are actually extensions to MMX: they work the same way as
the existing MMX instructions and use the same registers. Most of the
floating-point arithmetic operations come in two forms: a packed format,
indicated by instructions ending in "PS", and a single format, indicated
by instructions ending in "SS". The PS instructions operate on each of
the four floating-point elements within an XMM register (Figure 1),
whereas the SS instructions operate only on the bottom float, leaving
the others untouched (Figure 2). The data is shown within XMM registers
in right-to-left order, so the value on the right-hand side occupies the
least significant 32 bits. Note that this can be confusing if you store
a vector in memory as [x,y,z,w], because once loaded it appears in the
register as [w,z,y,x].
| XMM0 | 8.0 | 6.0 | 4.0 | 2.0 |
| | * | * | * | * |
| XMM1 | 3.0 | 5.0 | 7.0 | 9.0 |
| | = | = | = | = |
| XMM0 | 24.0 | 30.0 | 28.0 | 18.0 |
Figure 1. Example of the MULPS xmm0,xmm1 instruction

| XMM0 | 8.0 | 6.0 | 4.0 | 2.0 |
| | | | | * |
| XMM1 | 3.0 | 5.0 | 7.0 | 9.0 |
| | = | = | = | = |
| XMM0 | 8.0 | 6.0 | 4.0 | 18.0 |
Figure 2. Example of the MULSS xmm0,xmm1 instruction

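To make the packed/single distinction and the element ordering concrete, here is a minimal portable C model of the two multiplies shown in Figures 1 and 2. It is only a sketch of the behavior described above, not real SSE code: the xmm_t type and the model_load, model_mulps and model_mulss names are invented for this illustration.

```c
#include <stddef.h>

/* Hypothetical model of one 128-bit XMM register: four 32-bit floats.
   lane[0] is the least significant element, which the figures draw on
   the right-hand side. */
typedef struct { float lane[4]; } xmm_t;

/* Models a 16-byte load: the float at the lowest address lands in
   lane 0, so memory [x,y,z,w] is drawn in a register as [w,z,y,x]. */
xmm_t model_load(const float mem[4])
{
    xmm_t r;
    for (size_t i = 0; i < 4; ++i)
        r.lane[i] = mem[i];
    return r;
}

/* Models MULPS dst,src: all four lanes are multiplied. */
xmm_t model_mulps(xmm_t dst, xmm_t src)
{
    for (size_t i = 0; i < 4; ++i)
        dst.lane[i] *= src.lane[i];
    return dst;
}

/* Models MULSS dst,src: only lane 0 is multiplied; lanes 1..3 of dst
   pass through unchanged. */
xmm_t model_mulss(xmm_t dst, xmm_t src)
{
    dst.lane[0] *= src.lane[0];
    return dst;
}
```

Loading {2.0, 4.0, 6.0, 8.0} and {9.0, 7.0, 5.0, 3.0} from memory reproduces the XMM0 and XMM1 contents of the figures; model_mulps then gives the Figure 1 result and model_mulss the Figure 2 result.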
The following tables list the various Streaming SIMD operations. The two
columns on the far right side of each table give the issue (throughput)
and latency times for each instruction, in cycles. For example, a new
ADDPS can be issued every two cycles, and each one takes four cycles to
produce its result. Unfortunately, there is a little more to scheduling
than these simple timings, because the execution port and resource usage
must be taken into account. These numbers give you a rough idea, though.
For more information on decode scheduling, see the latest Intel
optimization reference manual, available at http://developer.intel.com.
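As a back-of-the-envelope illustration of how the issue and latency figures combine, the following C sketch estimates cycle counts for a run of identical instructions. It is a deliberately simplified model that ignores decode limits, execution ports and everything else the optimization manual covers, and both function names are invented here. It also takes whole-cycle issue times only, so the ½ entries in the integer table do not fit it.

```c
/* Rough estimate for n independent instructions of one kind: a new one
   can start every `issue` cycles, and the last result appears `latency`
   cycles after that instruction starts. */
unsigned independent_cycles(unsigned n, unsigned issue, unsigned latency)
{
    if (n == 0)
        return 0;
    return (n - 1) * issue + latency;
}

/* Rough estimate for a chain of n instructions in which each one needs
   the previous result: nothing overlaps, so the latencies add up. */
unsigned dependent_cycles(unsigned n, unsigned latency)
{
    return n * latency;
}
```

With the ADDPS figures (issue 2, latency 4), four independent adds come to roughly 3*2 + 4 = 10 cycles under this model, while four adds that each depend on the previous result come to 4*4 = 16. That gap is why interleaving independent work matters.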
The "Src" and "Dst" columns in the following tables show the possible
locations of the source and destination operands of each instruction.
The following symbols are used:
Xmm (floating-point SIMD multimedia register)
Mmx (integer MMX multimedia register)
Mem (memory address/indirect address)
Reg (x86 integer register)

Mathematical operations
| Instruction | Description | Dst | Src | Issue | Latency |
| ADDPS | Add packed scalar | Xmm | Xmm/Mem | 2 | 4 |
| ADDSS | Add single scalar | Xmm | Xmm/Mem | 1 | 3 |
| SUBPS | Subtract packed scalar | Xmm | Xmm/Mem | 2 | 4 |
| SUBSS | Subtract single scalar | Xmm | Xmm/Mem | 1 | 3 |
| MULPS | Multiply packed scalar | Xmm | Xmm/Mem | 2 | 5 |
| MULSS | Multiply single scalar | Xmm | Xmm/Mem | 1 | 4 |
| DIVPS | Divide packed scalar | Xmm | Xmm/Mem | 38 | 38 |
| DIVSS | Divide single scalar | Xmm | Xmm/Mem | 18 | 18 |
| SQRTPS | Square root packed scalar | Xmm | Xmm/Mem | 58 | 58 |
| SQRTSS | Square root single scalar | Xmm | Xmm/Mem | 30 | 30 |
| RCPPS | Reciprocal packed scalar | Xmm | Xmm/Mem | 2 | 2 |
| RCPSS | Reciprocal single scalar | Xmm | Xmm/Mem | 2 | 2 |
| RSQRTSS | Reciprocal square root single scalar | Xmm | Xmm/Mem | 2 | 2 |
| RSQRTPS | Reciprocal square root packed scalar | Xmm | Xmm/Mem | 2 | 2 |
| MAXPS | Maximum packed scalar | Xmm | Xmm/Mem | 2 | 4 |
| MAXSS | Maximum single scalar | Xmm | Xmm/Mem | 1 | 4 |
| MINPS | Minimum packed scalar | Xmm | Xmm/Mem | 2 | 4 |
| MINSS | Minimum single scalar | Xmm | Xmm/Mem | 1 | 3 |

Conversion operations
| Instruction | Description | Dst | Src | Issue | Latency |
| CVTPI2PS | Convert packed integer to packed scalar | Xmm | Mmx/Mem | 1 | 3 |
| CVTSI2SS | Convert single integer to single scalar | Xmm | Reg/Mem | 2 | 4 |
| CVTPS2PI | Convert packed scalar to packed integer | Mmx | Xmm/Mem | 1 | 3 |
| CVTSS2SI | Convert single scalar to single integer | Reg | Xmm/Mem | 1 | 3 |
| CVTTPS2PI | Convert packed scalar to packed integer, with truncate | Mmx | Xmm/Mem | 1 | 3 |
| CVTTSS2SI | Convert single scalar to single integer, with truncate | Reg | Xmm/Mem | 1 | 3 |

Move operations
| Instruction | Description | Dst | Src | Issue | Latency |
| MOVAPS (load) | Move from aligned memory to XMM register | Xmm | Mem | 2 | 4 |
| MOVAPS (reg) | Move XMM register to XMM register | Xmm | Xmm | 1 | 1 |
| MOVAPS (store) | Store from XMM register to aligned memory | Mem | Xmm | 2 | 4 |
| MOVUPS (load) | Load from unaligned memory to XMM register | Xmm | Mem | 2 | 4 |
| MOVUPS (store) | Store from XMM register to unaligned memory | Mem | Xmm | 3 | 5 |
| MOVSS (load) | Load single scalar | Xmm | Mem | 1 | 1 |
| MOVSS (reg) | Move single scalar | Xmm | Xmm | 1 | 1 |
| MOVSS (store) | Store single scalar | Mem | Xmm | 1 | 1 |
| MOVMSKPS | Move MSB of packed scalars to integer register | Reg | Xmm | 1 | 1 |
| MOVLHPS | Move low 2 packed scalars to high position | Xmm | Xmm | 1 | 3 |
| MOVHLPS | Move high 2 packed scalars to low position | Xmm | Xmm | 1 | 3 |
| MOVLPS (load) | Load 2 packed scalars to low position | Xmm | Mem | 1 | 3 |
| MOVLPS (reg) | Move 2 packed scalars in low position | Xmm | Xmm | 1 | 1 |
| MOVLPS (save) | Save 2 packed scalars in low position to memory | Mem | Xmm | 1 | 3 |
| MOVHPS (load) | Load 2 packed scalars to high position | Xmm | Mem | 1 | 3 |
| MOVHPS (reg) | Move 2 packed scalars in high position | Xmm | Xmm | 1 | 1 |
| MOVHPS (save) | Save 2 packed scalars in high position to memory | Mem | Xmm | 1 | 3 |
| MOVNTPS | Store XMM register to aligned memory, non temporal | Mem | Xmm | 2 | 4 |
| SHUFPS | Shuffle single scalar within packed | Xmm | Xmm/Mem | 2 | 2 |
| UNPCKLPS | Unpack low | Xmm | Xmm/Mem | 2 | 3 |
| UNPCKHPS | Unpack high | Xmm | Xmm/Mem | 2 | 3 |

Compare operations
| Instruction | Description | Dst | Src | Issue | Latency |
| CMPPS | Compare packed scalar | Xmm | Xmm/Mem | 2 | 4 |
| CMPSS | Compare single scalar | Xmm | Xmm/Mem | 1 | 3 |
| COMISS | Compare single scalar and set EFLAGS | -- | Xmm/Mem | 1 | 1 |
| UCOMISS | Unordered compare single scalar and set EFLAGS | -- | Xmm/Mem | 1 | 1 |

Logical operations
| Instruction | Description | Dst | Src | Issue | Latency |
| ANDNPS | And Not packed scalar | Xmm | Xmm/Mem | 2 | 2 |
| ANDPS | And packed scalar | Xmm | Xmm/Mem | 2 | 2 |
| ORPS | Or packed scalar | Xmm | Xmm/Mem | 2 | 2 |
| XORPS | Exclusive or packed scalar | Xmm | Xmm/Mem | 2 | 2 |

Memory operations
| Instruction | Description | Dst | Src | Issue | Latency |
| PREFETCHT0 | Prefetch using T0 hint | -- | Mem | 1 | 2 |
| PREFETCHT1 | Prefetch using T1 hint | -- | Mem | 1 | 2 |
| PREFETCHT2 | Prefetch using T2 hint | -- | Mem | 1 | 2 |
| PREFETCHNTA | Prefetch using NTA hint (non temporal) | -- | Mem | 1 | 2 |
| SFENCE | Store fence | -- | -- | 1 | 3 |

Integer/MMX operations
| Instruction | Description | Dst | Src | Issue | Latency |
| PSHUFW | Packed shuffle word | Mmx | Mmx/Mem | 1 | 1 |
| PEXTRW | Extract word | Reg | Mmx | 2 | 2 |
| PINSRW | Insert word | Mmx | Reg/Mem | 1 | 4 |
| PMINUB | Packed minimum unsigned byte | Mmx | Mmx/Mem | ½ | 1 |
| PMINSW | Packed minimum signed word | Mmx | Mmx/Mem | ½ | 1 |
| PMAXUB | Packed maximum unsigned byte | Mmx | Mmx/Mem | ½ | 1 |
| PMAXSW | Packed maximum signed word | Mmx | Mmx/Mem | ½ | 1 |
| PMOVMSKB | Move byte mask to integer register | Reg | Mmx | 1 | 1 |
| PSADBW | Packed sum of absolute differences | Mmx | Mmx/Mem | 2 | 5 |
| PAVGW | Packed average word | Mmx | Mmx/Mem | ½ | 1 |
| PAVGB | Packed average byte | Mmx | Mmx/Mem | ½ | 1 |
| PMULHUW | Packed multiply high unsigned word | Mmx | Mmx/Mem | 1 | 3 |
| MOVNTQ | Move QWORD non temporal | Mem | Mmx | 1 | 3 |
| MASKMOVQ | Byte mask write | Mmx | Mmx | 1 | 4 |

Control operations
| Instruction | Description | Dst | Src | Issue/Latency |
| FXSAVE | Store extended state (FP/MMX and SIMD) | Mem | -- | m-code |
| FXRSTOR | Load extended state (FP/MMX and SIMD) | -- | Mem | m-code |
| LDMXCSR | Load 32 bits of SIMD status/control | -- | Mem | m-code |
| STMXCSR | Store 32 bits of SIMD status/control | Mem | -- | m-code |

What disappoints me about this instruction set is that there are no
instructions that operate across the elements within a single register
to calculate, for instance, a dot product. Although a dot product can be
computed by shuffling, a dedicated dot product instruction would have
been very useful.
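For readers wondering what "by shuffling" looks like, here is one possible sequence modeled in portable C: a MULPS followed by two SHUFPS/ADDPS pairs that fold the four products together. This is a sketch of one common approach rather than a recommended or optimal sequence. The xmm_t type and the model_* names are invented for this illustration, and the two shuffle constants are simply ones that happen to swap the register halves and then adjacent pairs.

```c
/* Hypothetical model of an XMM register; lane[0] is least significant. */
typedef struct { float lane[4]; } xmm_t;

/* Models MULPS dst,src. */
xmm_t model_mulps(xmm_t dst, xmm_t src)
{
    for (int i = 0; i < 4; ++i)
        dst.lane[i] *= src.lane[i];
    return dst;
}

/* Models ADDPS dst,src. */
xmm_t model_addps(xmm_t dst, xmm_t src)
{
    for (int i = 0; i < 4; ++i)
        dst.lane[i] += src.lane[i];
    return dst;
}

/* Models SHUFPS dst,src,imm8: the two low result lanes are picked from
   dst and the two high result lanes from src, two selector bits each. */
xmm_t model_shufps(xmm_t dst, xmm_t src, unsigned imm8)
{
    xmm_t r;
    r.lane[0] = dst.lane[imm8 & 3];
    r.lane[1] = dst.lane[(imm8 >> 2) & 3];
    r.lane[2] = src.lane[(imm8 >> 4) & 3];
    r.lane[3] = src.lane[(imm8 >> 6) & 3];
    return r;
}

/* Four-element dot product: multiply, then fold the lanes together. */
float model_dot4(xmm_t a, xmm_t b)
{
    xmm_t p = model_mulps(a, b);        /* [p0, p1, p2, p3]            */
    xmm_t t = model_shufps(p, p, 0x4E); /* [p2, p3, p0, p1]            */
    p = model_addps(p, t);              /* lanes 0 and 2 hold p0+p2,
                                           lanes 1 and 3 hold p1+p3    */
    t = model_shufps(p, p, 0xB1);       /* swap lanes within each pair */
    p = model_addps(p, t);              /* every lane holds the sum    */
    return p.lane[0];
}
```

Going by the tables above, that is one multiply, two shuffles and two adds, each depending on the one before, plus whatever register copies the real code needs. That cost is the reason a single dot product instruction would have been welcome.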
There has been talk on the
Internet that a thermal noise random number generator is present within
the Pentium III. Although this would be very useful, I cannot find any
trace of it. If you know anything about it, let me know.