Data Movement and Data Manipulation
For everyday data movement, the Streaming SIMD Extensions provide move instructions. The movaps (move aligned packed single) and movups (move unaligned packed single) instructions transfer 128 bits of packed data between memory and an XMM register, or between two XMM registers. The faster movaps instruction can be used when the data is aligned on a 16-byte boundary.

Remember that this is a four-way parallel instruction set, and we want to get as much parallelism out of the code as we can. If your data isn’t laid out in memory four in a row, some data manipulation may be required. Because we’re using a packed data type, the instruction set has to provide ways to get the data into the correct format for optimal use. To that end, it includes instructions for data manipulations such as shuffles, 64-bit moves, packing and unpacking, and inserts and extracts.

For instance, say you want to perform simple dot products. In most 3D engines, data is laid out in a simple structure like this (where ‘w’ = 1):

    struct vertice {
        float x, y, z, w;
    };

Then the following code performs the dot products:

    for (i=0;...) {
        FR3 = ((X*m00) + (Y*m01) + (Z*m02) + mat03);
    }

This performs the operations shown in Figure 6.
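To make the cost of this layout concrete, here is a minimal sketch in C using the SSE intrinsics from `xmmintrin.h`. The names `vertex4` and `dot_row_aos` are hypothetical, and the reduction sequence is one possible choice rather than the exact sequence drawn in Figure 6. It uses unaligned loads (movups); if the data is known to be 16-byte aligned, `_mm_load_ps` (movaps) could be used instead.

```c
#include <xmmintrin.h>  /* SSE intrinsics */

/* Hypothetical type mirroring the article's structure, with w = 1. */
typedef struct { float x, y, z, w; } vertex4;

/* One dot product of a matrix row {m00, m01, m02, m03} with one vertex
   stored as x, y, z, w in consecutive memory. */
static float dot_row_aos(const float row[4], const vertex4 *v)
{
    __m128 r    = _mm_loadu_ps(row);           /* movups: no alignment needed */
    __m128 p    = _mm_loadu_ps((const float *)v);
    __m128 prod = _mm_mul_ps(r, p);            /* 4 multiplies; the w lane is m03 * 1 */

    /* Horizontal sum: extra shuffles and adds that exist only because
       the four products sit in one register. */
    __m128 swapped = _mm_shuffle_ps(prod, prod, _MM_SHUFFLE(2, 3, 0, 1));
    __m128 pairs   = _mm_add_ps(prod, swapped);     /* lane 0: p0+p1, lane 2: p2+p3 */
    __m128 upper   = _mm_movehl_ps(pairs, pairs);   /* move lane 2 down to lane 0 */
    __m128 sum     = _mm_add_ss(pairs, upper);      /* lane 0: p0+p1+p2+p3 */
    return _mm_cvtss_f32(sum);
}
```

Only one scalar result comes out of a four-wide multiply here, which is the inefficiency the rest of this section works to remove.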
In Figure 6, you can see that we’re wasting 25% of our execution bandwidth in the multiply (we really only have to do three multiplies, assuming w=1), and we suffer from the additional overhead of three shuffles and three adds to get the final result. Optimally, the data should be set up in a parallel format, so that the four dot products could be done with three multiplies and three adds, as shown in Figure 7. These parallel calculations can be done with the Streaming SIMD Extensions in the same time it took to do the one dot product on the Pentium II processor.
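The parallel layout can be sketched in C with SSE intrinsics as follows. This is an illustration under assumptions: the function name `dot_row_soa` and the separate `xs`, `ys`, `zs` arrays are hypothetical, and each array is assumed to hold one coordinate of four consecutive vertices.

```c
#include <xmmintrin.h>  /* SSE intrinsics */

/* Four dot products at once. xs, ys and zs each hold one coordinate of
   four vertices; m is one matrix row {m00, m01, m02, m03}; w is taken as 1. */
static void dot_row_soa(const float xs[4], const float ys[4], const float zs[4],
                        const float m[4], float out[4])
{
    __m128 x = _mm_loadu_ps(xs);
    __m128 y = _mm_loadu_ps(ys);
    __m128 z = _mm_loadu_ps(zs);

    __m128 acc = _mm_mul_ps(x, _mm_set1_ps(m[0]));            /* multiply 1 */
    acc = _mm_add_ps(acc, _mm_mul_ps(y, _mm_set1_ps(m[1])));  /* multiply 2, add 1 */
    acc = _mm_add_ps(acc, _mm_mul_ps(z, _mm_set1_ps(m[2])));  /* multiply 3, add 2 */
    acc = _mm_add_ps(acc, _mm_set1_ps(m[3]));                 /* add 3 (m03 * 1) */

    _mm_storeu_ps(out, acc);  /* out[i] is the dot product for vertex i */
}
```

No lane is wasted and no shuffle is needed, because every lane of every register belongs to a different vertex.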
How do you go about reordering the data? One method is to use the 64-bit move movhps dest, src (see Figure 8) and the shuffle shufps dest, src, mask (see Figure 9). The 64-bit move instructions transfer 64 bits, representing two single-precision operands, between memory and either the upper 64 bits (movhps) or the lower 64 bits (movlps) of an XMM register.
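As a small illustration of the 64-bit move, the sketch below uses the `_mm_storeh_pi` and `_mm_loadh_pi` intrinsics, which correspond to the store and load forms of movhps. The helper name `copy_upper_half` is hypothetical; it copies the upper two floats of one vector into the upper half of another by way of a two-float temporary in memory.

```c
#include <xmmintrin.h>  /* SSE intrinsics */

/* Replace the upper two floats of b with the upper two floats of a,
   leaving the lower two floats of b untouched. */
static void copy_upper_half(const float a[4], float b[4])
{
    float tmp[2];
    __m128 va = _mm_loadu_ps(a);
    __m128 vb = _mm_loadu_ps(b);

    _mm_storeh_pi((__m64 *)tmp, va);             /* movhps mem, xmm: upper 64 bits to memory */
    vb = _mm_loadh_pi(vb, (const __m64 *)tmp);   /* movhps xmm, mem: memory to upper 64 bits */

    _mm_storeu_ps(b, vb);
}
```

With a = {1, 2, 3, 4} and b = {5, 6, 7, 8}, b ends up as {5, 6, 3, 4}.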
The shuffle can be used to rotate, shift, swap and broadcast data between two registers or within one register (if both src and dest are the same), under the control of a mask. The mask contains eight bits, two for each data element in the dest. Bits 0 and 1 of the immediate field select which of the four input numbers will be used as the first number of the result; bits 2 and 3 select which of the four input numbers will be used as the second number, and so on. Note that the first two numbers of the result are always selected from the dest operand and the last two from the src operand.
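The mask encoding is easier to follow as a scalar model. The function below is a plain C emulation written for illustration, not the hardware definition; the name `shufps_model` is hypothetical, and element 0 is the lowest element of the register.

```c
/* Scalar model of "shufps dest, src, mask".
   The two low result elements are picked from dest, the two high ones
   from src. dest and src may point to the same array. */
static void shufps_model(float dest[4], const float src[4], unsigned mask)
{
    float r[4];
    r[0] = dest[(mask >> 0) & 3];  /* bits 0-1 select from dest */
    r[1] = dest[(mask >> 2) & 3];  /* bits 2-3 select from dest */
    r[2] = src[(mask >> 4) & 3];   /* bits 4-5 select from src  */
    r[3] = src[(mask >> 6) & 3];   /* bits 6-7 select from src  */
    for (int i = 0; i < 4; ++i) {
        dest[i] = r[i];            /* write back only after all reads */
    }
}
```

For example, with src and dest the same, a mask of 0x00 broadcasts element 0 to all four positions and 0x1B reverses the element order. With two different registers, 0x44 gathers the low two elements of each.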
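Now we’ll show an example of how these instructions can be used to reorganize vertex data. (The "-" symbol in the comments below denotes a "don’t care". Register contents are written from the highest element to the lowest.)

    // Where xmm7 = -z0y0x0; xmm2 = -z1y1x1;
    // (xmm4 = -z2y2x2 and xmm3 = -z3y3x3 are inferred from the shuffles below)
    movhps temp1, xmm7          // temp1 = -z0
    movhps temp2, xmm4          // temp2 = -z2
    shufps xmm7, xmm2, 0x44     // xmm7 = y1x1y0x0
    shufps xmm4, xmm3, 0x44     // xmm4 = y3x3y2x2
    movaps xmm5, xmm7           // xmm5 = y1x1y0x0
    shufps xmm7, xmm4, 0xDD     // xmm7 = y3y2y1y0
    shufps xmm5, xmm4, 0x88     // xmm5 = x3x2x1x0
    movhps xmm6, temp1          // xmm6 = -z0--
    shufps xmm6, xmm2, 0x22     // xmm6 = -z1-z0
    movhps xmm2, temp2          // xmm2 = -z2y1x1
    shufps xmm2, xmm3, 0x22     // xmm2 = -z3-z2
    shufps xmm6, xmm2, 0x88     // xmm6 = z3z2z1z0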
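The same reordering can be sketched in C with SSE intrinsics. This is not an instruction-for-instruction translation of the assembly above: because `_mm_shuffle_ps` returns a new value instead of overwriting its first operand, the temporaries and 64-bit moves are not needed, and the z row is gathered with a 0xEE mask instead. The names `vertex4` and `aos_to_soa` are hypothetical.

```c
#include <xmmintrin.h>  /* SSE intrinsics */

/* Hypothetical type mirroring the article's structure. */
typedef struct { float x, y, z, w; } vertex4;

/* Reorder four x,y,z,w vertices into separate rows of x, y and z values. */
static void aos_to_soa(const vertex4 v[4], float xs[4], float ys[4], float zs[4])
{
    /* Element order in the comments is lowest element first. */
    __m128 v0 = _mm_loadu_ps((const float *)&v[0]);   /* x0 y0 z0 w0 */
    __m128 v1 = _mm_loadu_ps((const float *)&v[1]);   /* x1 y1 z1 w1 */
    __m128 v2 = _mm_loadu_ps((const float *)&v[2]);   /* x2 y2 z2 w2 */
    __m128 v3 = _mm_loadu_ps((const float *)&v[3]);   /* x3 y3 z3 w3 */

    __m128 xy01 = _mm_shuffle_ps(v0, v1, 0x44);       /* x0 y0 x1 y1 */
    __m128 xy23 = _mm_shuffle_ps(v2, v3, 0x44);       /* x2 y2 x3 y3 */
    __m128 zw01 = _mm_shuffle_ps(v0, v1, 0xEE);       /* z0 w0 z1 w1 */
    __m128 zw23 = _mm_shuffle_ps(v2, v3, 0xEE);       /* z2 w2 z3 w3 */

    _mm_storeu_ps(xs, _mm_shuffle_ps(xy01, xy23, 0x88));  /* x0 x1 x2 x3 */
    _mm_storeu_ps(ys, _mm_shuffle_ps(xy01, xy23, 0xDD));  /* y0 y1 y2 y3 */
    _mm_storeu_ps(zs, _mm_shuffle_ps(zw01, zw23, 0x88));  /* z0 z1 z2 z3 */
}
```

The resulting rows are in the parallel format that the four-at-a-time dot product needs.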