Designing Fast Cross-Platform SIMD Vector Libraries


January 20, 2010 (Page 3 of 5)

Good Programming Practices For a SIMD Vector Library

Even if a well-designed vector library is available, and especially when working with SIMD instructions, you must pay attention to how you use it. Relying completely on the compiler to generate the best code is not a good idea.

1. Writing Expressions Friendly to The Compiler

How you write your code can significantly affect the final assembled code. Even with robust compilers, the difference will not go unnoticed. Going back to the sine example, if that function had been written using smaller expressions, the final code would have been much more efficient. Let's look at a new version of the same code:

Vec4 VSin2(const Vec4& x)
{
    // Taylor series coefficients for sin(x): -1/3!, +1/5!, -1/7!, ..., -1/15!
    Vec4 c1 = VReplicate(-1.f/6.f);
    Vec4 c2 = VReplicate(1.f/120.f);
    Vec4 c3 = VReplicate(-1.f/5040.f);
    Vec4 c4 = VReplicate(1.f/362880.f);
    Vec4 c5 = VReplicate(-1.f/39916800.f);
    Vec4 c6 = VReplicate(1.f/6227020800.f);
    Vec4 c7 = VReplicate(-1.f/1307674368000.f);

    // Accumulate the series one small expression at a time,
    // reusing the previous power of x at each step.
    Vec4 tmp0 = x;
    Vec4 x3 = x*x*x;
    Vec4 tmp1 = c1*x3;
    Vec4 res = tmp0 + tmp1;

    Vec4 x5 = x3*x*x;
    tmp0 = c2*x5;
    res = res + tmp0;

    Vec4 x7 = x5*x*x;
    tmp0 = c3*x7;
    res = res + tmp0;

    Vec4 x9 = x7*x*x;
    tmp0 = c4*x9;
    res = res + tmp0;

    Vec4 x11 = x9*x*x;
    tmp0 = c5*x11;
    res = res + tmp0;

    Vec4 x13 = x11*x*x;
    tmp0 = c6*x13;
    res = res + tmp0;

    Vec4 x15 = x13*x*x;
    tmp0 = c7*x15;
    res = res + tmp0;

    return res;
}

Now let's compare the results. (Refer to Table 3 in the downloadable table data.)

The code shrank by almost 40% simply by rewriting it in a more compiler-friendly way.
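For comparison, the less compiler-friendly style this replaces keeps the whole polynomial in one expression. The sketch below is only an illustration of that style (the actual original VSin appears earlier in this article), but it shows why the compiler has little room to schedule and reuse intermediate SIMD registers:

Vec4 VSin(const Vec4& x)
{
    // Illustrative single-expression version: every power of x is spelled
    // out inline, so the compiler must untangle one huge expression tree.
    return x - x*x*x*VReplicate(1.f/6.f)
             + x*x*x*x*x*VReplicate(1.f/120.f)
             - x*x*x*x*x*x*x*VReplicate(1.f/5040.f)
             + x*x*x*x*x*x*x*x*x*VReplicate(1.f/362880.f)
             - x*x*x*x*x*x*x*x*x*x*x*VReplicate(1.f/39916800.f)
             + x*x*x*x*x*x*x*x*x*x*x*x*x*VReplicate(1.f/6227020800.f)
             - x*x*x*x*x*x*x*x*x*x*x*x*x*x*x*VReplicate(1.f/1307674368000.f);
}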

2. Keep Results in SIMD Registers

As stated before, casting operations between SIMD registers and FPU registers can be expensive. A classic example of where this happens is the "dot product", which produces one scalar from two vectors, as such:

Dot (Va, Vb) = (Va.x * Vb.x) + (Va.y * Vb.y) + (Va.z * Vb.z) + (Va.w * Vb.w);

Now let's take a look at this code snippet that uses a dot product:

Vec4& x2 = m_x[i2];
Vec4 delta = x2 - x1;

float deltalength = Sqrt(Dot(delta, delta));
float diff = (deltalength - restlength) / deltalength;

x1 += delta*half*diff;
x2 -= delta*half*diff;

By inspecting the code above, "deltalength" is the distance between vector "x1" and vector "x2", so the result of the "Dot" function is a scalar. This scalar is then used and modified throughout the rest of the code to scale vectors "x1" and "x2". Clearly there are lots of casting operations going on from vector to scalar and vice-versa. This is expensive, since the compiler needs to generate code that moves data to and from the SIMD and FPU registers.
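To make the cost concrete, here is a sketch of what a scalar-returning Dot might look like on x86/SSE, assuming Vec4 wraps an __m128 in a member called v (an assumption made for this sketch; the actual library implementation may differ):

#include <xmmintrin.h>

// Hypothetical scalar-returning dot product. The sum is computed in SIMD
// registers, but the final _mm_store_ss forces the value out of the SIMD
// register file so the caller can treat it as a plain float.
inline float Dot(const Vec4& a, const Vec4& b)
{
    __m128 m  = _mm_mul_ps(a.v, b.v);                       // x*x, y*y, z*z, w*w
    __m128 s1 = _mm_add_ps(m, _mm_movehl_ps(m, m));         // x+z, y+w in the low lanes
    __m128 s2 = _mm_add_ss(s1, _mm_shuffle_ps(s1, s1, 1));  // x+y+z+w in lane 0
    float result;
    _mm_store_ss(&result, s2);                              // leave the SIMD registers
    return result;
}

Every use of the returned float then lives in the scalar/FPU world, and mixing it back into Vec4 math forces the reverse trip.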

However, if we assume the "Dot" function above replicates its result across the four components of the SIMD quad-word (with the "w" component zeroed out), there is really no difference in re-writing the code as follows:

Vec4& x2 = m_x[i2];
Vec4 delta = x2 - x1;

Vec4 deltalength = Sqrt(Dot(delta, delta));
Vec4 diff = (deltalength - restlength) / deltalength;

x1 += delta*half*diff;
x2 -= delta*half*diff;

Because "deltalength" now has the same result replicated into the quad-word, the expensive casting operations are no longer necessary.

3. Re-Arrange Data to Be Friendly to SIMD Operations

Whenever possible, re-arrange your data so that you can take advantage of a vector library. For example, when working with audio you can store your data stream in a SIMD-friendly layout.

Let's say you have four streams of audio samples stored in four different arrays as such:

Bass audio samples array: B0, B1, B2, B3, B4, B5, B6, B7, etc...

Drums audio samples array: D0, D1, D2, D3, D4, D5, D6, D7, etc...

Guitar audio samples array: G0, G1, G2, G3, G4, G5, G6, G7, etc...

Trumpet audio samples array: T0, T1, T2, T3, T4, T5, T6, T7, etc...

But you can also interleave these audio channels and store the same data as:

Full band audio samples array: B0, D0, G0, T0, B1, D1, G1, T1, etc...

The advantage is that now you can load your array directly into the SIMD registers and perform calculations on all samples simultaneously. The disadvantage is that if you need to pass that data to another system that requires four separate arrays, you will have to re-organize the data back into its original form, which can be expensive and memory-intensive if the vector library is not performing heavy calculations on the data.
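As a sketch of why the interleaved layout helps, applying a per-channel gain (or any per-sample math) becomes one Vec4 operation per frame of four samples. The function and buffer names below are made up for this illustration:

// Hypothetical example: each Vec4 frame holds one sample from each channel
// in the interleaved order (B, D, G, T), so one load, one multiply, and one
// store process all four channels at once.
void ApplyGains(Vec4* frames, int numFrames, const Vec4& gains)
{
    for (int i = 0; i < numFrames; ++i)
    {
        frames[i] = frames[i] * gains;  // scales the B, D, G, and T samples together
    }
}

With four separate arrays, any operation that combines the channels sample-by-sample (mixing them down, for example) would need extra shuffling to gather one sample from each array into a SIMD register.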

XNA Math Library -- Can It Be Faster?

Microsoft ships DirectX with XNA Math, which runs on Windows and Xbox 360. XNA Math is a vector library that has an FPU vector interface (for backwards compatibility) and a SIMD interface.

Although back in late 2004 I did not have access to the XNA Math library (if it even existed at the time), I was happy to discover now that XNA Math uses precisely the same interface I had used in VMath five years ago. The only difference was the target platform, which in my case was Windows, PS3, PSP, and PS2.

XNA Math is designed around the key features of a SIMD vector library described here: it returns results by value, declares the vector data as the pure SIMD type, supports both overloaded operators and procedural calls, inlines all vector functions, provides data accessors, and so on.
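In code, an interface built on those principles looks roughly like the sketch below. The type and function names are illustrative only; they are not the actual XNA Math declarations, and they assume the vector type is a plain typedef of the SSE register rather than the wrapper struct used in the earlier sketches:

#include <xmmintrin.h>

// Illustrative interface in the XNA Math / VMath style: the vector type is
// the raw SIMD register, arguments and results are passed by value, and
// every function is inlined.
typedef __m128 VecReg;

inline VecReg VRepl(float s)           { return _mm_set_ps1(s); }   // splat a scalar
inline VecReg VAdd(VecReg a, VecReg b) { return _mm_add_ps(a, b); } // procedural call
inline VecReg VMul(VecReg a, VecReg b) { return _mm_mul_ps(a, b); }
inline float  VGetX(VecReg a)          { return _mm_cvtss_f32(a); } // data accessor

// Overloaded operators (operator+, operator*, ...) would be thin wrappers
// around the same intrinsics, added where the compiler allows overloading
// on the register type.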

When two independent developers come up with the same results, it is highly likely that the result is close to "as fast as it can be". So if you are faced with the task of coding a cross-platform SIMD vector library, your best bet is to follow in the steps of XNA Math or VMath.

With that in mind, you may ask, can XNA Math be faster? As far as interfacing with the SIMD instructions, I don't believe so. However, the beauty of having the fastest interface is that what's left is purely the implementation of the functions.

If one can come up with a faster version of a function with the same interface, it is just a matter of plugging in the new code and you are done. But if you are working on a vector library that does not provide a good interface with the SIMD instructions, coming up with a better implementation may still leave you behind.

