Gamasutra: The Art & Business of Making Games
Designing Fast Cross-Platform SIMD Vector Libraries

January 20, 2010


Inside most 3D applications there exists a vector library to perform routine calculations such as vector arithmetic, logic, comparisons, dot and cross products, and so on. Although there are countless ways to design this type of library, developers often miss the key factors that allow a vector library to perform these calculations in the fastest possible way.

Around late 2004 I was assigned to develop such a vector library code-named VMath, standing for "Vector Math." The primary goal of VMath was not only to be the fastest but also easily portable across different platforms.

To my surprise in 2009, the compiler technology has not changed much. Indeed, the results presented in this article come from my research during that time and, with some exceptions, are nearly the same as when I was working on VMath five years ago.

"Fastest" Library

Since this article is mostly written in C++ and focused primarily on performance, defining the "fastest" library can sometimes be misleading.

Therefore, the fastest library, as described here, is the one that generates the smallest assembly code compared to other libraries when compiling the same code with the same settings (assuming release mode, of course). This is because the fastest library generates fewer instructions to perform the same exact calculations. In other words, the fastest library is the one that bloats the code the least.

SIMD Instructions

With the widespread availability of single instruction, multiple data (SIMD) instructions on modern processors, the task of developing a vector library has become much easier. SIMD operations work on SIMD registers precisely as FPU operations work on FPU registers. The advantage, however, is that SIMD registers are usually 128 bits wide, forming a quad-word: four "floats" or "ints" of 32 bits each. This allows developers to perform 4D vector calculations with a single instruction. Because of that, the best feature a vector library can have is to take advantage of SIMD instructions.
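To make the four-in-one idea concrete, here is a minimal sketch (the helper names are mine, not from any particular library) contrasting a scalar 4D add with the single-instruction SSE version:

```cpp
#include <immintrin.h> // SSE intrinsics (x86/x64)

// Scalar version: four separate floating-point additions.
void AddScalar(const float a[4], const float b[4], float r[4])
{
    for (int i = 0; i < 4; ++i)
        r[i] = a[i] + b[i];
}

// SIMD version: one ADDPS instruction adds all four lanes at once.
void AddSimd(const float a[4], const float b[4], float r[4])
{
    __m128 va = _mm_loadu_ps(a); // load 4 floats into a 128-bit register
    __m128 vb = _mm_loadu_ps(b);
    _mm_storeu_ps(r, _mm_add_ps(va, vb)); // single SIMD add, then store
}
```

Both functions compute the same result; the difference is how many instructions the hot path costs.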

Nonetheless, when working with SIMD instructions you must watch out for common mistakes that can cause the library to bloat the code. In fact, the code bloat of a SIMD vector library can be so drastic that it would have been better to simply use FPU instructions.

Vector Library Interface

The best way to talk to SIMD instructions when designing the high-level interface of a vector library is through intrinsics. They are available in most compilers that target processors with SIMD instructions, and each intrinsic translates into a single SIMD instruction. The advantage of using intrinsics instead of writing assembly directly is that the compiler can still perform scheduling and expression optimization, which can significantly minimize code bloat.

Examples of intrinsics below:

Intel & AMD:

vr = _mm_add_ps(va, vb);

Cell Processor (SPU):

vr = spu_add(va, vb);


PowerPC (AltiVec):

vr = vec_add(va, vb);

1. Returning results by value

Observing the intrinsics interface, a vector library must imitate that interface to maximize performance. Therefore, you must return results by value and not by reference, as such:

inline Vec4 VAdd(Vec4 va, Vec4 vb)
{
    return(_mm_add_ps(va, vb));
}

On the other hand if the data is returned by reference the interface will generate code bloat. The incorrect version below:

//incorrect (code bloat!)
inline void VAddSlow(Vec4& vr, Vec4 va, Vec4 vb)
{
    vr = _mm_add_ps(va, vb);
}

The reason you must return data by value is that the quad-word (128 bits) fits nicely inside one SIMD register, and one of the key factors of a vector library is to keep the data inside these registers as much as possible. By doing that, you avoid unnecessary load and store operations between SIMD registers and memory or FPU registers. When combining multiple vector operations, the "return by value" interface lets the compiler optimize these loads and stores easily by minimizing SIMD-to-FPU and SIMD-to-memory transfers.
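As a sketch of why this matters (VAdd/VMul are illustrative names in the spirit of the article, not a specific library's API), chained by-value calls inline into back-to-back SIMD instructions with no intermediate stores:

```cpp
#include <immintrin.h>

typedef __m128 Vec4;

// Return-by-value wrappers: arguments and results stay in XMM registers.
inline Vec4 VAdd(Vec4 va, Vec4 vb) { return _mm_add_ps(va, vb); }
inline Vec4 VMul(Vec4 va, Vec4 vb) { return _mm_mul_ps(va, vb); }

// a*b + c: after inlining, this compiles to just MULPS followed by ADDPS;
// no intermediate value round-trips through memory or the FPU.
inline Vec4 VMulAdd(Vec4 va, Vec4 vb, Vec4 vc)
{
    return VAdd(VMul(va, vb), vc);
}
```

Had the wrappers returned through a reference parameter, each step would force the compiler to assume the result might alias memory.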

2. Data Declared "Purely"

Here, "pure data" is defined as data declared outside a "class" or "struct" by a simple "typedef" or "define". When I was researching various vector libraries before coding VMath, I observed one common pattern among all libraries I looked at during that time. In all cases, developers wrapped the basic quad-word type inside a "class" or "struct" instead of declaring it purely, as follows:

class Vec4
{
    __m128 xyzw;
};

This type of data encapsulation is common practice among C++ developers to make the architecture of the software robust: the data is protected and can be accessed only through the class interface functions. Nonetheless, this design causes code bloat with many different compilers on different platforms, especially if some sort of GCC port is being used.

An approach that is much friendlier to the compiler is to declare the vector data "purely", as follows:

typedef __m128 Vec4;

Admittedly a vector library designed that way will lose the nice encapsulation and protection of its fundamental data. However the payoff is certainly noticeable. Let's look at an example to clarify the problem.
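A minimal "pure" interface along these lines might look as follows (a sketch; VReplicate matches the name used later in the article, the other names are illustrative):

```cpp
#include <immintrin.h>

// "Pure" declaration: Vec4 IS the native SIMD type, not a wrapper around it.
typedef __m128 Vec4;

// A small C-style interface over the pure type.
inline Vec4 VReplicate(float f)      { return _mm_set1_ps(f); } // (f,f,f,f)
inline Vec4 VLoad(const float* p)    { return _mm_loadu_ps(p); }
inline void VStore(float* p, Vec4 v) { _mm_storeu_ps(p, v); }
inline Vec4 VAdd(Vec4 a, Vec4 b)     { return _mm_add_ps(a, b); }
```

Since Vec4 is the register type itself, the compiler never has to reason about a wrapping class when allocating registers.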

We can approximate the sine function by using the Maclaurin (*) series, as below:

sin(x) ≈ x - x^3/3! + x^5/5! - x^7/7! + x^9/9! - x^11/11! + x^13/13! - x^15/15!

(*) There are better and faster ways to approximate the sine function in production code. The Maclaurin series is used here just for illustrative purposes.

If a developer codes a vector version of the sine function using the formula above, the code would look more or less like this:

Vec4 VSin(const Vec4& x)
{
    Vec4 c1 = VReplicate(-1.f/6.f);
    Vec4 c2 = VReplicate(1.f/120.f);
    Vec4 c3 = VReplicate(-1.f/5040.f);
    Vec4 c4 = VReplicate(1.f/362880.f);
    Vec4 c5 = VReplicate(-1.f/39916800.f);
    Vec4 c6 = VReplicate(1.f/6227020800.f);
    Vec4 c7 = VReplicate(-1.f/1307674368000.f);

    Vec4 res = x +
        c1*x*x*x +
        c2*x*x*x*x*x +
        c3*x*x*x*x*x*x*x +
        c4*x*x*x*x*x*x*x*x*x +
        c5*x*x*x*x*x*x*x*x*x*x*x +
        c6*x*x*x*x*x*x*x*x*x*x*x*x*x +
        c7*x*x*x*x*x*x*x*x*x*x*x*x*x*x*x;

    return (res);
}

Now let's look at the assembly of the same function compiled with Vec4 declared "purely" (left column) and declared inside a class (right column). (Click here to download the table document. Refer to Table 1.)

The same exact code shrinks by approximately 15 percent simply by changing how the fundamental data is declared. If this function were performing some inner-loop calculation, there would not only be savings in code size; it would certainly also run faster.



Chuck Walbourn
Great article! SIMD optimization should be used extensively in shipping games, but it remains somewhat of a black art to many developers.

Here is a little background that might illuminate the design of XNAMath for anyone who is interested:

"XNAMath" began life back in 2004 as "xboxmath" for the Xbox 360, and was originally developed by the Xbox 360 graphics team. The original implementation was a scalar version, which is retained today as the NO_INTRINSICS codepath, and then extended with a VMX128 optimized codepath. VMX128 is the Xbox 360 SIMD instruction set, a variant of PowerPC AltiVec. In late 2008/early 2009, we extended it again for SSE2 to provide optimized cross-platform support with Windows x86 and Windows x64 and "XNAMath" is the result of this work. Incidentally, this is why XNAMath's version number is 2.0 since it is "xboxmath 2.0" from the perspective of Xbox 360 developers.

Many of the original design features of "xboxmath" have been preserved in the current implementation to ensure that users can easily move math code between Windows and Xbox 360 and retain the SIMD performance benefits. The API is therefore designed to basically map to what the VMX128 instruction set can do. This is why, for example, the XNAMath library is missing some things you can do easily in SSE2 like more integer-based operations. This has resulted in some quirks, but for games and high-performance graphics applications this is generally not a major limitation. You can always drop into platform-specific code at any point and of course since XNAMath is completely inline all the source is available for reference.

One of the points you noted was that operator/ took a complex path that ended up doing a full divide followed by a mul while trying to avoid the divide in the first place. This is the result of two basic design decisions in the library:

(1) You'll note that there is no XMVectorDivide() operator. The assumption is that you will do a reciprocal and a multiply, which is what you have to do on VMX128 as it lacks a direct divide operator. The SSE version of XMVectorReciprocal() does do a divide by 1 because we want it to return a full precision result generally, but you could also choose to use XMVectorReciprocalEst() instead which avoids the divide. I've noted a bug to look at just adding an XMVectorDivide() function directly since that would give us an opportunity to optimize the SSE2 path for a case that was equivalent for VMX128, and then we can update operator/ to use XMVectorDivide() directly.

(2) In general, we don't focus on optimizing operator overloads. They are there for convenience, but in general we assume that if you really care about performance you will avoid them in bottleneck functions. For high-performance code, you'd be better off using #define XM_NO_OPERATOR_OVERLOADS and always using the explicit functional path. Operator overloads certainly work, and are extremely convenient, so we left the use of them up to the user instead of trying to make a policy decision to completely disallow them. The majority of operator overloads are trivial calls. The operator / and operator /= are the only operators where we do more than one function, so adding XMVectorDivide() should address this concern.

Again it is great to see this level of deep technical information being shared with the community, and I'm happy to see XNAMath being a useful part of your projects. Thanks again and keep up the good work.

Stephen Northcott
Great article..

Didn't learn much about vector maths I did not know already, but did pick up some stuff I did not know about compiler optimisations.

So in that sense the article did exactly what it said on the box. Thanks.

Alex Covic
Great article! I remember reading about VMath (I don't code video games).

"To my surprise in 2009, the compiler technology has not changed much" - ... made me laugh, sorry. I am not that surprised.

"The vector_math library (previously known as math3d++) is a C++ 3d math library"

Also reminded me of this from 2006:

Jeremy Alessi
Great stuff, thanks for sharing!

Scott Foulk
Awesome study Gustavo! As a fan of Michael Abrash's 1990 'Zen of Assembly Language, Volume 1' (no. 2 never appeared, since CPUs changed radically after book 1), I seriously enjoy this kind of article. I am a great fan of assembly language programming and most people think I'm nuts because of it. But I love the performance and efficiency that only a serious study of compiler-generated assembly can give, and Mr. Oliveira has done it.

Compilers are limited to their built-in code generation formulae, and though far better than ye olde compilers, they still need the human eyeball for some serious code squeezing. But for the many programmers who would not delve deep into this area, Mr. Oliveira has provided information for understanding the inner workings of the compiler and how to better improve the output code performance. Excellent.

Jani Kajala
BSD/MIT-licensed vector lib slmath @

Terence Groening
Good article. I agree with many of the points and learned a couple things.

I'm a bit curious about the FPU implementation, because the most interesting result was that the Intel compiler made that test "almost competitive" with the SIMD approaches ("only" ~80% slower, compared to the ~200% slower with the Microsoft compiler). Was this using 3-float vectors or 4-float vectors? Based on the name of the test, I assume SSE optimizations were turned off, or are you considering SSE to be part of the FPU?

The reason I ask is because the reduced size of 3-float vectors tends to make a significant speed advantage with non-SIMD implementations. And with SSE optimizations turned on, you can (sometimes) get another small boost. While these changes would certainly not make up the performance difference, they might make it close enough that the memory advantage comes into play, not to mention avoiding the headaches of coding with alignment and intrinsics.

Thanks for any info!

Chuck Walbourn
XMVectorDivide() missed the schedule for XNAMath v2.02 shipping in our upcoming DXSDK release. It should be in version v2.03 or later. Since it's all in the header, it is easy enough to update XNAMATHVECTOR.INL yourself in the meantime:



inline XMVECTOR XMVectorDivide(FXMVECTOR V1, FXMVECTOR V2)
{
#if defined(_XM_NO_INTRINSICS_)

    XMVECTOR Result;
    UINT i;

    // Avoid C4701
    Result.vector4_f32[0] = 0.0f;

    for (i = 0; i < 4; i++)
    {
        if (XMISINF(V2.vector4_f32[i]))
        {
            Result.vector4_f32[i] = (V2.vector4_f32[i] < 0.0f) ? -0.0f : 0.0f;
        }
        else if (V2.vector4_f32[i] == -0.0f || V2.vector4_f32[i] == 0.0f)
        {
            Result.vector4_u32[i] = 0x7F800000;
        }
        else
        {
            Result.vector4_f32[i] = V1.vector4_f32[i] / V2.vector4_f32[i];
        }
    }

    return Result;

#elif defined(_XM_SSE_INTRINSICS_)

    return _mm_div_ps( V1, V2 );

#else // _XM_VMX128_INTRINSICS_

    // This case for Xbox 360 is equivalent to:
    // R = XMVectorReciprocal(V2);
    // return XMVectorMultiply(V1,R);

#endif // _XM_VMX128_INTRINSICS_
}


Then update the following operators to call it. In operator/= :

    V1 = XMVectorDivide(V1,V2);
    return V1;

...and in operator/ :

    return XMVectorDivide(V1,V2);

Of course it is actually faster to call XMVectorReciprocalEst() followed by XMVectorMultiply() if you are willing to tolerate the loss of precision.

Chuck Walbourn
One more note, be sure to check out the Visual Studio 2010 compiler as there were some specific efforts around improving code-generation with intrinsics.

Gustavo Oliveira
Thank you all for reading and for the great comments.

Chuck, thanks for explaining the history and design choices of XNA Math. I had a feeling that the reciprocal was due to some platform compatibility issue. Nonetheless, I am glad I helped make XNA Math even faster. I have a feeling next time "Rocky" and "Apollo" get in the ring it won't be even :)

Scott, thanks, and like yourself I'd rather type asm, but unfortunately the golden days of asm {...} are getting less frequent and our lives more boring :)

Terence, the vectors are all 4D, since I was equalizing 4 streams of audio, and for the cloth w=0. The SSE2 optimizations are all off; I don't consider SIMD part of the FPU in this article. I can't really check the Intel version anymore since my license expired (but it should be the same). I asked Intel for a free copy in exchange for the article so I could run more tests, but they declined. It was cheap of me to ask, and they were even cheaper not to give it to me :) But certainly, if you can squeeze your data and you aren't performing heavy calculations on it, you may beat a SIMD library. The article, though, is more about the best SIMD design for different targets.

BTW, as I re-read the article this morning, it does feel like a PC article. I apologize for not pasting assembly code for the other targets, but I don't have access to a console anymore. If you test similar code on different platforms, though, the results should be equivalent.

Greg Crabtree
Excellent article Gustavo!

These are all great points for getting generic C++ code to compile as well as possible.

While I think compilers can take you most of the way in some cases (especially Intel's latest and greatest), I still feel it is important that for critical yet non-trivial functions you optimize in architecture specific assembly.

It's a giant pain, but if you need to crank out some math, it's the only way to be sure you are as optimized as possible.

Arshad Hussain
No doubt! It is awesome. Thank you so much Gustavo for contributing such a nice article.

James Steele
Gustavo, I have a little problem regarding the "Data Declared "Purely"" section. Ever since the PS2 days, I've had my SIMD classes encapsulated as a class with no issues. I wrote a quick and dirty test for Intel SSE under Visual Studio using a Vec3 class, and didn't find any issue with the generated assembler.

Are you absolutely sure about this little rule? I'm pretty sure that even the SPU compiler knows when it can treat a reference to a class that only contains a 128-bit data type as something it can pass to the inlined operator in a register.

I did this with compound addition statements, and found that there were no issues at all with passing a reference to a class (it had to be a reference because of alignment declaration rules). In the code, they were all passed around via the registers.

The only time I found that any data had to be stored/loaded was when I was explicitly doing this myself.

Have you checked the disassembly on your target platforms with this at all?

Terence Groening
Thanks for the reply Gustavo. Yeah, I didn't mean to steer the subject even more towards PC, but that's what I'm currently interested in.

Gustavo Oliveira
Thx Greg. James, I am very much in sync with Microsoft on that point. I did find the same code bloat from encapsulated data with PS2, PSP and PS3. I even had a discussion with the SN Systems compiler engineers and asked if anything could be done. At that time (5 years ago), they said it would be difficult to resolve since it had to do with aliasing. You can check out Table 1 of this article; it was compiled with cl.exe about 2 months ago. The problem is easier to see when you mix operators with constants too. Since you are stacking patterns of operators, the optimizer may be catching it. I believe your best bet is to try to port the sine function or the DSP equalizer (the cloth is more work) I showed to your library, then compare your assembled code to the XNA Math assembled code, for example. I can't really test on PS3 anymore, and things may have changed since. But in all cases, I'd be curious to see your data declaration. If you can trim your project down to a trivial test, you can send it over. I can't promise when I will look at it, but I am open to seeing if there is any trick for how to go about this problem in a generic way.

Gustavo Oliveira
thx Arshad.

Brent Schartung
"To my surprise in 2009, the compiler technology has not changed much."

The folks working on GCC's auto-vectorization might be surprised to hear that.

It appears to support AltiVec and SSE/SSE2. It would be a great addition to the article to see how well the auto-vectorize feature of GCC compares to vector-aware libraries like XNA Math and VMath.

And perhaps a newbie question -- why disable SSE2? "The SSE2 optimizations are all off."

Mirza Husadzic
Great article.

It motivated me to write an article about using STL vectors with XNAMath types on the x86 platform.

The article targets Windows developers who are new to SIMD.

You can read it at:

stan melax
Hmmm, my compiler and math lib don't have the problems you experienced with vclass::Vec4().

After studying your code, I believe I see the root cause of your troubles with longer expressions.

Your vclass.h Vec4 does not implement the copy constructor Vec4::Vec4(const Vec4 &), so this function will have to be generated by the C++ compiler according to the C++ spec. This default is simply a memberwise copy, which might explain why the assembly in Table 2 is spilling registers to memory.

Two simple solutions:

1) Don't use the copy constructor within the operator implementations. I.e., simply change:

inline const Vec4 operator+(const Vec4 &rhs) const
{
    Vec4 res(*this);
    res.xyzw = _mm_add_ps(res.xyzw, rhs.xyzw);
    return res;
}

to:

inline const Vec4 operator+(const Vec4 &rhs) const
{
    Vec4 res; // don't initialize via the copy constructor here
    res.xyzw = _mm_add_ps(xyzw, rhs.xyzw); // res only on the lhs
    return res;
}

and do the same for the other operators.

2) Alternatively, just add:

inline Vec4(const Vec4 &rhs) { xyzw = rhs.xyzw; }

Then retest and hopefully the vclass will now be the same as the vmath versions.

Let us know if this fixes your problems with C++ simd usage.


stan melax

Philip Dunstan
Great catch Stan. I downloaded Gustavo's demo and tested the vclass version. Testing with VS2008 it appears that the compiler generated copy constructor and assignment operator are not being automatically inlined. Simply providing inlined versions of these functions is enough to bring the performance of the vclass implementation in line with the vmath version.

Excellent work by the way Gustavo in providing a demo in which it was very simple to replicate your results. Thanks for the article.

Gustavo Oliveira
Hello Philip and Stan, sorry it took me nearly a year to respond. I made some tests with the suggested missing copy constructor, and for me it did not produce equal results. I may have been missing something, but just in case, I trimmed my project down to a simple solution with the results. You can download and test it (it's a self-contained Visual Studio solution). The suggestion did indeed improve the quality of the code generated by the overloaded operators, yet it did not beat, or become as lean as, the C-like interface. I posted the project on my web site, if you would like to take a look:

Bear in mind also that when I was doing these tests I was not only looking at Windows projects. This article looks PC-only, but it really is not: the same type of architecture was cross-compiled on different platforms, producing similar results. The ideal test (unfortunately, I don't have access to PS2, PSP or PS3 compilers anymore) would be for someone to port this project with intrinsics to the various platforms and see the final results.

All that said, I am still endorsing the XNAMath architecture, which is quite similar to what I used in VMath many years ago.

Thanks for the great comments.

Mathias Gaunard
This is a fairly old article, but let me say that I've designed a SIMD abstraction library, and I've run into those issues.

Unfortunately, you cannot always work at the level of the native type because, for example, it's not possible to distinguish the various possible integer types from the type __m128i (unlike in AltiVec, where they all have a different type), so it needs to be wrapped in a type (preferably a POD; compilers generate better code for those, which means no protected or private members).
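The wrapping Mathias describes can be sketched with thin POD structs (hypothetical type names, not from his library): the data member is still a bare __m128i, but the C++ type system can now tell the element widths apart for overload resolution:

```cpp
#include <emmintrin.h> // SSE2 integer intrinsics

// POD wrappers: no constructors, no private/protected members, so most
// compilers treat them like the raw register type in codegen.
struct VecI32x4 { __m128i v; }; // four 32-bit lanes
struct VecI16x8 { __m128i v; }; // eight 16-bit lanes

// Same __m128i payload, but overloading now picks the right instruction.
inline VecI32x4 VAddInt(VecI32x4 a, VecI32x4 b) { return { _mm_add_epi32(a.v, b.v) }; }
inline VecI16x8 VAddInt(VecI16x8 a, VecI16x8 b) { return { _mm_add_epi16(a.v, b.v) }; }
```

With a bare typedef, both types would collapse to __m128i and the overloads would be ambiguous.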

Operator overloading is also a must; with my library we even recognize patterns such as a * b + c and map them to vec_madd if such instructions are available.

A problem is that MSVC refuses to pass a structure that contains a __m128 by value, even though it does allow passing a __m128 itself by value. As a result, we have to pass by reference, which may cause some compilers to incorrectly go back to memory instead of keeping everything in registers.

Basically, if everything gets inlined then there is no problem, but as soon as things do not, we run into problems which are purely due to calling conventions.

Those problems do not have easy solutions however, unless you can hook into the compiler to add new calling conventions that do the right thing (fastcall is not enough).

Thankfully, some compilers, like clang, can do a very good job at converting pass-by-reference to pass-by-value or optimizing calling conventions. For most platforms, inlining is also not a problem, and it solves those issues.

adrian hirst
Good article. Thanks for sharing!

I hate to chip in long after the fact here, but for anyone else reading this, I believe Philip is correct, at least for the case of the MSVC compiler. I had to test this out as it went against what I've seen elsewhere.

I ran your code contained in and it seems all the calculations are done in the same number of calls. The extra instructions you are seeing are just caused by writing back into the history buffer after all of the actual operations have completed:

// Shuffle history buffer
es->sdm3 = es->sdm2;
es->sdm2 = es->sdm1;
es->sdm1 = sample;

As there is no Vec4 operator=, the compiler generates a general-purpose copy, which involves copying out of the SSE registers to main memory (many operations, and also slow in itself).

Providing the required operator generates code with exactly the same number of instructions. I added the following code:

inline Vec4& operator=(const Vec4 &rhs)
{
    xyzw = rhs.xyzw;
    return *this;
}