Designing Fast Cross-Platform SIMD Vector Libraries
January 20, 2010 Page 1 of 5
Inside most 3D applications there exists a vector library to perform routine calculations such as vector arithmetic, logic, comparison, dot and cross products, and so on. Although there are countless ways to go about designing this type of library, developers often miss key factors to allow a vector library to perform these calculations the fastest possible way.
Around late 2004 I was assigned to develop such a vector library code-named VMath, standing for "Vector Math." The primary goal of VMath was not only to be the fastest but also easily portable across different platforms.
To my surprise in 2009, the compiler technology has not changed much. Indeed the results presented in this article resulted from my research during that time, with some exceptions, are nearly the same that when I was working on VMath five years ago.
Since this article is mostly written in C++ and focused primarily in performance, defining the fastest library can be misleading sometimes.
Therefore, the fastest library as described in here, is the one that generates the smallest assembly code compared to other libraries when compiling the same code using the same settings (assuming release mode of course). This is because the fastest library generates fewer instructions to perform the same exact calculations. Or in other words, the fastest library is the one the bloats the code the least.
With the wide spread of single instruction multiple data instructions (SIMD) around modern processors the task of developing a vector library has become much easier. SIMD operations work on SIMD registers precisely as FPU operations on FPU registers. However, the advantage is that SIMD registers are usually 128-bit wide forming the quad-word: four "floats" or "ints" of 32-bits each. This allows developers to perform 4D vector calculations with a single instruction. Because of that, the best feature a vector library can have is to take advantage of the SIMD instructions in it.
Nonetheless, when working with SIMD instructions you must watch out for common mistakes that can cause the library to bloat the code. In fact the code bloat of a SIMD vector library can be drastic to a point that it would have been better to simply use FPU instructions.
Vector Library Interface
The best way to talk to SIMD instructions when designing a high level interface of a vector library is by the usage of intrinsics. They are available from most compilers that target processors with SIMD instructions. Also, each instrisics translates into a single SIMD instruction. However the advantage of using intrinsics instead of writing assembly directly is to allow the compiler to perform scheduling and expression optimization. That can significantly minimize code bloat.
Examples of instrinsics below:
Intel & AMD:
vr = _mm_add_ps(va, vb);
Cell Processor (SPU):
vr = spu_add(va, vb);
vr = vec_add(va, vb);
1. Returning results by value
By observing the intrisics interface a vector library must imitate that interface to maximize performance. Therefore, you must return the results by value and not by reference, as such:
inline Vec4 VAdd(Vec4 va, Vec4 vb)
On the other hand if the data is returned by reference the interface will generate code bloat. The incorrect version below:
//incorrect (code bloat!)
inline void VAddSlow(Vec4& vr, Vec4 va, Vec4 vb)
vr = _mm_add_ps(va, vb);
The reason you must return data by value is because the quad-word (128-bit) fits nicely inside one SIMD register. And one of the key factors of a vector library is to keep the data inside these registers as much as possible. By doing that, you avoid unnecessary loads and stores operations from SIMD registers to memory or FPU registers. When combining multiple vector operations the "returned by value" interface allows the compiler to optimize these loads and stores easily by minimizing SIMD to FPU or memory transfers.
2. Data Declared "Purely"
Here, "pure data" is defined as data declared outside a "class" or "struct" by a simple "typedef" or "define". When I was researching various vector libraries before coding VMath, I observed one common pattern among all libraries I looked at during that time. In all cases, developers wrapped the basic quad-word type inside a "class" or "struct" instead of declaring it purely, as follows:
This type of data encapsulation is a common practice among C++ developers to make the architecture of the software robust. The data is protected and can be accessed only by the class interface functions. Nonetheless, this design causes code bloat by many different compilers in different platforms, especially if some sort of GCC port is being used.
An approach that is much friendlier to the compiler is to declare the vector data "purely", as follows:
typedef __m128 Vec4;
Admittedly a vector library designed that way will lose the nice encapsulation and protection of its fundamental data. However the payoff is certainly noticeable. Let's look at an example to clarify the problem.
We can approximate the sine function by using Maclaurin (*) series as below:
(*) There are better and faster ways to approximate the sine function in production code. The Maclaurin series is used here just for illustrative purposes.
If a developer codes a vector version of a sine function using the formula above the code would look like more or less:
Vec4 VSin(const Vec4& x)
Vec4 c1 = VReplicate(-1.f/6.f);
Vec4 c2 = VReplicate(1.f/120.f);
Vec4 c3 = VReplicate(-1.f/5040.f);
Vec4 c4 = VReplicate(1.f/362880);
Vec4 c5 = VReplicate(-1.f/39916800);
Vec4 c6 = VReplicate(1.f/6227020800);
Vec4 c7 = VReplicate(-1.f/1307674368000);
Vec4 res = x +
Now let's look at the assembler of the same function compiled as Vec4 declared "purely" (left column) and declared inside a class (right column). (Click here to download the table document. Refer to Table 1.)
The same exact code shrinks by a factor of approximate 15 percent simply by changing how the fundamental data was declared. If this function was performing some inner loop calculation, there would not only be savings in code size but certainly would run faster.
Page 1 of 5