Introduction
Inside
most 3D applications there exists a vector library to perform routine
calculations such as vector arithmetic, logic, comparison, dot and cross
products, and so on. Although there are countless ways to go about designing
this type of library, developers often miss key factors to allow a vector
library to perform these calculations the fastest possible way.
Around
late 2004 I was assigned to develop such a vector library code-named VMath, standing for "Vector
Math." The primary goal of VMath
was not only to be the fastest but also easily portable across different
platforms.
To my surprise in 2009, the compiler technology has not
changed much. Indeed the results presented in this article resulted from my
research during that time, with some exceptions, are nearly the same that when
I was working on VMath five
years ago.
"Fastest" Library
Since
this article is mostly written in C++ and focused primarily in performance,
defining the fastest library can be misleading sometimes.
Therefore, the
fastest library as described in here, is the one that generates the smallest
assembly code compared to other libraries when compiling the same code using
the same settings (assuming release mode of course). This is because the
fastest library generates fewer instructions to perform the same exact
calculations. Or in other words, the fastest library is the one the bloats the
code the least.
SIMD Instructions
With the wide spread of single instruction multiple data
instructions (SIMD) around modern processors the task of developing a vector
library has become much easier. SIMD operations work on SIMD registers
precisely as FPU operations on FPU
registers. However, the advantage is that SIMD registers are usually 128-bit
wide forming the quad-word: four "floats" or "ints" of
32-bits each. This allows developers to
perform 4D vector calculations with a single instruction. Because of that, the
best feature a vector library can have is to take advantage of the SIMD
instructions in it.
Nonetheless,
when working with SIMD instructions you must watch out for common mistakes that
can cause the library to bloat the code. In fact the code bloat of a SIMD
vector library can be drastic to a point that it would have been better to
simply use FPU instructions.
Vector Library Interface
The
best way to talk to SIMD instructions when designing a high level interface of
a vector library is by the usage of intrinsics. They are available from most
compilers that target processors with SIMD instructions. Also, each instrisics
translates into a single SIMD instruction.
However the advantage of using intrinsics instead of writing assembly
directly is to allow the compiler to perform scheduling and expression
optimization. That can significantly minimize code bloat.
Examples
of instrinsics below:
Intel
& AMD:
vr =
_mm_add_ps(va, vb);
Cell
Processor (SPU):
vr =
spu_add(va, vb);
Altivec:
vr =
vec_add(va, vb);
1. Returning results by value
By
observing the intrisics interface a vector library must imitate that interface
to maximize performance. Therefore, you must return the results by value
and not by reference, as such:
//correct
inline Vec4 VAdd(Vec4 va, Vec4 vb)
{
return(_mm_add_ps(va, vb));
};
On the other hand if the data
is returned by reference the interface will generate code bloat. The incorrect
version below:
//incorrect (code bloat!)
inline void VAddSlow(Vec4& vr, Vec4 va, Vec4 vb)
{
vr = _mm_add_ps(va, vb);
};
The
reason you must return data by value is because the quad-word (128-bit) fits
nicely inside one SIMD register. And one of the key factors of a vector library
is to keep the data inside these registers as much as possible. By doing that,
you avoid unnecessary loads and stores operations from SIMD registers to memory
or FPU registers. When combining
multiple vector operations the "returned by value" interface allows
the compiler to optimize these loads and stores easily by minimizing SIMD to FPU or memory transfers.
2. Data Declared "Purely"
Here,
"pure data" is defined as data declared outside a "class"
or "struct" by a simple "typedef" or "define".
When I was researching various vector libraries before coding VMath, I observed one common pattern
among all libraries I looked at during that time. In all cases, developers
wrapped the basic quad-word type inside a "class" or "struct"
instead of declaring it purely, as follows:
class Vec4
{
...
private:
__m128 xyzw;
};
This
type of data encapsulation is a common practice among C++ developers to make
the architecture of the software robust. The data is protected and can be accessed
only by the class interface functions. Nonetheless, this design causes code
bloat by many different compilers in different platforms, especially if some
sort of GCC port is being used.
An
approach that is much friendlier to the compiler is to declare the vector data "purely",
as follows:
typedef __m128 Vec4;
Admittedly
a vector library designed that way will lose the nice encapsulation and
protection of its fundamental data. However the payoff is certainly noticeable.
Let's look at an example to clarify the problem.
We
can approximate the sine function by using Maclaurin (*) series as below:
(*) There are better and faster ways to approximate the sine
function in production code. The Maclaurin series is used here just for
illustrative purposes.
If a
developer codes a vector version of a sine function using the formula above the
code would look like more or less:
Vec4 VSin(const Vec4& x)
{
Vec4 c1 = VReplicate(-1.f/6.f);
Vec4 c2 = VReplicate(1.f/120.f);
Vec4 c3 = VReplicate(-1.f/5040.f);
Vec4 c4 = VReplicate(1.f/362880);
Vec4 c5 = VReplicate(-1.f/39916800);
Vec4 c6 = VReplicate(1.f/6227020800);
Vec4 c7 = VReplicate(-1.f/1307674368000);
Vec4 res = x +
c1*x*x*x +
c2*x*x*x*x*x +
c3*x*x*x*x*x*x*x +
c4*x*x*x*x*x*x*x*x*x +
c5*x*x*x*x*x*x*x*x*x*x*x +
c6*x*x*x*x*x*x*x*x*x*x*x*x*x +
c7*x*x*x*x*x*x*x*x*x*x*x*x*x*x*x;
return (res);
}
Now
let's look at the assembler of the same function compiled as Vec4 declared "purely"
(left column) and declared inside a class (right column). (Click here to download the
table document. Refer to Table 1.)
The
same exact code shrinks by a factor of approximate 15 percent simply by
changing how the fundamental data was declared. If this function was performing
some inner loop calculation, there would not only be savings in code size but
certainly would run faster.