Introduction
Inside
most 3D applications there exists a vector library to perform routine
calculations such as vector arithmetic, logic, comparison, dot and cross
products, and so on. Although there are countless ways to go about designing
this type of library, developers often miss key factors to allow a vector
library to perform these calculations the fastest possible way.
Around
late 2004 I was assigned to develop such a vector library code-named VMath, standing for "Vector
Math." The primary goal of VMath
was not only to be the fastest but also easily portable across different
platforms.
To my surprise in 2009, the compiler technology has not
changed much. Indeed the results presented in this article resulted from my
research during that time, with some exceptions, are nearly the same that when
I was working on VMath five
years ago.
"Fastest" Library
Since
this article is mostly written in C++ and focused primarily in performance,
defining the fastest library can be misleading sometimes.
Therefore, the
fastest library as described in here, is the one that generates the smallest
assembly code compared to other libraries when compiling the same code using
the same settings (assuming release mode of course). This is because the
fastest library generates fewer instructions to perform the same exact
calculations. Or in other words, the fastest library is the one the bloats the
code the least.
SIMD Instructions
With the wide spread of single instruction multiple data
instructions (SIMD) around modern processors the task of developing a vector
library has become much easier. SIMD operations work on SIMD registers
precisely as FPU operations on FPU
registers. However, the advantage is that SIMD registers are usually 128-bit
wide forming the quad-word: four "floats" or "ints" of
32-bits each. This allows developers to
perform 4D vector calculations with a single instruction. Because of that, the
best feature a vector library can have is to take advantage of the SIMD
instructions in it.
Nonetheless,
when working with SIMD instructions you must watch out for common mistakes that
can cause the library to bloat the code. In fact the code bloat of a SIMD
vector library can be drastic to a point that it would have been better to
simply use FPU instructions.
Vector Library Interface
The
best way to talk to SIMD instructions when designing a high level interface of
a vector library is by the usage of intrinsics. They are available from most
compilers that target processors with SIMD instructions. Also, each instrisics
translates into a single SIMD instruction.
However the advantage of using intrinsics instead of writing assembly
directly is to allow the compiler to perform scheduling and expression
optimization. That can significantly minimize code bloat.
Examples
of instrinsics below:
Intel
& AMD:
vr =
_mm_add_ps(va, vb);
Cell
Processor (SPU):
vr =
spu_add(va, vb);
Altivec:
vr =
vec_add(va, vb);
1. Returning results by value
By
observing the intrisics interface a vector library must imitate that interface
to maximize performance. Therefore, you must return the results by value
and not by reference, as such:
//correct
inline Vec4 VAdd(Vec4 va, Vec4 vb)
{
return(_mm_add_ps(va, vb));
};
On the other hand if the data
is returned by reference the interface will generate code bloat. The incorrect
version below:
//incorrect (code bloat!)
inline void VAddSlow(Vec4& vr, Vec4 va, Vec4 vb)
{
vr = _mm_add_ps(va, vb);
};
The
reason you must return data by value is because the quad-word (128-bit) fits
nicely inside one SIMD register. And one of the key factors of a vector library
is to keep the data inside these registers as much as possible. By doing that,
you avoid unnecessary loads and stores operations from SIMD registers to memory
or FPU registers. When combining
multiple vector operations the "returned by value" interface allows
the compiler to optimize these loads and stores easily by minimizing SIMD to FPU or memory transfers.
2. Data Declared "Purely"
Here,
"pure data" is defined as data declared outside a "class"
or "struct" by a simple "typedef" or "define".
When I was researching various vector libraries before coding VMath, I observed one common pattern
among all libraries I looked at during that time. In all cases, developers
wrapped the basic quad-word type inside a "class" or "struct"
instead of declaring it purely, as follows:
class Vec4
{
...
private:
__m128 xyzw;
};
This
type of data encapsulation is a common practice among C++ developers to make
the architecture of the software robust. The data is protected and can be accessed
only by the class interface functions. Nonetheless, this design causes code
bloat by many different compilers in different platforms, especially if some
sort of GCC port is being used.
An
approach that is much friendlier to the compiler is to declare the vector data "purely",
as follows:
typedef __m128 Vec4;
Admittedly
a vector library designed that way will lose the nice encapsulation and
protection of its fundamental data. However the payoff is certainly noticeable.
Let's look at an example to clarify the problem.
We
can approximate the sine function by using Maclaurin (*) series as below:
(*) There are better and faster ways to approximate the sine
function in production code. The Maclaurin series is used here just for
illustrative purposes.
If a
developer codes a vector version of a sine function using the formula above the
code would look like more or less:
Vec4 VSin(const Vec4& x)
{
Vec4 c1 = VReplicate(-1.f/6.f);
Vec4 c2 = VReplicate(1.f/120.f);
Vec4 c3 = VReplicate(-1.f/5040.f);
Vec4 c4 = VReplicate(1.f/362880);
Vec4 c5 = VReplicate(-1.f/39916800);
Vec4 c6 = VReplicate(1.f/6227020800);
Vec4 c7 = VReplicate(-1.f/1307674368000);
Vec4 res = x +
c1*x*x*x +
c2*x*x*x*x*x +
c3*x*x*x*x*x*x*x +
c4*x*x*x*x*x*x*x*x*x +
c5*x*x*x*x*x*x*x*x*x*x*x +
c6*x*x*x*x*x*x*x*x*x*x*x*x*x +
c7*x*x*x*x*x*x*x*x*x*x*x*x*x*x*x;
return (res);
}
Now
let's look at the assembler of the same function compiled as Vec4 declared "purely"
(left column) and declared inside a class (right column). (Click here to download the
table document. Refer to Table 1.)
The
same exact code shrinks by a factor of approximate 15 percent simply by
changing how the fundamental data was declared. If this function was performing
some inner loop calculation, there would not only be savings in code size but
certainly would run faster.
|
Here is a little background that might illuminate the design of XNAMath for anyone who is interested:
"XNAMath" began life back in 2004 as "xboxmath" for the Xbox 360, and was originally developed by the Xbox 360 graphics team. The original implementation was a scalar version, which is retained today as the NO_INTRINSICS codepath, and then extended with a VMX128 optimized codepath. VMX128 is the Xbox 360 SIMD instruction set, a variant of PowerPC AltiVec. In late 2008/early 2009, we extended it again for SSE2 to provide optimized cross-platform support with Windows x86 and Windows x64 and "XNAMath" is the result of this work. Incidentally, this is why XNAMath's version number is 2.0 since it is "xboxmath 2.0" from the perspective of Xbox 360 developers.
Many of the original design features of "xboxmath" have been preserved in the current implementation to ensure that users can easily move math code between Windows and Xbox 360 and retain the SIMD performance benefits. The API is therefore designed to basically map to what the VMX128 instruction set can do. This is why, for example, the XNAMath library is missing some things you can do easily in SSE2 like more integer-based operations. This has resulted in some quirks, but for games and high-performance graphics applications this is generally not a major limitation. You can always drop into platform-specific code at any point and of course since XNAMath is completely inline all the source is available for reference.
One of the points you noted was that operator/ took a complex path that ended up doing a full divide followed by a mul trying to avoid the divide in the first place This is the result of two basic design decisions in the library:
(1) You'll note that there is no XMVectorDivide() operator. The assumption is that you will do a reciprocal and a multiply, which is what you have to do on VMX128 as it lacks a direct divide operator. The SSE version of XMVectorReciprocal() does do a divide by 1 because we want it to return a full precision result generally, but you could also choose to use XMVectorReciprocalEst() instead which avoids the divide. I've noted a bug to look at just adding an XMVectorDivide() function directly since that would give us an opportunity to optimize the SSE2 path for a case that was equivalent for VMX128, and then we can update operator/ to use XMVectorDivide() directly.
(2) In general, we don't focus on optimizing operator overloads. They are there for convenience, but in general we assume if you really care about performance you will avoid them in bottleneck functions. For high-performance code, you'd be better off use #define XM_NO_OPERATOR_OVERLOADS and always using the explicit functional path. Operator overloads certainly work, and are extremely convenient, so we left the use of them up to the user instead of trying to make a policy decision to completely disallow them. The majority of operator overloads are trivial calls. The operator / and operator /= are the only operators where we do more than one function, so adding XMVectorDivide() should address this concern.
Again it is great to see this level of deep technical information being shared with the community, and I’m happy to see XNAMath being a useful part of your projects. Thanks again and keep up the good work.
Didn't learn much about vector maths I did not know already, but did pick up some stuff I did not know about compiler optimisations.
So in that sense the article did exactly what it said on the box. Thanks.
"To my surprise in 2009, the compiler technology has not changed much" - ... made me laugh, sorry. I am not that surprised.
"The vector_math library (previously known as math3d++) is a C++ 3d math library" Link http://www.trenki.net/content/view/16/36/
Also reminded me of this from 2006: http://simdx86.sourceforge.net/
Compiler's are limited to the code generation formulae built-in, and though far better than ye olde compilers, are still in need of the human eyeball for some serious code squeezing. But for many programmers who would not delve deep into this area, Mr. Oliveira has provided information for understanding the inner workings of the compiler and how better to improve the output code performance. Excellent.
I'm a bit curious about the FPU implementation, because the most interesting result was that the Intel compiler made that test "almost competitive" with the SIMD approaches ("only" ~80% slower, compared to the ~200% slower with the Microsoft compiler). Was this using 3-float vectors or 4-float vectors? Based on the name of the test, I assume SSE optimizations were turned off, or are you considering SSE to be part of the FPU?
The reason I ask is because the reduced size of 3-float vectors tends to make a significant speed advantage with non-SIMD implementations. And with SSE optimizations turned on, you can (sometimes) get another small boost. While these changes would certainly not make up the performance difference, they might make it close enough that the memory advantage comes into play, not to mention avoiding the headaches of coding with alignment and intrinsics.
Thanks for any info!
XMFINLINE XMVECTOR XMVectorDivide(FXMVECTOR V1, FXMVECTOR V2)
{
#if defined(_XM_NO_INTRINSICS_)
XMVECTOR Result;
UINT i;
// Avoid C4701
Result.vector4_f32[0] = 0.0f;
for (i = 0; i < 4; i++)
{
if (XMISINF(V2.vector4_f32[i]))
{
Result.vector4_f32[i] = (V2.vector4_f32[i] < 0.0f) ? -0.0f : 0.0f;
}
else if (V2.vector4_f32[i] == -0.0f || V2.vector4_f32[i] == 0.0f)
{
Result.vector4_u32[i] = 0x7F800000;
}
else
{
Result.vector4_f32[i] = V1.vector4_f32[i] / V2.vector4_f32[i];
}
}
return Result;
#elif defined(_XM_SSE_INTRINSICS_)
return _mm_div_ps( V1, V2 );
#else // _XM_VMX128_INTRINSICS_
// This case for Xbox 360 is equivalent to:
// XMVECTOR R;
//
// R = XMVectorReciprocal(V2);
// return XMVectorMultiply(V1,R);
#endif // _XM_VMX128_INTRINSICS_
}
Then update the following to call it...
XMFINLINE XMVECTOR& operator/=(XMVECTOR& V1, FXMVECTOR V2)
{
V1 = XMVectorDivide(V1,V2);
return V1;
}
XMFINLINE XMVECTOR operator/(FXMVECTOR V1, FXMVECTOR V2)
{
return XMVectorDivide(V1,V2);
}
Of course it is actually faster to call XMVectorReciprocalEst() followed by XMVectorMultiply() if you are willing to tolerate the loss of precision.
These are all great points for trying to get generic C++ code to hopefully compile as best as possible.
While I think compilers can take you most of the way in some cases (especially Intel's latest and greatest), I still feel it is important that for critical yet non-trivial functions you optimize in architecture specific assembly.
It's a giant pain, but if you need to crank out some math, it's the only way to be sure you are as optimized as possible.
Are you absolutely sure about this little rule? I'm pretty sure that even the SPU compiler knows when it can treat a reference to a class that only contains a 128-bit data type, as something it can pass to the inlined operator in the register.
I did this with compound addition statements, and found that there were no issues at all with passing a reference to a class (it had to be a reference because of alignment declaration rules). In the code, they were all passed around via the registers.
The only time I found that any data had to be strored/loaded was when I was doing explicitly doing this for myself.
Have you checked the dissasembly on your target platforms with this at all?
The folks working on GCC's auto-vectorization might be surprised to hear that.
http://gcc.gnu.org/projects/tree-ssa/vectorization.html
It appears to support AltiVec and SSE/SSE2. I would be a great addition to the article to see how well the auto-vectorize feature of GCC compares to vector-aware libraries like XNA Math and VMath.
And perhaps a newbie question -- why disable SSE2? "The SSE2 optimizations are all off."
It motivated me to write article about using STL vector for XNAMath types on x86 platform.
The article targets Windows developers who are new to SIMD.
You can read it at:
http://www.qsoftz.com/mirza/?p=59
After studying your code I believe i see the root cause to your troubles with longer expressions.
Your vclass.h Vec4 does not implement the implicit constructor Vec4::Vec4(Vec4 &), so this function will have to be generated by the c++ compiler according to the C++ spec. This default is a simply a copy which might explain why the assembly in table 2 is spilling registers to memory.
Two simple solutions.
1) Dont use the copy constructor within the operator implementations. i.e. simply change:
inline const Vec4 operator+(const Vec4 &rhs) const
{
Vec4 res(*this);
res.xyzw = _mm_add_ps(res.xyzw, rhs.xyzw);
return(res);
};
To:
inline const Vec4 operator+(const Vec4 &rhs) const
{
Vec4 res(); // dont initialize and use copy constructor here
res.xyzw = _mm_add_ps(xyzw, rhs.xyzw); // res only on lhs
return(res);
};
and do the same for the other operators
Alternatively just add:
inline Vec4(Vec4 &rhs) { xyzw=rhs.xyzw;}
Then retest and hopefully the vclass will now be the same as the vmath versions.
Let us know if this fixes your problems with C++ simd usage.
thanks
stan melax
Excellent work by the way Gustavo in providing a demo in which it was very simple to replicate your results. Thanks for the article.
http://www.guitarv.com/download/articles/article-vmath-vclass-cpy-ctr.zip
Bear also in mind that when I was doing these tests I was not only looking at Windows projects. This article looks like a PC-only, but it really is not. The same type of architecture was cross-compiled in different platforms producing similar results. The ideal test is if someone (unfortunately, I don't have access to PS2, PSP or PS3 compilers anymore) is to try to port this project to the various platforms with intrisics to see what are the final results.
All that said, I am still endosing the XNAMath architecure which is quite similar to what I had used in VMath many years ago.
thank for the great comments.
Unfortunately, you cannot always work at the level of the native type, because for example it's not possible to distinguish the various possible integer types from the type __m128i (unlike in altivec, where they all have a different type), so it needs to be wrapped in a type (preferably a POD, compilers generate better code for those -- that means no protected or private).
Operator overloading is also a must; with my library we even recognize patterns such as a * b + c and map them to vec_madd if such instructions are available.
A problem is that MSVC refuses to pass a structure that contains a __m128 by value, even though it does allow to pass a __m128 by value. As a result, we have to pass by reference, which may cause some compilers to incorrectly go back to memory and not keep everything in registers.
Basically, if everything gets inlined then there is no problem, but as soon as things do not, we run into problems which are purely due to calling conventions.
Those problems do not have easy solutions however, unless you can hook into the compiler to add new calling conventions that do the right thing (fastcall is not enough).
Thankfully, some compilers, like clang, can do a very good job at converting pass-by-reference to pass-by-value or optimizing calling conventions. For most platforms, inlining is also not a problem, and it solves those issues.
I hate to chip in long after the fact here, but for anyone else reading this, I believe Philip is correct - at least for the case of the MSVC compiler. I had to test this out as it went against what I've seen elsewhere.
I ran your code contained in http://www.guitarv.com/download/articles/article-vmath-vclass-cpy-ctr.zip and it seems all the calculations are done in the same number of calls. The extra instructions you are seeing are just caused by writing back into the history buffer after all of the actual operations have completed:
// Shuffle history buffer
es->sdm3 = es->sdm2;
es->sdm2 = es->sdm1;
es->sdm1 = sample;
As there is no Vec4::'=' operator, the compiler generates a general purpose copy, which involves copying out of the SSE registers to main memory (many operations and also slow in itself).
Providing the required operator generates code with exactly the same number of instructions. I added the following code:
inline Vec4& operator=(const Vec4 &rhs)
{
xyzw = rhs.xyzw;
return *this;
}
Adrian