XNA Math vs. VMath: The Heavyweight Fight!
Assuming you are working on high-level code, being faster than XNA Math is quite difficult. The options left are to optimize its functions or to find flaws in a well-revised library, if any exist. But as stated earlier, even if you are working with XNA Math, not knowing how to use it to its full potential can bloat the code and leave you behind.
To attempt such a challenge I searched for algorithms that could make the difference noticeable when compiled with VMath or XNA Math. I knew that algorithms involving only trivial calculations would run at the same speed, so the algorithms had to involve some degree of mathematical complexity.
With that in mind, my final choices were to port DSP code that performs 3-band equalization on audio data, and a cloth simulator. The URLs to the original sample code can be found in the references, along with their respective authors.
I tried as much as I could to keep the original authors' code intact so that it feels like a port and not a different implementation. The 3-band equalizer code was straightforward, but I had to add a few extra lines specifically for the cloth simulation.
During the implementation, I also included in the sample code a generic vector class, here called VClass, that does not follow the key features of VMath or XNA Math. It is a standard vector class where the data is encapsulated inside a class and no procedural interface is provided. I also ported the code to VClass just for performance comparison.
3-Band Equalizer Case Study
For the equalizer algorithm I admittedly cheated. There was not much I could do in my implementation of VMath compared to XNA Math. Precisely what I did was to port the equalizer code to XNA Math using overloaded operators, and then do the same for VMath using procedural calls, because I knew the overloaded operators would bloat the code.
However, my cheat had a purpose, and it was again to demonstrate that even if you are working with the fastest vector library, you can fall behind if you are unaware of how the compiler is generating the code.
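To make the two call styles concrete, here is a small hedged sketch of a single one-pole filter step written both ways. The variable names are placeholders and are not taken from the actual equalizer source, and since VMath's own procedural names are not listed in the text, the XNA Math procedural functions stand in for them.

#include <xnamath.h>

// Hypothetical one-pole low-pass step, purely to illustrate the call styles.

// Overloaded operators (the style used for the XNA Math port):
inline XMVECTOR FilterStepOperators(XMVECTOR f1, XMVECTOR lf, XMVECTOR sample)
{
    return f1 + lf * (sample - f1);
}

// Procedural calls (the style used for the VMath port; XNA Math names shown):
inline XMVECTOR FilterStepProcedural(XMVECTOR f1, XMVECTOR lf, XMVECTOR sample)
{
    return XMVectorAdd(f1, XMVectorMultiply(lf, XMVectorSubtract(sample, f1)));
}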
I also prepared the data to be SIMD friendly: the audio channels are interleaved so that loads and stores are minimized. However, to play the channels I had to reorganize the data to be sent to XAudio2 as four separate arrays of PCM audio data, which caused lots of loads and stores. For that reason I also included an FPU version that performs the calculations directly on another copy of the data stored as plain arrays of audio samples. The only advantage of the FPU version is that it requires no reorganization of the data arrays.
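As a rough sketch of that layout, and of where the extra loads and stores come from, the snippet below assumes four channels packed one sample per XMVECTOR lane; the function and buffer names are mine, not the sample's.

#include <cstddef>
#include <xnamath.h>

// Assumed layout: interleaved[i] holds one sample from each of the four
// channels in lanes x/y/z/w, so the equalizer touches all channels with a
// single load/store per frame. XAudio2, however, expects separate
// per-channel arrays, so the data has to be scattered back out afterwards.
void Deinterleave(const XMVECTOR* interleaved, float* ch0, float* ch1,
                  float* ch2, float* ch3, size_t numFrames)
{
    for (size_t i = 0; i < numFrames; ++i)
    {
        XMFLOAT4 frame;
        XMStoreFloat4(&frame, interleaved[i]);  // one SIMD store per frame...
        ch0[i] = frame.x;                       // ...then four scalar stores
        ch1[i] = frame.y;
        ch2[i] = frame.z;
        ch3[i] = frame.w;
    }
}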
Click here for the assembly code generated by VMath and XNA Math (refer to Table 4).
The results are very similar to what I showed during the discussion of "Overloaded Operators vs. Procedural Interface" (Table 2). The figure below shows a screenshot of the demo with runtime statistics.
Figure 1 - Three-Band Equalizer
The final statistics are also listed in the table below for clarity.
Library                            Average Time
VMath (procedural calls)           23.79
XNA Math (overloaded operators)    28.31
VClass                             73.49
FPU (*)                            78.09
Using VMath as the reference, it was able to run approximately 16 percent faster on average than XNA Math simply by using a different interface call. Even with the massive loads and stores needed to recompose the audio data passed to XAudio2, it was able to run more than three times faster than the FPU version, which is indeed quite impressive.
Notice also that VClass was only about 5 percent faster than the FPU version, due to the huge amount of code bloat created under the hood.
Cloth Simulation Case Study
The
simulation algorithm has two main loops where most of the calculations occur.
One is the "Verlet" integrator shown below:
void Cloth::Verlet()
{
    XMVECTOR d1 = XMVectorReplicate(0.99903f);
    XMVECTOR d2 = XMVectorReplicate(0.99899f);
    for(int i = 0; i < cClothPoints; i++)   // cClothPoints: assumed name for the point count
    {
        XMVECTOR& x = m_x[i];
        XMVECTOR temp = x;
        XMVECTOR& oldx = m_oldx[i];
        XMVECTOR& a = m_a[i];
        x += (d1*x) - (d2*oldx) + a*fTimeStep*fTimeStep;
        oldx = temp;
    }
}
The second is the constraint solver, which uses a Gauss-Seidel iteration loop to find the solution for all constraints, shown below:
void Cloth::SatisfyConstraints()
{
    XMVECTOR half = XMVectorReplicate(0.5f);
    for(int j = 0; j < cIterations; j++)   // cIterations: assumed name for the solver iteration count
    {
        m_x[0] = hook[0];
        m_x[cClothWidth-1] = hook[1];
        for(int i = 0; i < cClothPoints; i++)   // cClothPoints: assumed name for the point count
        {
            XMVECTOR& x1 = m_x[i];
            for(int cc = 0; cc < cnstr[i].cNumConstraints; cc++)   // cNumConstraints: assumed member name
            {
                int i2 = cnstr[i].cIndex[cc];
                XMVECTOR& x2 = m_x[i2];
                XMVECTOR delta = x2 - x1;
                XMVECTOR deltalength = XMVectorSqrt(XMVector4Dot(delta, delta));
                XMVECTOR diff = (deltalength - restlength) / deltalength;
                x1 += delta*half*diff;
                x2 -= delta*half*diff;
            }
        }
    }
}
Inspecting the code, it is clear that changing overloaded operators to procedural calls won't do much: the mathematical expressions are simple enough that the optimizer should generate the same code either way.
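For reference, here is a sketch of the body of that Verlet update written with procedural calls. Since VMath's procedural names are not listed in the text, the XNA Math equivalents are used; the point is precisely that the optimizer emits essentially the same instructions as the operator form.

// Same update as "x += (d1*x) - (d2*oldx) + a*fTimeStep*fTimeStep;",
// spelled out with procedural calls (XNA Math names used as stand-ins).
XMVECTOR accel = XMVectorScale(a, fTimeStep * fTimeStep);
XMVECTOR damp  = XMVectorSubtract(XMVectorMultiply(d1, x),
                                  XMVectorMultiply(d2, oldx));
x = XMVectorAdd(x, XMVectorAdd(damp, accel));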
I also removed the float casting from the original code, as described in "Keep Results Into SIMD Registers." Now the loop is pretty tight and SIMD friendly, so what's left? The only thing left is to look inside XNA Math to see if there is anything else that can be done.
It turned out that there was. First, looking at XMVector4Dot, we see it is implemented as:
XMFINLINE XMVECTOR XMVector4Dot(FXMVECTOR V1, FXMVECTOR V2)
{
    XMVECTOR vTemp2 = V2;
    XMVECTOR vTemp = _mm_mul_ps(V1,vTemp2);
    vTemp2 = _mm_shuffle_ps(vTemp2,vTemp,_MM_SHUFFLE(1,0,0,0));
    vTemp2 = _mm_add_ps(vTemp2,vTemp);
    vTemp = _mm_shuffle_ps(vTemp,vTemp2,_MM_SHUFFLE(0,3,0,0));
    vTemp = _mm_add_ps(vTemp,vTemp2);
    return _mm_shuffle_ps(vTemp,vTemp,_MM_SHUFFLE(2,2,2,2));
}
The implementation is composed of one multiplication, three shuffles, and two adds.
So I went and wrote a different SSE2 4D dot product that produces the same results but with one fewer shuffle instruction, as follows:
inline Vec4 Dot(Vec4 va, Vec4 vb)
{
    Vec4 t0 = _mm_mul_ps(va, vb);
    Vec4 t1 = _mm_shuffle_ps(t0, t0, _MM_SHUFFLE(1,0,3,2));
    Vec4 t2 = _mm_add_ps(t0, t1);
    Vec4 t3 = _mm_shuffle_ps(t2, t2, _MM_SHUFFLE(2,3,0,1));
    Vec4 dot = _mm_add_ps(t3, t2);
    return (dot);
}
Unfortunately,
to my surprise, my new 4D Dot did not make much of a difference, likely because
it had more interdependencies between instructions.
As I looked further, I found one puzzling detail in XNA Math: it calls the reciprocal function to perform its divide. In fact, the divide is chained through multiple calls from the overloaded operator, as shown below:
XMFINLINE XMVECTOR XMVectorReciprocal(FXMVECTOR V)
{
    return _mm_div_ps(g_XMOne,V);
}

XMFINLINE XMVECTOR operator/ (FXMVECTOR V1, FXMVECTOR V2)
{
    XMVECTOR InvV = XMVectorReciprocal(V2);
    return XMVectorMultiply(V1, InvV);
}
The only problem is that it loads "g_XMOne" and then calls the intrinsic to perform the divide. I am not quite sure why Microsoft implemented it this way, but it would have been better to simply call the divide directly. So for VMath I implemented it that way, meaning without the extra load, as follows:
inline Vec4 VDiv(Vec4 va, Vec4 vb)
{
    return (_mm_div_ps(va, vb));
}
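As a usage note, the constraint solver's divide then maps onto a single divps. The operator overload below is a hypothetical sketch of how VMath might route operator/ through VDiv; the article does not show VMath's actual operator definitions.

// Hypothetical VMath-style overload built directly on VDiv.
inline Vec4 operator/(Vec4 va, Vec4 vb)
{
    return VDiv(va, vb);  // one divps, no load of a "one" constant first
}

// With that in place, the expression from the constraint solver,
//     diff = (deltalength - restlength) / deltalength;
// compiles to a subtract followed by a single divide.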
Now let's look at the assembly of the "Verlet" and "Constraint Solver" functions generated with both libraries, including the small optimizations implemented in VMath.
Click here to download the table document (refer to Table 5 and Table 6).
Although I wrote the "Verlet" function using procedural calls for VMath and overloaded operators for XNA Math, there was no difference; I suspect the math expressions were not complex enough to cause code bloat. However, the "Constraint Solver" came out exactly one instruction shorter thanks to the faster divide. I highlighted the extra load of "g_XMOne" in the table, which is likely the cause of the extra movaps instruction. The screenshot below shows the results.
Figure 2 - Cloth Simulation
Library                            Average Time
VMath (procedural calls)           75.78
XNA Math (overloaded operators)    76.73
VClass                             138.52
FPU                                Not Implemented
Technically it was a knockout by VMath because of that one instruction; however, the statistics were not sufficient to show a significant speed boost: only approximately 1 percent faster.
In fact, the results were so close that they fluctuated depending on other factors, such as the OS doing background tasks. Because of this, the conclusion is that VMath and XNA Math were practically even.
Nonetheless, both libraries were able to run approximately 50 percent faster than VClass. This is still impressive, considering that the major difference is how the libraries interface with the same intrinsics.