Gamasutra is part of the Informa Tech Division of Informa PLC

This site is operated by a business or businesses owned by Informa PLC and all copyright resides with them. Informa PLC's registered office is 5 Howick Place, London SW1P 1WG. Registered in England and Wales. Number 8860726.   Contents
Designing Fast Cross-Platform SIMD Vector Libraries Latest Jobs
View All     RSS
June 12, 2021 Latest Blogs
View All     Post     RSS
June 12, 2021 Press Releases
June 12, 2021
Games Press
View All     RSS About Gama Network
If you enjoy reading this site, you might also want to check out these UBM Tech sites: Features

# Designing Fast Cross-Platform SIMD Vector Libraries

January 20, 2010  Page 4 of 5 # XNA Math vs VMath, The Heavyweight Fight!

Assuming you are working on high-level code, being faster than XNA Math is quite difficult. The options left are: optimize its functions or find flaws on a well-revised library, if they exist. But also as stated earlier, even if you are working with XNA Math and you don't know how to use it to its full potential you can bloat the code and be left behind.

To attempt such a challenge I searched for algorithms that could make the difference be noticeable when compiled by VMath or XNA Math. I knew that algorithms involving only trivial calculations would run at the same speed. Therefore, the algorithms had to have some sort of mathematical complexity involved.

With that in mind, my final choices were to port a DSP code that performs a 3-band equalizer on audio data and a cloth simulator. The URLs to original sample codes can be found at the references, along with their respective authors.

I tried as much as I could to keep the original author's code intact so that it feels like a port and not a different implementation. The 3-band equalizer code was straightforward, but I had to add few extra lines specifically for the cloth simulation.

During the implementation, I also included into the sample code a generic vector class, here called VClass, that does not follow the key features of VMath or XNA Math. It's a standard vector class where the data is encapsulated inside a class and does not provide a procedural interface. I also ported the code to VClass just for performance comparison.

### 3-Band Equalizer Case Study

For the equalizer algorithm I admittedly cheated. There was not much I could do to my implementation of VMath compared to XNA Math. Precisely what I did was to port the equalizer code using XNA Math and overloaded operators. Then for VMath I did the same but using procedural calls, because I knew overloaded operators would bloat the code.

However, my cheat had a purpose, and it was again to demonstrate that even if you are working on the fastest vector library, you could fall behind if unaware of how compilers are generating the code.

Also I prepared the data to be SIMD friendly. So the audio channels are interleaved so that the loads and stores are minimized. However, to play the channels I had to re-organize the data to be sent to XAudio2 as four arrays of streamlined PCM audio data. That caused lots of loads and stores. For that reason I also included an FPU version that performs the calculations directly to another copy of the data that is stored as plain arrays of audio samples. The only advantage of the FPU version was no re-organization of data arrays.

Click here for the assembly code generated by VMath and XNA Math -- refer to Table 4.

The results are very similar to what I have shown during the discussion of "Overloaded Operators vs. Procedural Interface" (Table 2). The figure below shows a screenshot of the demo with runtime statistics. Figure. 1 - Three Band Equalizer

The final statistics are also re-written on the table below for clarity.

 Library Average Time VMath (procedural calls) 23.79 XNA Math (overloaded operators) 28.31 VClass 73.49 FPU (*) 78.09

Using VMath as reference, it was able to go approximately 16 percent faster on average than XNA Math by simply using a different interface call. Even with massive loads and stores to re-compose the audio data to be passed to XAudio2; it was able to go more than three times faster than the FPU version. Which is indeed quite impressive.

Notice also that the VClass was just only 5 percent faster than the FPU version due to the huge amount of code bloat created under the hood.

### Cloth Simulation Case Study

The simulation algorithm has two main loops where most of the calculations occur. One is the "Verlet" integrator shown below:

void Cloth::Verlet()
{
XMVECTOR d1 = XMVectorReplicate(0.99903f);
XMVECTOR d2 = XMVectorReplicate(0.99899f);

for(int i=0; i
{
XMVECTOR& x = m_x[i];
XMVECTOR temp = x;
XMVECTOR& oldx = m_oldx[i];
XMVECTOR& a = m_a[i];
x += (d1*x)-(d2*oldx)+a*fTimeStep*fTimeStep;
oldx = temp;
}
}

The second is the constraint solver that uses Gauss-Seidel iteration loop to find the solution for all constraints shown below:

void Cloth::SatisfyConstraints()
{
XMVECTOR half = XMVectorReplicate(0.5f);

for(int j=0; j
{
m_x = hook;
m_x[cClothWidth-1] = hook;

for(int i=0; i
{
XMVECTOR& x1 = m_x[i];

for(int cc=0; cc
{
int i2 = cnstr[i].cIndex[cc];

XMVECTOR& x2 = m_x[i2];
XMVECTOR delta = x2-x1;
XMVECTOR deltalength = XMVectorSqrt(XMVector4Dot(delta,delta));
XMVECTOR diff = (deltalength-restlength)/deltalength;
x1 += delta*half*diff;
x2 -= delta*half*diff;
}
}
}
}

By inspecting the code, changing overloaded operators to procedural calls won't do much. The mathematical expressions are simple enough that the optimizer should generate the same code.

I also removed the floating casting from the original code as described in "Keep Results Into SIMD Registers." Now the loop is pretty tight and SIMD friendly, so what's left? The only thing left is to look inside XNA Math to see if there is anything else that can be done.

It turned out that there was. First by looking at the XMVector4Dot, it is implemented as:

XMFINLINE XMVECTOR XMVector4Dot(FXMVECTOR V1, FXMVECTOR V2)
{ XMVECTOR vTemp2 = V2;
XMVECTOR vTemp = _mm_mul_ps(V1,vTemp2);
vTemp2 = _mm_shuffle_ps(vTemp2,vTemp,_MM_SHUFFLE(1,0,0,0));
vTemp2 = _mm_add_ps(vTemp2,vTemp);
vTemp = _mm_shuffle_ps(vTemp,vTemp2,_MM_SHUFFLE(0,3,0,0));
vTemp = _mm_add_ps(vTemp,vTemp2);
return _mm_shuffle_ps(vTemp,vTemp,_MM_SHUFFLE(2,2,2,2));
}

The implementation is composed by 1 multiplication, 3 shuffles, and 2 adds.

So, I went and wrote a different SSE2 4D Dot that produces the same results but with one less shuffle instruction, as follows:

inline Vec4 Dot(Vec4 va, Vec4 vb)
{
Vec4 t0 = _mm_mul_ps(va, vb);
Vec4 t1 = _mm_shuffle_ps(t0, t0, _MM_SHUFFLE(1,0,3,2));
Vec4 t2 = _mm_add_ps(t0, t1);
Vec4 t3 = _mm_shuffle_ps(t2, t2, _MM_SHUFFLE(2,3,0,1));
Vec4 dot = _mm_add_ps(t3, t2);
return (dot);
}

Unfortunately, to my surprise, my new 4D Dot did not make much of a difference, likely because it had more interdependencies between instructions.

As I looked further, I found one puzzling detail on XNA Math. It calls the reciprocal function to perform its divide. In fact the function chained by multiple calls from the overloaded operator as such:

XMFINLINE XMVECTOR XMVectorReciprocal(FXMVECTOR V)
{
return _mm_div_ps(g_XMOne,V);
}

XMFINLINE XMVECTOR operator/ (FXMVECTOR V1,FXMVECTOR V2)
{
XMVECTOR InvV = XMVectorReciprocal(V2);
return XMVectorMultiply(V1, InvV);
}

The only problem is that it loads the "g_XMOne" and then calls the intrinsics to perform the divide. I am not quite clear why Microsoft implemented this way, but it would have been better to simply call the divide directly. So for VMath I implemented this way, meaning without extra loads as such:

inline Vec4 VDiv(Vec4 va, Vec4 vb)
{
return(_mm_div_ps(va, vb));
};

Now let's look at the assembler of the "Verlet" and "Constraint Solver" using both libraries, with the small optimizations implemented in VMath.

Click here to download the Table document -- refer to Table 5 and Table 6.

Although, I wrote the "Verlet" function using procedural calls for VMath and overloaded operators for XNA Math, there was no difference. I expect the math expressions were not complex enough to cause code bloat. However, the "Contraint Solver" got precisely one instruction smaller due to the faster divide. I highlighted the extra load of the "g_XMOne" on the table that likely the cause of the extra movaps instruction. The screen shot below shows the results. Figure. 2 - Cloth Simulation

 Library Average Time VMath (procedural calls) 75.78 XNA Math (overloaded operators) 76.73 VClass 138.52 FPU Not Implemented

Technically it was a knockout by VMath because of one instruction; however, the statistics were not sufficient enough to show significant speed boost -- approximately only 1 percent faster.

In fact the results were so close that they floated depending on other factors, such as the OS doing some background tasks. Because of this kind of result, the conclusion is that VMath and XNA Math were practically even.

Nonetheless, both libraries were able to go approximately 50 percent faster than VClass. This is still impressive, counting the fact that the major difference is how the libraries interface with the same intrinsics.  Page 4 of 5 ### Top Stories

Loading Comments 