Gamasutra: The Art & Business of Making Games
Designing Fast Cross-Platform SIMD Vector Libraries


January 20, 2010 (Page 2 of 5)

4. Overloaded Operators vs. Procedural Interface

Another C++ feature that became popular in most modern applications thanks to its clarity is operator overloading. However, as a general rule, overloaded operators also bloat the code. A fast SIMD vector library should therefore provide a procedural, C-like interface as well.

The code bloat generated by overloaded operators does not arise in simple math expressions, which is probably why most developers ignore the problem when first designing a vector library. As the expressions get more complex, however, the optimizer has to do extra work to guarantee correct results. This involves creating unnecessary temporary variables that usually translate into extra load/store operations between SIMD registers and memory.

As an example, let's compile a 3-band equalizer filter written both with overloaded operators and with a procedural interface, then look at the generated assembly code. The version written with overloaded operators looks like:

Vec4 do_3band(EQSTATE* es, Vec4& sample)
{
    Vec4 l, m, h;

    es->f1p0 += (es->lf * (sample - es->f1p0)) + vsa;
    es->f1p1 += (es->lf * (es->f1p0 - es->f1p1));
    es->f1p2 += (es->lf * (es->f1p1 - es->f1p2));
    es->f1p3 += (es->lf * (es->f1p2 - es->f1p3));
    l = es->f1p3;

    es->f2p0 += (es->hf * (sample - es->f2p0)) + vsa;
    es->f2p1 += (es->hf * (es->f2p0 - es->f2p1));
    es->f2p2 += (es->hf * (es->f2p1 - es->f2p2));
    es->f2p3 += (es->hf * (es->f2p2 - es->f2p3));
    h = es->sdm3 - es->f2p3;
    m = es->sdm3 - (h + l);

    l *= es->lg;
    m *= es->mg;
    h *= es->hg;

    es->sdm3 = es->sdm2;
    es->sdm2 = es->sdm1;
    es->sdm1 = sample;

    return (l + m + h);
}

Now the same code re-written using a procedural interface:

Vec4 do_3band(EQSTATE* es, Vec4& sample)
{
    Vec4 l, m, h;

    es->f1p0 = VAdd(es->f1p0, VAdd(VMul(es->lf, VSub(sample, es->f1p0)), vsa));
    es->f1p1 = VAdd(es->f1p1, VMul(es->lf, VSub(es->f1p0, es->f1p1)));
    es->f1p2 = VAdd(es->f1p2, VMul(es->lf, VSub(es->f1p1, es->f1p2)));
    es->f1p3 = VAdd(es->f1p3, VMul(es->lf, VSub(es->f1p2, es->f1p3)));
    l = es->f1p3;

    es->f2p0 = VAdd(es->f2p0, VAdd(VMul(es->hf, VSub(sample, es->f2p0)), vsa));
    es->f2p1 = VAdd(es->f2p1, VMul(es->hf, VSub(es->f2p0, es->f2p1)));
    es->f2p2 = VAdd(es->f2p2, VMul(es->hf, VSub(es->f2p1, es->f2p2)));
    es->f2p3 = VAdd(es->f2p3, VMul(es->hf, VSub(es->f2p2, es->f2p3)));
    h = VSub(es->sdm3, es->f2p3);
    m = VSub(es->sdm3, VAdd(h, l));

    l = VMul(l, es->lg);
    m = VMul(m, es->mg);
    h = VMul(h, es->hg);

    es->sdm3 = es->sdm2;
    es->sdm2 = es->sdm1;
    es->sdm1 = sample;

    return (VAdd(l, VAdd(m, h)));
}

Finally, let's look at the assembly code of both versions (refer to Table 2 in the downloadable table document).

The code that used procedural calls was approximately 21 percent smaller. Notice also that this code is equalizing four streams of audio simultaneously, so compared to doing the same calculation on the FPU the speed boost would be tremendous.

I should emphasize that I do not discourage supporting overloaded operators in a vector library. Indeed, I encourage providing both interfaces, so that developers can switch to procedural code when expressions get complex enough to create bloat.

In fact, when I was testing VMath, I often wrote my math expressions first using overloaded operators, then rewrote the same code using procedural calls to see if any difference appeared. As stated before, the amount of code bloat depended on the complexity of the expressions.
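One way to provide both interfaces without maintaining two code paths is to implement the overloaded operators as thin wrappers over the procedural calls. A minimal sketch of the pattern (using a plain scalar struct as a stand-in; a real VMath-style library would back these primitives with per-platform SIMD intrinsics):

```cpp
#include <cassert>

// Stand-in Vec4; a real library would use a SIMD register type here.
struct Vec4 { float x, y, z, w; };

// Procedural interface: the primitives the library actually implements.
inline Vec4 VAdd(Vec4 a, Vec4 b) { return {a.x + b.x, a.y + b.y, a.z + b.z, a.w + b.w}; }
inline Vec4 VMul(Vec4 a, Vec4 b) { return {a.x * b.x, a.y * b.y, a.z * b.z, a.w * b.w}; }

// Overloaded operators are thin wrappers over the procedural calls, so both
// styles stay in sync and developers can drop to procedural code whenever an
// expression gets complex enough to bloat.
inline Vec4 operator+(Vec4 a, Vec4 b) { return VAdd(a, b); }
inline Vec4 operator*(Vec4 a, Vec4 b) { return VMul(a, b); }
```

With this layout, the operator version and the procedural version of an expression compile down to the same primitives, and only the surrounding temporaries differ.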

5. Inline Everything

The inline keyword lets the compiler get rid of expensive function calls, optimizing the code at the price of some code bloat. It is common practice to inline all vector functions so that no function calls are performed.

The Windows compilers (Microsoft and Intel) are really impressive about deciding when to inline and when not to. It is best to leave this job to the compiler, so there is nothing to lose by adding inline to all your vector functions.
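In practice, libraries often wrap the inlining decision in a per-platform macro so the hint can be strengthened or weakened per compiler. A sketch, assuming the usual compiler-detection macros (the `VM_INLINE` name is illustrative, not from the article):

```cpp
// Hypothetical per-platform inline macro: force inlining where the compiler
// supports it, and fall back to plain inline elsewhere.
#if defined(_MSC_VER)
  #define VM_INLINE __forceinline
#elif defined(__GNUC__)
  #define VM_INLINE inline __attribute__((always_inline))
#else
  #define VM_INLINE inline
#endif

// Stand-in Vec4 and an example vector op tagged with the macro.
struct Vec4 { float x, y, z, w; };

VM_INLINE Vec4 VScale(Vec4 v, float s)
{
    return {v.x * s, v.y * s, v.z * s, v.w * s};
}
```

On a platform where forced inlining hurts the I-Cache, only the macro definition has to change, not every function in the library.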

One problem with inline, however, is instruction cache (I-Cache) misses. If the compiler is not good enough to realize when a function is "too big" for the target platform, you can run into a serious problem: the code not only bloats but also causes many I-Cache misses, and the vector functions can end up much slower.

I could not come up with a single case in Windows development where an inlined vector function caused enough code bloat to hurt the I-Cache. In fact, even when I don't inline functions, the Windows compilers are smart enough to inline them for me. However, when I did PSP and PS3 development, the GCC port was not so smart; indeed, there were cases where it was better not to inline.

But if you fall into one of these specific cases, it is always trivial to wrap your inline function in a platform-specific call that is not inlined, such as:

Vec4 DotPlatformSpecific(Vec4 va, Vec4 vb)
{
    return (Dot(va, vb));
}

On the other hand, if the vector library is designed without inline, you cannot easily force the functions to be inlined. You then rely on the compiler being smart enough to inline them for you, which may not work as you expect on every platform.

6. Replicate Results Into SIMD Registers and Provide Data Accessors

When working with SIMD instructions, doing one calculation or four takes the same time. But what if you don't need the result in all four lanes? It is still to your advantage to replicate the result across the SIMD register, since the cost is the same. Replicating results into the SIMD quad-word helps the compiler optimize vector expressions.

For similar reasons, it is important to provide data accessors such as GetX, GetY, GetZ, and GetW. By providing this type of interface, and assuming the developer uses it, the vector library can minimize expensive transfers between SIMD registers and FPU registers.
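Both ideas can be sketched together with an SSE-backed dot product (assuming an x86 target; the `VDot`/`GetX`/`GetW` names follow the article's VMath-style convention but are illustrative):

```cpp
#include <xmmintrin.h>  // SSE intrinsics

typedef __m128 Vec4;

// Dot product that leaves the scalar result replicated in all four lanes,
// so the value can keep flowing through further SIMD expressions.
inline Vec4 VDot(Vec4 a, Vec4 b)
{
    Vec4 m = _mm_mul_ps(a, b);
    // Horizontal add via shuffles; the sum ends up in every lane.
    Vec4 t = _mm_add_ps(m, _mm_shuffle_ps(m, m, _MM_SHUFFLE(2, 3, 0, 1)));
    return _mm_add_ps(t, _mm_shuffle_ps(t, t, _MM_SHUFFLE(1, 0, 3, 2)));
}

// Accessors: move a lane out of the SIMD register only when the caller
// genuinely needs a scalar float.
inline float GetX(Vec4 v) { return _mm_cvtss_f32(v); }
inline float GetW(Vec4 v)
{
    return _mm_cvtss_f32(_mm_shuffle_ps(v, v, _MM_SHUFFLE(3, 3, 3, 3)));
}
```

Because the dot product stays in a register with its value replicated, code like `VMul(n, VDot(n, l))` never round-trips through the FPU; the accessors exist for the cases where a scalar truly is required.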

Shortly I will discuss a classic example that illustrates this problem.

