
by Haim Barad
Gamasutra
January 31, 2000


Examples

This section provides some examples of Streaming SIMD Extensions usage, compared with equivalent scalar code.

Using SIMD code

First, let's look at one of the most fundamental operations in the library: multiplying a vector by a matrix.

Scalar Code

The standard scalar approach computes each element of the destination vector by multiplying the source vector by the corresponding column of the matrix. The computation takes 16 multiplications and 12 additions.
 
void scalarVectorMult (const GPVector &Vec, const GPMatrix &Mat, GPVector &Res) {
    Res.x = Vec.x*Mat._11 + Vec.y*Mat._21 + Vec.z*Mat._31 + Vec.w*Mat._41;
    Res.y = Vec.x*Mat._12 + Vec.y*Mat._22 + Vec.z*Mat._32 + Vec.w*Mat._42;
    Res.z = Vec.x*Mat._13 + Vec.y*Mat._23 + Vec.z*Mat._33 + Vec.w*Mat._43;
    Res.w = Vec.x*Mat._14 + Vec.y*Mat._24 + Vec.z*Mat._34 + Vec.w*Mat._44;
}
Figure 3. Vector Multiplication – Scalar code

 
SIMD Code

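void VectorMult(const GPVector &Vec, const GPMatrix &Mat, GPVector &Res) {
    F32vec4 Result;
    Result = F32vec4(Vec.x) * Mat._L1;   // expand x, multiply by the first line
    Result += F32vec4(Vec.y) * Mat._L2;  // add expanded y times the second line
    Result += F32vec4(Vec.z) * Mat._L3;  // add expanded z times the third line
    Result += F32vec4(Vec.w) * Mat._L4;  // add expanded w times the fourth line
    Res = Result;
}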
Figure 4. Vector Multiplication – SIMDified code

A closer look at the scalar code shows that the whole vector can be calculated at once. In the scalar code, Vec.x is multiplied by the four elements _11 through _14. Those four elements make up the first line of the matrix and are already stored together in one SIMD variable. So we only need to expand the X element of the vector across all four slots and multiply it by the first line of the matrix. This is done in the first assignment (third line) of the SIMD code. Next, we multiply the expanded Y, Z, and W elements of the vector by the second, third, and fourth lines of the matrix, respectively, in the same way. Note that the X element of the result vector is the first element of the sum of these four product vectors, the Y element is the second, and so on. The final result is therefore just the sum of the four vectors, and you don't even need to rearrange the results.

This equivalent SIMD code takes only 4 multiplications and 3 additions. Since one SIMD operation takes about half the time of four scalar operations, the SIMD code runs more than twice as fast as the scalar code.
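
The expand-and-accumulate pattern described above can also be sketched with the raw SSE intrinsics from xmmintrin.h instead of the F32vec4 class. This is only a minimal sketch, not the library's implementation: Vec4, Mat4, mulScalar, and mulSse are hypothetical names introduced for illustration, and the matrix is assumed to be stored line by line (row-major).

```cpp
#include <xmmintrin.h>  // SSE intrinsics: _mm_set1_ps, _mm_mul_ps, _mm_add_ps

// Hypothetical stand-ins for GPVector and GPMatrix (assumptions, not the library's types).
struct Vec4 { float x, y, z, w; };
struct Mat4 { float m[4][4]; };  // row-major: m[i][j] corresponds to element _(i+1)(j+1)

// Scalar reference: each result element is the source vector times one matrix column.
Vec4 mulScalar(const Vec4 &v, const Mat4 &a) {
    Vec4 r;
    r.x = v.x * a.m[0][0] + v.y * a.m[1][0] + v.z * a.m[2][0] + v.w * a.m[3][0];
    r.y = v.x * a.m[0][1] + v.y * a.m[1][1] + v.z * a.m[2][1] + v.w * a.m[3][1];
    r.z = v.x * a.m[0][2] + v.y * a.m[1][2] + v.z * a.m[2][2] + v.w * a.m[3][2];
    r.w = v.x * a.m[0][3] + v.y * a.m[1][3] + v.z * a.m[2][3] + v.w * a.m[3][3];
    return r;
}

// SIMD version: expand each vector element across four slots, multiply it by
// one matrix line, and accumulate. 4 SIMD multiplications, 3 SIMD additions.
Vec4 mulSse(const Vec4 &v, const Mat4 &a) {
    __m128 acc = _mm_mul_ps(_mm_set1_ps(v.x), _mm_loadu_ps(a.m[0]));
    acc = _mm_add_ps(acc, _mm_mul_ps(_mm_set1_ps(v.y), _mm_loadu_ps(a.m[1])));
    acc = _mm_add_ps(acc, _mm_mul_ps(_mm_set1_ps(v.z), _mm_loadu_ps(a.m[2])));
    acc = _mm_add_ps(acc, _mm_mul_ps(_mm_set1_ps(v.w), _mm_loadu_ps(a.m[3])));

    float out[4];
    _mm_storeu_ps(out, acc);  // unaligned store, so no alignment requirement on out
    return Vec4{out[0], out[1], out[2], out[3]};
}
```

The unaligned load and store keep the sketch free of alignment requirements. A library that controls its own data layout would more likely keep the matrix lines in 16-byte-aligned SIMD variables, as the Mat._L1 to Mat._L4 members in Figure 4 suggest.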

Some Numbers

This table shows the performance gain of the Matrix Library compared with Microsoft*'s D3DXMATRIX class from D3DXMath.H in DirectX* 7, which implements the same functions in scalar code.

Please note that this is a synthetic test, so these ratios may differ when measured in a real application.
 
Function                Scalar Code (cycles)   Matrix Library (cycles)   Ratio
Vector Multiplication   60 (I)                 26                        2.30
Matrix Multiplication   282 (I)                87                        3.26
Inverse Matrix          328                    170                       1.92
Make Rotation Matrix    169                    143 (II)                  1.18

(I) The scalar version of the function is not inlined, so accurate numbers should be ~10 cycles less.
(II) There is another version of this function that uses fast approximations and takes only 112 cycles.

And Finally – A Real Example

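The last example presents code for calculating the exponent of a matrix. The exponent of a real number can be calculated using its Taylor series. The exponent of a matrix is defined in the same way:

\[ e^{A} = \sum_{k=0}^{\infty} \frac{A^{k}}{k!} = I + A + \frac{A^{2}}{2!} + \frac{A^{3}}{3!} + \cdots \]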

There are three versions for calculating the series:

  1. Using scalar code.
  2. Using the GPMatrix class.
  3. All the functions as inlined assembly (with no cross optimizations between the functions).

The first and second versions are written using the classes' operators. In fact, I wrote them in about ten minutes.
The third version, written in inlined assembly, took much longer to write: about an hour or two. In addition, this version is not fully optimized. The optimized functions are simply placed end to end, without cross-procedural optimizations.

The first two versions are quite readable and can be modified easily. The third version is written in assembly, and is therefore much harder to modify. See the source file Exponent.cpp which is part of the MatLib.zip file.

The next table shows the average time an iteration takes for each version:

Version            Average Time
Scalar code        371
GPMatrix code      130
Inlined assembly   144

Note that using GPMatrix instead of scalar code gives an improvement by a factor of 2.85! Even if the scalar functions were fully inlined, the GPMatrix code would still be more than twice as fast.

The GPMatrix version even beats the assembly version, because the compiler did a better job of optimizing across the inlined functions. However, even fully optimized assembly code (i.e., with hand optimizations between functions) would not give much better results.

Conclusion

The source files of the library and of the last example can be found in MatLib.zip.

This library demonstrates how a performance library can be written with hardly any assembly. Usually, well-written assembly is more efficient than C/C++. In this case, however, writing the library functions without inlined assembly allows the compiler to perform inter-procedural optimizations on your code.

Links

The rotation matrix calculation that uses fast approximations relies on the sine function from the Approximate Math (AM) library, which can be downloaded from http://developer.intel.com/design/pentiumiii/devtools/.

Other Resources
Two related application notes:
AP-928 - Inverse of 4x4 Matrix
AP-930 - Matrix Multiplication

Some Useful Links:

Download an evaluation copy of the Intel C/C++ Compiler at http://developer.intel.com/vtune/compilers/cpp/demo.htm.

The VTune Analyzer can be used to measure the time consumed by individual functions, and more. Download an evaluation copy at http://developer.intel.com/vtune/.

Haim Barad has a Ph.D. in Electrical Engineering (1987) from the University of Southern California. His areas of concentration are in 3D graphics, video and image processing. Haim was on the Electrical Engineering faculty at Tulane University before joining Intel in 1995. Haim is a staff engineer and currently leads the Media Team at Intel's Israel Design Center (IDC) in Haifa, Israel.
