
Optimizations
Corner: An Optimized Matrix Library in C++
By
Haim
Barad
Gamasutra
January
31, 2000
URL: http://www.gamasutra.com/features/20000131/barad_01.htm
Last month we discussed microarchitectural "snags" that can slow down your code. This month, we're going to get back on the fast track and show you a set of optimized matrix routines. As you know, SIMD architectural enhancements done to recent microprocessors (MMX Technology and Streaming SIMD Extensions) are a perfect fit for matrix and vector operations. This month, Zvi Devir (a student member of my team studying Math and CS at The Technion in Haifa, Israel) will explain and present a matrix library for 3D graphics fully optimized for the Pentium® III processor. The library was written mostly in C++ and compiled with the Intel C++ compiler. The full sources are made completely available.
Introduction
This article is about one of the most common elements in 3D graphics:3D transformation matrices and vectors. As we all know, 3D graphics uses [x,y,z] or [x,y,z,w] vectors to represent points in the three-dimensional space or in the homogeneous space, respectively. You can manipulate these vectors by multiplying them with 4x4 transformation matrices. The standard manipulation matrices are translation, rotation and scaling.
For example, to rotate a vector by 90° around the X axis and then move it 20 units toward the positive infinity of the Y axis, create a rotation matrix and a translation matrix, and multiply the vector with both of them. It is simpler and faster to multiply both the matrices in advance. Later, you can multiply the vector with the product when needed.
However, standard multiplication of one matrix by another takes 64 products and 48 sums. A standard multiplication of a vector by a matrix takes 16 products and 12 sums.
This article presents a better way to manipulate vectors and matrices using Intel’s Streaming SIMD Extensions, to enable the calculation of 4 products or 4 additions at once. The SIMD register can hold 4 floating-point numbers with single precision. Hence one register can hold a full vector, and four registers can hold a full 4x4 matrix.
The Library
Using the Streaming SIMD Extensions, you can complete a matrix multiplication with only 16 products and 12 additions. The library provided in this article was written with the goal to get the most out of the Streaming SIMD Extensions, and to reduce the amount of time needed for matrix/vector operations. Most of the library is written in C++, except for one small section in assembly. The library’s functions are about twice as fast as the equivalent scalar version of those functions.The library includes three classes:
GPMatrix class
The GPMatrix class is a 4x4 matrix of floats.
|
union { struct { __m128 _L1, _L2, _L3, _L4; }; struct { float _11, _12, _13, _14; float _21, _22, _23, _24; float _31, _32, _33, _34; float _41, _42, _43, _44; }; }; // ... }; |
|
Figure.
1 The GPMatrix Class
|
The class
contains 16 float elements (_11 to _44). The elements are placed in four lines,
where each line is represented as one SIMD variable (_L1 to _L1).
Data elements can be referenced by their row and column:
There is one limitation
for the use of this class:
The class must be 16-byte aligned, so it won’t split between too many cache
lines, and to enable faster reading/writing of the class’ elements.
However, the class is aligned automatically by the Intel
C++ compiler.
The GPVector
class
The GPVector class is a vector of 4 floats.
|
union { __m128 vec; struct { float x,y,z,w; }; }; // ... }; |
|
Figure
2. The GPVector Class
|
The GPVector class has the x,y,z and w elements of the vector as floats, and also represented as one SIMD variable. As with the GPMatrix class, the GPVector class must be 16-bytes aligned.
A variant of the GPVector class is the GPVector3 class, which does not have the w element. This class holds "pure" 3D vectors. However, for alignment and for other reasons, the w element, which is not used, is replaced with a spacer.
Constructors & Operators
Operators on GPMatrix:
GPMatrix Constructors:
A * B matrices multiplication A ± B matrices addition/subtraction ±A matrix unary minus/plus A * s matrix multiplication with scalar A *= B matrix multiplied by matrix A *= s matrix multiplied by scalar A ±= B matrix added/subtracted by matrix matrix transpose matrix inverse matrix determinant matrix min/max element
Operators on GPVector:
Identity matrix Zero matrix Rotation matrices (around the X axis, Y axis and Z axis) Translation matrices Scaling matrices
v * M vector multiplication with matrix v * s vector multiplication with scalar v * w vectors dot (inner) product v % w vectors cross product (in 3D) v ± w vectors addition/subtraction ± v vector unary minus/plus ~ v vector normalization v *= M vector multiplied by matrix v *= s vector multiplied by scalar v ±= w vector added/subtracted by vector
Examples
This section provides some examples of Streaming SIMD Extensions usage, compared with equivalent scalar code.
Using SIMD code
First, lets take a look on one of the most obvious and fundamental operator in the library – multiplying a vector by a matrix.
Scalar Code
The scalar and standard way is to calculate each element of the destination
vector by multiplying the source vector with the appropriate column of the matrix.
The computation takes 16 multiplications and 12 sums.
| void scalarVectorMult (const
GPVector &Vec, const GPMatrix &Mat, GPVector &Res) { Res.x = Vec.x*Mat._11 + Vec.y*Mat._21 + Vec.z*Mat._31 + Vec.w*Mat._41; Res.y = Vec.x*Mat._12 + Vec.y*Mat._22 + Vec.z*Mat._32 + Vec.w*Mat._42; Res.z = Vec.x*Mat._13 + Vec.y*Mat._23 + Vec.z*Mat._33 + Vec.w*Mat._43; Res.w = Vec.x*Mat._14 + Vec.y*Mat._24 + Vec.z*Mat._34 + Vec.w*Mat._44; } |
|
Figure
3. Vector Multiplication – Scalar code
|
SIMD Code
| void VectorMult(const
GPVector &Vec, const GPMatrix &Mat, GPVector &Res) { F32vec4 Result; Result = F32vec4(Vec.x) * Mat._L1; Result += F32vec4(Vec.y) * Mat._L2; Result += F32vec4(Vec.z) * Mat._L3; Result += F32vec4(Vec.w) * Mat._L4; Res = Result; } |
|
Figure
4. Vector Multiplication – SIMDified code
|
If we take a closer look in the scalar multiplication process, we can see that we can calculate the whole vector at once: In the scalar code, Vec.x is multiplied with the first four elements of the matrix. Those four elements are represented as the first line of the matrix, and are already placed in one SIMD variable. So we only need to expand the X element of the vector and multiply it with the first line of the matrix. This is done in the first assignment (third line) of the SIMD code. Next we multiply the expanded Y, Z and W elements of the vector with the second, third and forth line of the matrix respectively, as for the first element. Note that the X element of the result vector is the first element in the sum of the four vectors calculated before, the Y element is the second and so on. Therefore, the final result is just the sum of all the four vectors, and you don’t even need to rearrange the results.
This equivalent SIMD code takes only 4 multiplications and 3 sums. Since one SIMD calculation takes about half the time of four scalar calculations, the SIMD code runs more than twice as fast as the scalar code.
Some Numbers
This table shows the performance gain of using the Matrix Library, compared to Microsoft*‘s D3DXMATRIX class from D3DXMath.H of DirectX*7, which implements the same functions the scalar way.
Please note that
this is a synthetic test, so those ratios may change when measured in application.
| Function | Scalar Code | Matrix Library | ratio |
| Vector Multiplication | 60I | 26 | 2.30 |
| Matrix Multiplication | 282I | 87 | 3.26 |
| Inverse Matrix | 328 | 170 | 1.92 |
| Make Rotation Matrix | 169 | 143II | 1.18 |
I The
scalar version of the function is not inlined, so accurate numbers should be
~10 cycles less.
II There is another version of this function that uses fast
approximations, and takes only 112 cycles.
And Finally
– A Real Example
The last example presents code for calculating an exponent of a matrix.An exponent
of a real number can be calculated using Taylor Series. In the same way, an
exponent of matrix is defined:
There are three versions for calculating the series:
The first two
versions are quite readable and can be modified easily. The third version is
written in assembly, and is therefore much harder to modify.
See the source file Exponent.cpp which is part of the MatLib.zip
file.
The next table shows the average time an iteration takes for each version:
Note that using the GPMatrix instead of scalar code gives an improvement of x2.85! Even if the scalar functions were fully inlined, the GPMatrix code would still be more than twice as fast.
Version Average Time Scalar code 371 GPMatrix code 130 Inlined assembly 144
The results of the GPMatrix version are even better than the assembly version since the compiler did a better job of optimizing the inlined functions… However, even a fully optimized assembly code (i.e. hand optimizations between functions) won’t give much better results.
Conclusion
The source files of the library and of the last example can be found in MatLib.zip.
This library demonstrates how a performance library can be written without hardly using any assembly. Usually, writing good assembly is more efficient than C/C++. In this case writing the library functions without inlined assembly allows the compiler to perform inter-procedural optimizations on your code.
Links
Calculating rotation matrix using fast approximation, is done with the sine function from the Approximate Math (AM) library, download it from http://developer.intel.com/design/pentiumiii/devtools/.
Other Resources
Two related application notes:
AP-928
- Inverse of 4x4 Matrix
AP-930
- Matrix Multiplication
Some Useful Links:
Download an evaluation copy of Intel C/C++ Compiler at http://developer.intel.com/vtune/compilers/cpp/demo.htm.
Vtune Analyzer can be used to measure the time consumed by function(s), and more. Download an evaluation copy at http://developer.intel.com/vtune/.
Haim Barad has a Ph.D. in Electrical Engineering (1987) from the University of Southern California. His areas of concentration are in 3D graphics, video and image processing. Haim was on the Electrical Engineering faculty at Tulane University before joining Intel in 1995. Haim is a staff engineer and currently leads the Media Team at Intel's Israel Design Center (IDC) in Haifa, Israel.
Copyright © 2003 CMP Media Inc. All rights reserved.