Examples
This section provides some examples of Streaming SIMD Extensions usage, compared with equivalent scalar code.

Using SIMD code
First, let's take a look at one of the most obvious and fundamental operators in the library: multiplying a vector by a matrix.

Scalar Code
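As a rough illustration of the two approaches (not the library's own listing), the scalar and SIMD versions might be sketched as follows with SSE intrinsics from `<xmmintrin.h>`. The `Vec4` and `Mat4` types and the function names `mul_scalar` and `mul_simd` are hypothetical and are not taken from the Matrix Library. The sketch assumes a row vector multiplied by a row-major 4x4 matrix, and its line numbering need not match the description that follows.

```cpp
#include <xmmintrin.h>  // SSE (Streaming SIMD Extensions) intrinsics

// Hypothetical types for illustration only; not the Matrix Library's classes.
struct Vec4 { float x, y, z, w; };
struct Mat4 { float m[4][4]; };  // assumed row-major: m[row][column]

// Scalar version: 16 multiplications and 12 additions.
Vec4 mul_scalar(const Vec4& v, const Mat4& a) {
    Vec4 r;
    r.x = v.x * a.m[0][0] + v.y * a.m[1][0] + v.z * a.m[2][0] + v.w * a.m[3][0];
    r.y = v.x * a.m[0][1] + v.y * a.m[1][1] + v.z * a.m[2][1] + v.w * a.m[3][1];
    r.z = v.x * a.m[0][2] + v.y * a.m[1][2] + v.z * a.m[2][2] + v.w * a.m[3][2];
    r.w = v.x * a.m[0][3] + v.y * a.m[1][3] + v.z * a.m[2][3] + v.w * a.m[3][3];
    return r;
}

// SIMD version: 4 packed multiplications and 3 packed additions.
Vec4 mul_simd(const Vec4& v, const Mat4& a) {
    // Expand each vector element across a register and scale one matrix row.
    __m128 rx = _mm_mul_ps(_mm_set1_ps(v.x), _mm_loadu_ps(a.m[0]));
    __m128 ry = _mm_mul_ps(_mm_set1_ps(v.y), _mm_loadu_ps(a.m[1]));
    __m128 rz = _mm_mul_ps(_mm_set1_ps(v.z), _mm_loadu_ps(a.m[2]));
    __m128 rw = _mm_mul_ps(_mm_set1_ps(v.w), _mm_loadu_ps(a.m[3]));

    // The sum of the four scaled rows already holds (x, y, z, w) in order,
    // so no rearrangement of the result is needed.
    __m128 sum = _mm_add_ps(_mm_add_ps(rx, ry), _mm_add_ps(rz, rw));

    float out[4];
    _mm_storeu_ps(out, sum);
    Vec4 r = { out[0], out[1], out[2], out[3] };
    return r;
}
```

Unaligned loads and stores (`_mm_loadu_ps`, `_mm_storeu_ps`) are used here so that the sketch makes no alignment assumption; code that keeps its rows 16-byte aligned could use the aligned variants instead.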
If we take a closer look at the scalar multiplication process, we can see that the whole vector can be calculated at once. In the scalar code, Vec.x is multiplied with the first four elements of the matrix. Those four elements form the first line of the matrix and are already placed in one SIMD variable, so we only need to expand the X element of the vector and multiply it with the first line of the matrix. This is done in the first assignment (third line) of the SIMD code. Next, we multiply the expanded Y, Z and W elements of the vector with the second, third and fourth lines of the matrix respectively, as for the first element.

Note that the X element of the result vector is the first element in the sum of the four vectors calculated before, the Y element is the second, and so on. Therefore, the final result is just the sum of all four vectors, and you don’t even need to rearrange the results.

This equivalent SIMD code takes only 4 multiplications and 3 additions. Since one SIMD calculation takes about half the time of four scalar calculations, the SIMD code runs more than twice as fast as the scalar code.

Some Numbers
This table shows the performance gain of using the Matrix Library, compared to Microsoft*’s D3DXMATRIX class from D3DXMath.H of DirectX* 7, which implements the same functions the scalar way. Please note that this is a synthetic test, so those ratios may change when measured in an application.
The scalar version of the function is not inlined, so accurate numbers should be ~10 cycles less.

And Finally – A Real Example
There are three versions for calculating the series:
The third version, which was written using inlined assembly, took much more time to write: about an hour or two. In addition, this version is not fully optimized; the optimized functions are just placed end to end, without cross-procedural optimizations.

The first two versions are quite readable and can be modified easily. The third version is written in assembly, and is therefore much harder to modify. See the source file Exponent.cpp, which is part of the MatLib.zip file.

The next table shows the average time an iteration takes for each version:

Note that using the GPMatrix instead of scalar code gives an improvement of 2.85x! Even if the scalar functions were fully inlined, the GPMatrix code would still be more than twice as fast. The results of the GPMatrix version are even better than those of the assembly version, since the compiler did a better job of optimizing the inlined functions. However, even fully optimized assembly code (i.e. with hand optimizations between functions) won’t give much better results.

Conclusion
The source files of the library and of the last example can be found in MatLib.zip. This library demonstrates how a performance library can be written with hardly any assembly. Usually, well-written assembly is more efficient than C/C++. In this case, writing the library functions without inlined assembly allows the compiler to perform inter-procedural optimizations on your code.

Links
Calculating a rotation matrix using fast approximation is done with the sine function from the Approximate Math (AM) library. Download it from http://developer.intel.com/design/pentiumiii/devtools/.

Other Resources
Some Useful Links:
Download an evaluation copy of the Intel C/C++ Compiler at http://developer.intel.com/vtune/compilers/cpp/demo.htm.
VTune Analyzer can be used to measure the time consumed by function(s), and more. Download an evaluation copy at http://developer.intel.com/vtune/.

Haim Barad has a Ph.D. in Electrical Engineering (1987) from the University of Southern California. His areas of concentration are in 3D graphics, video and image processing. Haim was on the Electrical Engineering faculty at Tulane University before joining Intel in 1995. Haim is a staff engineer and currently leads the Media Team at Intel's Israel Design Center (IDC) in Haifa, Israel.