|
Figure 4. Breaking up a 2D grid into tasks and an
individual cell adjacent to other tasks.
For example, in the simulation there is a diffusion step
followed by a force update step and later, an advection step. At each step the
grid is broken into pieces, and the tasks write their data to a shared output
grid, which becomes the input into the next step. In most cases, this approach
works well.
However, a problem can arise if the calculation uses its own
results in the current step. That is, the output of cell 1 is used to calculate
cell 2. Looking at Figure 4, it is obvious that certain cells, like the shaded
cell, will be processed in a task separate from the task in which their
immediate neighbors are processed.
Since each task uses the same input, the
neighboring cells' data will be different in the multi-threaded version
compared to the serial version. Whether this is an issue depends on which
portion of the simulation we are in.
Inaccuracies in the results in the
diffusion step can occur, but they are not noticeable and don't introduce
destabilizing effects. In most cases the inaccuracies are smoothed out by
subsequent iterations over the data and have no lasting impact on the
simulation. This kind of behavior is acceptable because the primary concern is
visual appeal, not numerical accuracy.
However, in other cases inaccuracies can
accumulate, introducing a destabilizing effect on the simulation, which can
quickly destroy any semblance of normal behavior. An example of this occurs
while processing the edge cases in diffusion-where diffusion is calculated
along the edges of the grid or cube.
When run in parallel, the density does not
diffuse out into the rest of the volume and instead accumulates at the edges,
forming a solid border. For
this situation and similar ones, we simply run that
portion serially.
The main computational load of the simulation occurs when
each step loops over the grid. These are all nested loops, so a straightforward
way to break them up is by using the "parallel for" construction of TBB. In the
example case of density diffusion, the code went from the serial version shown
in Figure 3 to the parallel version shown in Figures 5 and 6.
Figure 5. Using Intel Threading Building Blocks to call
the diffusion algorithm.
Figure 5 shows how TBB is used to call the diffusion
algorithm. The arguments uBegin and uEnd are the start and end values of the
total range to process, and uGrainSize is how small to break the range into.
For example, if uBegin is 0 and uEnd is 99, the total range is 100 units. If
uGrainSize is 10, 10 tasks will be submitted, each of which will process a
range of 10.
Figure 6 shows the
actual algorithm called by the TBB task manager. The range to process is passed
in, and the input and output grids are variables common to each job. As a
result of using multi-threading, the performance in frames per second was
improved by 3.76× when going from one thread to eight threads on an Intel Core
i7 processor.
Figure 6. A code
sample that shows 3D, multi-threaded diffusion.
Rendering
Although the goal of the project was to take unused CPU
cycles and use them to compute a CFD simulation, the results won't have much
impact without a nice, visual representation.
There are many methods of
rendering volumetric data, but we like the results we got using the ray casting
method detailed in Real-Time Volume Graphics.3 Using
this method, the volumetric data is copied into a 3D texture, which is used in
a pixel shader on the GPU.
The simulation volume is rendered as a cube at the
position of the volume in the 3D scene. For each visible pixel, a ray is cast
from the camera to this pixel on the cube. The corresponding point in the 3D
texture is found and sampled and the ray marches through the texture at a fixed
step size, accumulating the density data as it goes.
Conclusion
With Kaboom, we created a modular fluid simulation that
adds to the existing knowledge base. It operates in 3D, is independent of the
rendering portion, and is multi-threaded to take advantage of multi-core
processors.
Of course, there is still plenty of future work that someone can do
with this code sample. On the simulation side, a new or modified algorithm
could be used to remove the serial requirements.
Also, we found that a section
of the code that does bilinear interpolation calculations takes up a
significant amount of processing time, which could be reduced by storing the
results in a lookup table.
Another interesting avenue to explore is the introduction
of arbitrary boundaries. Since the simulation is running on the CPU it has
access to and can react with the geometry data of the scene. The rendering
could be enhanced with additional effects, such as shadowing or dynamic
lighting.
Resources
For more information about Intel Threading Building Blocks,
visit http://www.threadingbuildingblocks.org
3 Engel, Klaus, ed. Real-Time
Volume Graphics, Wellesley,
Massachusetts: A. K. Peters, Ltd., 2006.
|
So far I have found:
tbb.dll (Thread Building Blocks)
MSVCP80.dll (Some version of MSVC runtime)
Not knowing where to get them, I am unable to run the demo.
Cheers,
Jeff
Running the Application
-----------------------
Redistributables
----------------
If Visual Studio 2008 SP1 and the DirectX SDK (November 2008) are not installed, you
will need to download and install the following redistributables:
Visual C++ 9
http://www.microsoft.com/downloads/details.aspx?familyid=A5C84275-3B97-4AB7-A40D
-3802B2AF5FC2&displaylang=en
DirectX November 2008
http://www.microsoft.com/DOWNLOADS/details.aspx?displaylang=en&FamilyID=886acb56
-c91a-4a8e-8bb8-9f20f1244a8e
Intel(R) Threading Building Blocks
----------------------------------
Download a release of TBB from: http://www.threadingbuildingblocks.org/
and copy tbb.dll and tbbmalloc.dll from the download to the same directory containing
FluidRender2D.exe and FluidRender3D.exe.
We wanted to highlight a parallel programming approach without the need for a custom thread pool model. Often times such implementations become designed around functional versus task based approaches that otherwise artificially limit (undersubscribe) available CPU cores. The task based approach to parallelization taken here exploits TBB's ability to fully subscribe available CPU cores, taking equal advantage of available CPU horsepower on today's multi-core processors and beyond. TBB is an excellent fit here since it's Cilk-based task stealing approach lets the simulation scale with available CPU cores.
Cheers,
Jeff
Software Engineer/Visual Computing Software Division/Intel