Multi-Threaded Terrain Smoothing
May 31, 2006 Page 1 of 2
Multi-core processors are going mainstream. Most PCs sold today and also all new game consoles allow the parallel execution of several hardware threads. For games programmers it would be a real shame to have all this additional computational power available and not use it. The new consoles do not seem to have out-of-order cores (see [Stokes05]). Threading is therefore the only way to really use all available execution units in parallel.
Creating a multi-threaded game engine is a challenging task (see e.g. [Gabb05]). Various game tasks like rendering, Artificial Intelligence (AI) computations, physics or collision detection often rely on the sequential availability of results from other game tasks. If game engine developers choose to run several different game tasks, like physics and rendering, in parallel (task level parallelism) it can be complicated to have required results from other tasks available in time. After all one does not want to add too much synchronization overhead. The other possibility is to work on several problems of the same kind in a data parallel way (e.g. path finding for 4 characters in parallel). To be as efficient and as scalable as possible your game engine should ideally be implemented using a mixture of task level and data level parallelism. This guarantees scalability for more cores to be available.
Concerning multi-threaded software development there are nice documents available for download that describe all relevant concepts and also most coding pitfalls (see e.g. [DevMTApps]). Also it should be noted that tools like VTune* (for finding hotspots), ThreadCecker* (for finding race conditions and other threading bugs) and ThreadProfiler* (for finding load balancing and other performance issues) are valuable tools to help your multi-threaded development effort (see [IntelTools]). The threading problems you solve on your PC with their help will most likely help you on other multi-core platforms as well.
[West06] has written an introduction presenting ideas of how to take advantage of multiple cores in game engines. This article takes his approach one step ahead and describes the use of multi-threading to realize real-time terrain height field smoothing (see Figure 1). The visible part of the terrain is smoothed in real-time every frame to achieve smooth LOD transitioning without visible geo-morphing or popping. Also this way you make sure that you spend your vertex budget for the terrain in an almost optimal way.
It has to be mentioned that the terrain smoothing described here could also be done on the graphics card using floating point render targets (see [Bunnell05]) or ‘Tessellation through Instancing’ (see [Gruen05]). Also DirectX10* and GeometryShaders are just around the corner. Still, most games are processor limited on low to medium display resolutions, so offloading the processor tasks to other cores can help. Again most games are graphics-card-limited at higher display resolutions. So with higher display resolution it makes sense to shift load away from the graphics card as you might desperately need your graphics card power for other things. Of course all this depends on how you can balance your processor and graphics card workloads. If you do rendering in multiple passes on pre DirectX10* hardware, tessellation on the processor can be faster than doing it potentially many times on the graphics card.
The use of OpenMP* (see [OpenMP]) to realize data parallel smoothing is investigated as a first approach. Since it is harder to use OpenMP* to realize asynchronous threading it is also shown how to use the Windows* threading API to realize an asynchronous tessellation that only picks up a new terrain mesh when it is done. This latter approach has no impact on the frame time at all. The algorithm described can generate balanced workloads for even a big number of cores and therefore scales nicely. You can use it to let just one thread or one free core do the heavy lifting of smoothing or distribute smaller workloads to cores that you know are not fully utilized.
Next this article discusses how a sparse terrain height field can be turned into an overall smooth surface using Bezier patches. After that it is described how culling, LOD and thread workload selection can be realized and how a big and watertight mesh can be generated. Finally the discussion on how to multi-thread the smoothing work concludes the article.
For simplicity reasons this article chooses one dimensional rectangular bi-cubic Bezier patches to smooth a sparse height field. These patches are only used to smooth the height values across the height field, the other coordinates are generated by linear interpolation. The following text assumes, because of brevity reasons, that the reader is familiar with how Bezier patches work. See [Farin96] for a well written introduction.
Figure 1 shows a 4 by 4 neighborhood within a bigger height field. The left and bottom corner of this neighborhood is at location (x,y) (both are integer values) in the height field. The height values, indicated by thin vertical lines, are meant to be different but are shown as roughly equal to make the drawing more readable. The patch p shown at the center of the height field is a representative the grid cells defined by the height field. The grid cell for p is defined by h(x+1,y+1), h(x+2,y+1), h(x+1,y+2) and h(x+2,y+2).
In order to replace p by a bi-cubic Bezier patchone needs to define the 16 control heights. These control heights are indicated by b00, b10, b20, b30, b01, b11, b21, b31, b02, b12, b22, b32, b03, b13, b23 and b33.
Because of brevity reasons – after all the focus of this article is on multi-threading – Formula 1 simply presents one algorithm to setup the control heights to achieve an overall C1 smooth terrain surface.
How does one now evaluate the surface? The key insight is that each patch defines four vertical Bezier curves defined by four control heights each. The curves are:
- Curve 0 : Defined by b00, b01, b02 and b03
- Curve 1 : Defined by b10, b11, b12 and b13
- Curve 2 : Defined by b20, b21, b22 and b23
- Curve 3 : Defined by b30, b31, b32 and b33
Each curve can be written as a cubic polynomial with coefficients that can be derived from the control heights (see [Foley90]). Forward differencing (see [Foley90]) can be used to step through each of the curves in constant steps. The implementation of the demo for this article does this with SSE (see e.g. [Klimovitski01]) compiler intrinsics. It actually steps all four curves simultaneously using instruction level parallelism. For each step, forward differencing gives us four height values. These four height values are again used as four control heights which now define a curve that runs across the patch in the horizontal direction. Again forward differencing is used to step along this horizontal curve. This time SSE is used to transform the three adds of an FFD step to one 4 dimensional add and one shuffle operation. Directional derivatives are stepped in a similar manner. It has to be noted though that the code only generates approximate derivates along the vertical direction.
Culling and LOD Selection
Now assume one has some efficient culling algorithm that quickly identifies the cells of the height field inside the current view frustum. The output of this culling algorithm is a list of potentially visible grid cells.
The goal is to tessellate patches near the viewer with more details and patches far away from the viewer with less detail. The implementation in the demo chooses a level of detail (LOD) value for every grid cell. This LOD value is based on the distance of the center of the grid cell from the viewer. The LOD is chosen as a floating point number that defines how many points the FFD algorithm generates along each edge of the patch. The non-integral part of the LOD is defined in the same way that D3D defines segment counts for curved surface patches (see [D3D05]). This ensures that a moving camera generates a very smooth view dependent LOD transitioning since there is no popping of discrete LODs.
The demo generates one big triangle strip for the whole of the tessellated terrain. It also generates extra triangles that stitch the gaps between neighboring patches. Using an LOD per edge as [D3D05] does would be more advanced because stitching triangle would not be needed. The per-cell LOD approach was used because the setup for forward differencing is easier to code this way.
Now that the smoothing mathematics, culling and mesh generation have been covered the text moves on to describe how to generate a balanced workload for multiple threads to do the actual tessellation work.
Creating a Balanced Workload for Multi-Threading
The goal is to distribute the tessellation computations collected by the culling algorithm evenly to a set of hardware threads. To distribute an even workload one needs some value that specifies how complex a certain tessellation task is. Luckily the LOD value of each grid cell describes how expensive it will be to tessellate the corresponding patch. What is also nice is that the LOD values usually fall into a very limited range. In the demo implementation LODs range from 2.0 to 30.0.
Therefore an algorithm collects all cells with the same integral LOD into a separate list and essentially sorts patches to be tessellated into buckets with roughly the same tessellation work to be done. Given a number of worker threads the code generates an initially empty workload list for every thread. Now starting from the buckets with the most complex workload, in a round robin fashion, the workload is distributed to the threads. This process is outlined by the following pseudo code Code Fragment 1.
|Code Fragment 1|
You can further enhance this algorithm by adding a per thread value that is used to weight the current thread workload when comparing it with the current maximum workload. This allows the generation of controlled uneven workloads with the same algorithm.
From the LOD value associated with every cell it is also very easy to compute how many vertices will be produced by a certain tessellation workload. When offloading workloads to several threads each thread will also be given an offset along with the starting address of a vertex buffer that has been locked by the main thread of the application. This way threads can write their tessellation results into non-overlapping memory regions.
It has been shown how to generate a workload distribution for a number of threads. Next it will be discussed how to architect multi-threaded code that picks up these workloads. First the use of OpenMP* is discussed and afterwards the use of the Windows* threading API is described.
Page 1 of 2