Gamasutra: The Art & Business of Making Games
Hardware Accelerating Art Production

March 19, 2004 | Page 3 of 3


The Direct3D Extensions Library (D3DX) provides a number of functions for software processing of SH PRT as part of the latest public release (DirectX SDK Update Summer 2003). Since this implementation is both robust and readily available, it is an ideal reference for comparison with the hardware version presented earlier.

While not exhaustive, Table 1 shows a trend in the performance between D3DX and our hardware setup for 9-component vertex PRT. There is a clear gulf between the two versions as the number of surface elements increases--something which bodes well for texture processing.

Two sets of timings are listed for the hardware (HW1 and HW2), which differ only in the number of samples taken per vertex. The former is for direct comparison with D3DX; twice as many samples are used since the hardware method samples over a sphere, so on average only half the samples contribute anything (that's quite a contrast to hemispherical sampling in software). Bear in mind that this still isn't an entirely fair comparison, as D3DX importance samples via a cosine distribution, which speeds up convergence.
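To make this sampling trade-off concrete, here is a minimal CPU sketch (not the D3DX or hardware code; the function names and the trivially "fully open" visibility test are illustrative only) comparing uniform sphere sampling, where below-horizon samples are wasted, with hemisphere sampling:

```python
import math, random

def sample_sphere(rng):
    # Uniform direction on the unit sphere: z in [-1, 1], phi in [0, 2*pi)
    z = rng.uniform(-1.0, 1.0)
    phi = rng.uniform(0.0, 2.0 * math.pi)
    r = math.sqrt(max(0.0, 1.0 - z * z))
    return (r * math.cos(phi), r * math.sin(phi), z)

def ao_estimate(normal, visible, n_samples, hemisphere, rng):
    total, used = 0.0, 0
    for _ in range(n_samples):
        d = sample_sphere(rng)
        cos_t = sum(a * b for a, b in zip(d, normal))
        if cos_t <= 0.0:
            if hemisphere:
                d = tuple(-c for c in d)  # fold onto the upper hemisphere
            else:
                continue  # sphere sampling: below-horizon sample contributes nothing
        used += 1
        total += visible(d)
    return total / max(used, 1), used

rng = random.Random(1)
normal = (0.0, 0.0, 1.0)
visible = lambda d: 1.0  # stand-in; a real simulator tests occluder hits
est_hemi, used_hemi = ao_estimate(normal, visible, 1024, True, rng)
est_sph, used_sph = ao_estimate(normal, visible, 2048, False, rng)
# used_sph comes out near 1024: roughly half of the 2048 sphere samples contribute
```

With 2048 sphere samples only around 1024 land above the horizon, which is why HW1 takes twice as many samples as D3DX for a comparable estimate.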

HW2 is supplied for a rough idea of the sort of performance to expect when previewing using a lower number of samples. The optimum will vary from scene to scene, however, depending on visibility variance.



Table 1: 9-component vertex PRT processing times -- D3DX (2048 samples), HW1 (4096 samples), HW2 (1024 samples).

These figures were recorded on an Athlon XP 2400+ PC equipped with a GeForce FX 5800, using a release build and libraries. To put the hardware version in the best light, meshes were "vertex cache order" optimized beforehand, although this improves locality of reference for the software version as well. The depth resolution for hardware processing was 512x512x8 bits in all cases.

Visually there is a minor difference in lighting (Figure 8) with hardware processing--caused by the limited storage precision of samples--which in my opinion is acceptable for previews. Precision and read-back can be traded off via the split ramp as required, but both the number of iterations and the depth map (resolution and precision) have a greater effect on the quality of the results.
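As a rough illustration of the kind of precision trade-off a split ramp enables, the following sketch (names hypothetical, not the shader implementation) encodes a value across a coarse and a fine 8-bit channel and compares the reconstruction error against a single 8-bit ramp:

```python
def split_encode(x):
    # Quantize x in [0, 1) to two 8-bit channels: a coarse ramp, plus a
    # fine ramp covering the remainder within one coarse step.
    hi = min(int(x * 256.0), 255)
    lo = min(int((x * 256.0 - hi) * 256.0), 255)
    return hi, lo

def split_decode(hi, lo):
    return (hi + lo / 256.0) / 256.0

x = 0.123456
hi, lo = split_encode(x)
err_split = abs(split_decode(hi, lo) - x)          # ~16-bit quantization error
err_single = abs(min(int(x * 256.0), 255) / 256.0 - x)  # 8-bit quantization error
```

Reading back both channels and recombining on the host recovers roughly 16 bits of precision from 8-bit render targets, at the cost of extra passes or channels.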

Figure 8: Output from software and hardware processing (respectively), (left) D3DX, (right) HW1

The only major differences are the shadows under the eyelids (Figure 9), which are not captured by the hardware version. This is due to the limited depth precision, which can be increased as described earlier. A reduced depth bias combined with nudging vertices outwards (along the normal) may also improve accuracy. It should also be noted that the cyan color, present in transition areas between the red and white lights, is due to the low-order SH lighting approximation.

Figure 9: A close-up showing the lack of depth resolution in the hardware implementation compared to software: (left) D3DX, (right) HW1. Note the missing shadow under the eyelids caused by the 8-bit depth map.

[Sloan04] reports similar results to these with floating point hardware. His accelerated PRT implementation--essentially the same as the one presented in this feature but using version 2.0 shaders and high-precision buffers--will be available in an upcoming DirectX SDK Update along with significant optimizations to the software simulator. The revised API will also be more modular, enabling (among other things) general CPCA compression of user-created data, as produced here.


It might also be possible to map one or more of the versions presented earlier onto fixed-function hardware. Whether such cards have the necessary power to outdo a clever software implementation is unclear, however, and operations may need to be split over extra passes, cutting performance further.

One trick that may improve vertex AO performance is adaptive ray sampling. A simple form of this would be to examine vertices after some fixed number of iterations, find those that appear to be fully visible (or converged to within some margin of error) and then, through a dynamic index buffer or similar bookkeeping, spare these vertices further processing. A more general solution could look at differences between blocks of iterations, terminating at a given error threshold. It's an interesting idea, but the extra communication between host and GPU may counter any potential speed gain.
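A minimal CPU sketch of such an adaptive scheme might look as follows; `trace_block` stands in for a block of GPU visibility samples, and the retirement test is the simple consecutive-block comparison described above (all names are hypothetical):

```python
import random

def adaptive_ao(vertices, trace_block, block_size, max_blocks, threshold):
    """Accumulate visibility samples per vertex in blocks, retiring any
    vertex whose running estimate moves by less than `threshold` between
    consecutive blocks (akin to shrinking a dynamic index buffer)."""
    active = list(range(len(vertices)))
    sums = [0.0] * len(vertices)
    counts = [0] * len(vertices)
    prev = [None] * len(vertices)
    for _ in range(max_blocks):
        still = []
        for v in active:
            sums[v] += trace_block(vertices[v], block_size)
            counts[v] += block_size
            est = sums[v] / counts[v]
            if prev[v] is None or abs(est - prev[v]) >= threshold:
                still.append(v)  # estimate still moving: keep sampling
            prev[v] = est
        active = still
        if not active:
            break
    return [sums[v] / counts[v] for v in range(len(vertices))], counts

rng = random.Random(7)
def trace_block(v, n):
    # Stand-in for GPU occlusion sampling: vertex 0 is fully open,
    # vertex 1 has noisy ~60% visibility.
    if v == 0:
        return float(n)
    return float(sum(rng.random() < 0.6 for _ in range(n)))

estimates, counts = adaptive_ao([0, 1], trace_block, 64, 16, 0.01)
# vertex 0 retires after two blocks; vertex 1 generally keeps sampling longer
```

On the GPU the bookkeeping (running estimates, retirement flags) is exactly the extra host communication the paragraph above warns about.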

Other Applications

The following are a few other examples of hardware-accelerated preprocesses, some of which have already been employed in game development, and others that could be used.


Coombe et al. [Coombe03] map a progressive radiosity method almost completely onto the GPU, and in the process their method solves a couple of the classic problems with hemi-cubes. Firstly, the hemi-cube faces no longer need to be read back to main memory for software processing of delta form factors. Rather than iterating over the faces and randomly updating elements--an infeasible task with current hardware--surface elements, maintained directly in textures, are instead inverse-transformed to the ID hemi-cube faces. Secondly, since IDs now only need to be assigned to patches rather than elements, aliasing--a problem with hemi-cubes when shooting--is also reduced.

The paper is an enlightening read, with clever tricks devised for face shooting selection and adaptive subdivision. From the point of view of acceleration, their system reaches a solution rapidly for simple scenes and intermediate results are available at any point due to the progressive nature and texture storage.

Static SH Volume

Max Payne 2 uses a static volume of SH estimates over a game level, for environmental lighting of models at run-time. Remedy accelerates the offline computation [Lehtinen04]--in particular the SH projection of environment maps through ps2.0 shaders and floating-point storage.

At a given grid point, the pre-lit scene is first rendered to an HDR cube-map. Face texels are then multiplied by the current basis function evaluated in the corresponding direction. Naturally, since these directional terms are fixed for all cubes, they can be pre-calculated and read from a texture. The resulting values are then accumulated via reductive summing of groups of four neighboring texels through multiple passes--a process akin to repeated box filtering, just without the averaging. The final level is then read back and scaled appropriately, yielding an SH coefficient. Projection is repeated for all coefficients and rendering for all grid points.
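The reductive summing step can be sketched on the CPU as follows; the `radiance` and `basis` lambdas are stand-ins for the HDR face texels and the pre-calculated basis texture (this illustrates the multi-pass reduction only, not Remedy's shader code):

```python
import math

def reduce_sum(grid):
    # One pass: sum each 2x2 block of neighboring texels, halving resolution
    # (box filtering without the averaging).
    n = len(grid) // 2
    return [[grid[2*y][2*x] + grid[2*y][2*x+1] +
             grid[2*y+1][2*x] + grid[2*y+1][2*x+1]
             for x in range(n)] for y in range(n)]

def project_face(radiance, basis, size):
    # Multiply each texel by the basis function evaluated in its direction,
    # then reduce to a single value over log2(size) passes.
    grid = [[radiance(x, y) * basis(x, y) for x in range(size)]
            for y in range(size)]
    while len(grid) > 1:
        grid = reduce_sum(grid)
    return grid[0][0]

size = 8
radiance = lambda x, y: 1.0 + 0.25 * x        # stand-in HDR face texels
basis = lambda x, y: math.cos(0.1 * (x + y))  # stand-in SH basis * solid angle
coeff = project_face(radiance, basis, size)
direct = sum(radiance(x, y) * basis(x, y)
             for x in range(size) for y in range(size))
# the multi-pass reduction matches the direct per-texel sum
```

On hardware each `reduce_sum` pass is a render to a half-resolution target, so a 64x64 face collapses to the final coefficient in six passes before a single-texel read-back.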

Normal Mapping

Wang et al. [Wang03] describe an image-space method to accelerate normal map processing using graphics hardware, which has similarities to the first AO scheme described earlier. The reference mesh is rendered from a number of viewpoints and depth comparisons are made in software to determine the nearest surface point and normal.

The authors work around the problem of hidden surface points in complex meshes by falling back to interpolating target triangle normals. The process outlined in the paper relies on reading back the frame buffer, containing the reference normals, and the depth buffer. It's quite possible that vertex and fragment programs could be used, as in the case study, to move more of the work to the GPU, thereby accelerating the process further.

Christian Seger, author of ORB, takes a different approach to normal map acceleration [Seger03] that avoids the issue of hidden surface points. For a given triangle of the LP model, triangles from the HP reference mesh within a search region are first of all culled. Rather than simply planar projecting these triangles, they are shrunk down based on the interpolated normal of the target triangle. In ray-tracing terms this emulates normal, rather than nearest point, sampling, which reduces artifacts--see [Sander00]. Using the coordinates calculated from the shrinking process, hardware then renders these triangles to the normal map, with stenciling used to clip away any texels outside of the target triangle in texture-space.


Conclusion

This feature, through the case study and other examples, has hopefully convinced you that programmable graphics hardware can play a role in developing more responsive art tools. Furthermore, the potential for acceleration is not restricted to the very latest floating-point graphics processors.

It's true that software is ultimately more general, often easier to debug and extend, and may offer greater accuracy. But GPU restrictions are falling away, shader debugging is becoming easier and, as the results show, high numerical accuracy isn't a prerequisite for previews.

When a process can be mapped efficiently onto graphics hardware, the speed increase can be significant, and GPUs are scaling up faster than CPUs in terms of raw performance. The implementation can also be simpler than a comparable software version when the latter needs extra data structures, algorithms and lower-level optimizations to cut processing time.

Future shader versions will clear the way for mapping a larger class of algorithms onto graphics processors and higher-level abstractions such as BrookGPU are another welcome development. Faster communication through PCI-Express will also make hybrid solutions more viable.


Acknowledgements

I would like to thank Simon Brown, Willem de Boer, Heine Gundersen, Peter McNeill, David Pollak, Richard Sim and Neil Wakefield for comments and support; Peter-Pike Sloan and Rune Vendler for detailed feedback, information and ideas; Jaakko Lehtinen and Christian Seger for describing their respective hardware processing schemes; the guys at Media Mobsters for regular testing of unreliable code on a range of GPUs; Simon Green and NVIDIA for the Ogre stills and permission to use the head mesh, modeled by Steve Burke; and Microsoft for permission to publish details of upcoming D3DX features.


References

[Brown03] Brown, S, How To Fix The DirectX Rasterisation Rules, 2003.
[Cignoni98] Cignoni, P, Montani, C, Rocchini, C, Scopigno, R, A general method for preserving attribute values on simplified meshes, IEEE Visualization, 1998.
[Coombe03] Coombe, G, Harris, M J, Lastra, A, Radiosity on Graphics Hardware, June 2003.
[Forsyth03] Forsyth, T, Spherical Harmonics in Actual Games, GDC Europe 2003.
[James03] James, G, Rendering Objects as Thick Volumes, ShaderX2 2003.
[Landis02] Landis, H, Production-Ready Global Illumination, Siggraph course notes #16, 2002.
[Lehtinen04] Lehtinen, J, personal communication, 2004.
[Purvis03] Purvis, I, Tokheim, L, Real-Time Ambient Occlusion, 2003.
[Sander00] Sander, P, Gu, X, Gortler, S J, Hoppe, H, Snyder, J, Silhouette Clipping, Siggraph 2000.
[Seger03] Seger, C, personal communication, 2003.
[Sloan02] Sloan, P-P, Kautz, J, Snyder, J, Precomputed Radiance Transfer for Real-Time Rendering in Dynamic, Low-Frequency Lighting Environments, Siggraph 2002.
[Sloan03] Sloan, P-P, Hall, J, Hart, J, Snyder, J, Clustered Principal Components for Precomputed Radiance Transfer, Siggraph 2003.
[Sloan04] Sloan, P-P, personal communication, 2004.
[Wang03] Wang, Y, Fröhlich, B, Göbel, M, Fast Normal Map Generation for Simplified Meshes, Journal of Graphics Tools, Vol. 7 No. 4, 2003.
[Whitehurst03] Whitehurst, A, Depth Map Based Ambient Occlusion Lighting, 2003.

Additional Reading

Advanced Global Illumination. Philip Dutré, Philippe Bekaert, Kavita Bala, AK Peters 2003.

"Spherical Harmonics, The Gritty Details", Robin Green, GDC 2003.

Practical Precomputed Radiance Transfer, Peter-Pike Sloan, ShaderX2, 2003.

Ambient Occlusion, Matt Pharr, Simon Green, GPU Gems: Programming Techniques, Tips, and Tricks for Real-Time Graphics, Addison Wesley 2004.

General-Purpose Computation on GPUs.

