By Steve Hill

Gamasutra
March 18, 2004

Hardware Accelerating Art Production

Results

The Direct3D Extensions Library (D3DX) provides a number of functions for software processing of SH PRT as part of the latest public release (DirectX SDK Update Summer 2003). Since this implementation is both robust and readily available, it is an ideal reference for comparison with the hardware version presented earlier.

While not exhaustive, Table 1 shows a trend in the performance between D3DX and our hardware setup for 9-component vertex PRT. There is a clear gulf between the two versions as the number of surface elements increases--something which bodes well for texture processing.

Two sets of timings are listed for the hardware (HW1 and HW2), which differ only in the number of samples taken per vertex. The former is for direct comparison with D3DX; twice as many samples are used since the hardware method samples over a sphere, so on average only half the samples contribute anything (that's quite a contrast to hemispherical sampling in software). Bear in mind that this still isn't an entirely fair comparison, as D3DX importance samples via a cosine distribution, which speeds up convergence.
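To illustrate why the sample counts differ, here is a minimal CPU sketch that compares uniform sampling over a sphere with cosine-weighted sampling over a hemisphere when estimating ambient occlusion at a single point. It is not the D3DX or hardware implementation: the function names and the toy `visible` test (a blocker hiding every direction with x > 0.5) are assumptions made purely for illustration.

```python
import math
import random


def uniform_sphere(rng):
    """Return a unit vector uniformly distributed over the whole sphere."""
    z = 1.0 - 2.0 * rng.random()
    r = math.sqrt(max(0.0, 1.0 - z * z))
    phi = 2.0 * math.pi * rng.random()
    return (r * math.cos(phi), r * math.sin(phi), z)


def cosine_hemisphere(rng):
    """Return a unit vector about +z with density cos(theta) / pi."""
    u1, u2 = rng.random(), rng.random()
    r = math.sqrt(u1)
    phi = 2.0 * math.pi * u2
    return (r * math.cos(phi), r * math.sin(phi), math.sqrt(1.0 - u1))


def visible(d):
    """Hypothetical visibility test: a blocker hides directions with x > 0.5."""
    return d[0] <= 0.5


def ao_sphere(n, rng):
    """Uniform sphere sampling; directions below the surface contribute nothing."""
    total, useful = 0.0, 0
    for _ in range(n):
        d = uniform_sphere(rng)
        if d[2] <= 0.0:
            continue  # behind the surface (normal is +z): a wasted sample
        useful += 1
        if visible(d):
            total += d[2]  # cosine term
    # pdf is 1 / (4 pi) and AO carries a 1 / pi factor, so the weight is 4 / n.
    return 4.0 * total / n, useful


def ao_cosine(n, rng):
    """Cosine-weighted hemisphere sampling; the cosine cancels with the pdf."""
    hits = sum(1 for _ in range(n) if visible(cosine_hemisphere(rng)))
    return hits / n
```

Calling `ao_sphere(4096, rng)` and `ao_cosine(2048, rng)` with an `rng = random.Random(...)` should give similar estimates, with roughly half of the sphere samples reported as useful. The cosine-weighted version also tends to be less noisy per useful sample, which is the convergence advantage mentioned above.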

HW2 is supplied for a rough idea of the sort of performance to expect when previewing using a lower number of samples. The optimum will vary from scene to scene, however, depending on visibility variance.

Model      Vertices   D3DX (2048 samples)   HW1 (4096 samples)   HW2 (1024 samples)   D3DX/HW1
shapes1    2814       14.27s                4.94s                1.36s                2.9x
head       10596      151.66s               13.73s               3.52s                11.0x
skullocc   31076      581.95s               37.77s               9.49s                15.4x

Table 1: Results (processing time for 9-component vertex PRT)
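As a quick check on the last column, the following sketch recomputes the D3DX/HW1 ratios from the timings in Table 1 and also derives vertices processed per second. The throughput figures are not in the original table; they are simply derived from it, and the names `TABLE_1`, `speedups` and `vertices_per_second` are made up for this example.

```python
# (model, vertices, D3DX seconds, HW1 seconds, HW2 seconds), copied from Table 1.
TABLE_1 = [
    ("shapes1", 2814, 14.27, 4.94, 1.36),
    ("head", 10596, 151.66, 13.73, 3.52),
    ("skullocc", 31076, 581.95, 37.77, 9.49),
]


def speedups(rows):
    """D3DX time divided by HW1 time, rounded to one decimal place."""
    return {name: round(d3dx / hw1, 1) for name, _, d3dx, hw1, _ in rows}


def vertices_per_second(rows):
    """Per-model throughput as a (D3DX, HW1) pair."""
    return {name: (verts / d3dx, verts / hw1) for name, verts, d3dx, hw1, _ in rows}


if __name__ == "__main__":
    for name, ratio in speedups(TABLE_1).items():
        d3dx_rate, hw1_rate = vertices_per_second(TABLE_1)[name]
        print(f"{name}: {ratio}x, D3DX {d3dx_rate:.0f} v/s, HW1 {hw1_rate:.0f} v/s")
```

On these three meshes, the derived throughput falls for D3DX as the vertex count grows and rises for HW1, which is another way of reading the widening gap in the ratio column.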

These figures were recorded on an Athlon XP 2400+ PC equipped with a GeForce FX 5800, using a release build and release libraries. To put the hardware version in the best light, meshes were "vertex cache order" optimized beforehand, although this improves locality of reference for software as well. The depth resolution for hardware processing was 512x512x8 bits in all cases.

Visually there is a minor difference in lighting (Figure 8) with hardware processing--caused by the limited storage precision of samples--which in my opinion is acceptable for previews. Precision and read-back can be traded off via the split ramp as required, but both the number of iterations and the depth map (resolution and precision) have a greater effect on the quality of the results.

Figure 8: Output from software and hardware processing (respectively), (left) D3DX, (right) HW1

The only major differences are the shadows under the eyelids (Figure 9), which are not captured by the hardware version. This is due to the limited depth precision, which can be increased as described earlier. A reduced depth bias combined with nudging vertices outwards (along the normal) may also improve accuracy. It should also be noted that the cyan color, present in transition areas between the red and white lights, is due to the low-order SH lighting approximation.

Figure 9: A close-up showing the lack of depth resolution in the hardware implementation compared to software: (left) D3DX, (right) HW1 - note the missing shadow under the eyelids caused by the 8-bit depth map.
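The effect of depth precision on thin gaps such as the eyelids can be sketched with a toy depth-map comparison. This is only an illustration under stated assumptions: depths are normalized to [0, 1], the bias is chosen to be about one quantization step, and `quantize` and `in_shadow` are hypothetical helpers rather than part of the implementation described in this feature.

```python
def quantize(depth, bits):
    """Store a normalized depth in [0, 1] using the given number of bits."""
    levels = (1 << bits) - 1
    return round(depth * levels) / levels


def in_shadow(occluder_depth, receiver_depth, bits, bias):
    """Depth-map test: the receiver is shadowed if it lies beyond the stored depth plus bias."""
    return quantize(occluder_depth, bits) + bias < receiver_depth


if __name__ == "__main__":
    occluder, receiver = 0.400, 0.403  # a thin gap, e.g. eyelid over eye (assumed values)
    # Bias of roughly one 8-bit step (1/255) versus a much smaller bias at 16 bits.
    print("8-bit: ", in_shadow(occluder, receiver, bits=8, bias=0.004))
    print("16-bit:", in_shadow(occluder, receiver, bits=16, bias=0.0001))
```

With these assumed values the gap is smaller than the 8-bit bias, so the occluder is missed, while the higher-precision map with a smaller bias catches it. This is consistent with the suggestion that more depth precision or a reduced bias may recover the missing shadows.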

[Sloan04] reports similar results to these with floating point hardware. His accelerated PRT implementation--essentially the same as the one presented in this feature but using version 2.0 shaders and high-precision buffers--will be available in an upcoming DirectX SDK Update along with significant optimizations to the software simulator. The revised API will also be more modular, enabling (among other things) general CPCA compression of user-created data, as produced here.

Extensions

It might also be possible to map one or more of the versions presented earlier onto fixed-function hardware. Whether such cards have the necessary power to outdo a clever software implementation is unclear, however, and operations may need to be split over extra passes, cutting performance further.

One trick that may improve vertex AO performance is adaptive ray sampling. A simple form of this would be to examine vertices after some fixed number of iterations, find those that appear to be fully visible or within some margin of error, then through a dynamic index buffer (or other method) and extra book-keeping, spare these vertices further processing. A more general solution could look at differences between blocks of iterations, terminating at a given error threshold. It's an interesting idea, but extra communication between host and GPU may counter any potential speed gain.
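The more general variant, comparing estimates between blocks of iterations, might look something like the CPU-side sketch below. Everything here is an assumption for illustration: `true_visibility` stands in for the real depth-map tests, the `active` list plays the role of the dynamic index buffer, and the block size and threshold are arbitrary.

```python
import random


def adaptive_ao(true_visibility, block=64, max_blocks=32, threshold=0.01, seed=1):
    """Estimate per-vertex visibility, retiring vertices whose estimate has settled."""
    rng = random.Random(seed)
    n = len(true_visibility)
    hits = [0] * n
    samples = [0] * n
    estimate = [0.0] * n
    active = list(range(n))  # stands in for the dynamic index buffer
    for _ in range(max_blocks):
        if not active:
            break
        still_active = []
        for v in active:
            # Hypothetical stand-in for one block of depth-map visibility tests.
            hits[v] += sum(rng.random() < true_visibility[v] for _ in range(block))
            samples[v] += block
            new_estimate = hits[v] / samples[v]
            # Always run at least two blocks, then stop once the change is small.
            if samples[v] == block or abs(new_estimate - estimate[v]) > threshold:
                still_active.append(v)
            estimate[v] = new_estimate
        active = still_active
    return estimate, samples
```

A fully visible vertex is retired after two blocks, while partially occluded vertices usually keep sampling for longer. Note that this simple stopping rule can also terminate early by chance, so a practical version would probably want a more conservative test, and the book-keeping shown here is exactly the extra host work that may offset the gain.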

Other Applications

The following are a few other examples of hardware-accelerated preprocesses, some of which have already been employed in game development, and others that could be used.

Radiosity

Coombe et al. [Coombe03] map a progressive radiosity method almost completely onto the GPU and in the process their method solves a couple of the classic problems with hemi-cubes. Firstly the hemi-cube faces no longer need to be read back to main memory for software processing of delta form factors. Rather than iterating over the faces and randomly updating elements--an infeasible task with current hardware--surface elements, maintained directly in textures, are instead inverse-transformed to the ID hemi-cube faces. Secondly, since IDs now only need to be assigned to patches rather than elements, aliasing--a problem with hemi-cubes when shooting--is also reduced.

The paper is an enlightening read, with clever tricks devised for face shooting selection and adaptive subdivision. From the point of view of acceleration, their system reaches a solution rapidly for simple scenes and intermediate results are available at any point due to the progressive nature and texture storage.

Static SH Volume

Max Payne 2 uses a static volume of SH estimates over a game level, for environmental lighting of models at run-time. Remedy accelerates the offline computation [Lehtinen04]--in particular the SH projection of environment maps through ps2.0 shaders and floating-point storage.

At a given grid point, the pre-lit scene is first rendered to an HDR cube-map. Face texels are then multiplied by the current basis function evaluated in the corresponding direction. Naturally, since these directional terms are fixed for all cubes, they can be pre-calculated and read from a texture. The resulting values are then accumulated via reductive summing of groups of four neighboring texels through multiple passes--a process akin to repeated box filtering, just without the averaging. The final level is then read back and scaled appropriately, yielding an SH coefficient. Projection is repeated for all coefficients and rendering for all grid points.
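The reductive summing step can be sketched on the CPU as follows. This is a minimal illustration, not Remedy's shader code: it assumes a single square, power-of-two cube-map face stored as nested lists, a hypothetical precomputed `basis_weight` texture of the same size, and it leaves out the final scaling.

```python
def reduce_sum_2x2(texels):
    """One pass: each output texel is the sum (not the average) of a 2x2 input group."""
    half = len(texels) // 2
    return [
        [
            texels[2 * y][2 * x] + texels[2 * y][2 * x + 1]
            + texels[2 * y + 1][2 * x] + texels[2 * y + 1][2 * x + 1]
            for x in range(half)
        ]
        for y in range(half)
    ]


def project_face(radiance, basis_weight):
    """Multiply a face by precomputed basis weights, then reduce it to a single value."""
    size = len(radiance)
    level = [
        [radiance[y][x] * basis_weight[y][x] for x in range(size)]
        for y in range(size)
    ]
    while len(level) > 1:  # one iteration per rendering pass
        level = reduce_sum_2x2(level)
    return level[0][0]
```

For an N x N face this takes log2(N) passes, and only the final 1x1 level would need to be read back. One coefficient would require running `project_face` over all six faces with that coefficient's weights and adding the results before scaling.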

Normal Mapping

Wang et al. [Wang03] describe an image-space method to accelerate normal map processing using graphics hardware, which has similarities to the first AO scheme described earlier. The reference mesh is rendered from a number of viewpoints and depth comparisons are made in software to determine the nearest surface point and normal.

The authors work around the problem of hidden surface points in complex meshes by falling back to interpolating target triangle normals. The process outlined in the paper relies on reading back the frame buffer, containing the reference normals, and the depth buffer. It's quite possible that vertex and fragment programs could be used, as in the case study, to move more of the work to the GPU, thereby accelerating the process further.

Christian Seger, author of ORB, takes a different approach to normal map acceleration [Seger03] that avoids the issue of hidden surface points. For a given triangle of the LP model, triangles from the HP reference mesh within a search region are first of all culled. Rather than simply planar projecting these triangles, they are shrunk down based on the interpolated normal of the target triangle. In ray-tracing terms this emulates normal, rather than nearest point, sampling, which reduces artifacts--see [Sander00]. Using the coordinates calculated from the shrinking process, hardware then renders these triangles to the normal map, with stenciling used to clip away any texels outside of the target triangle in texture-space.

Conclusion

This feature, through the case study and other examples, has hopefully convinced you that programmable graphics can play a role in developing more responsive art tools. Furthermore, the potential for acceleration is not restricted to the very latest floating point graphics processors.

It's true that software is ultimately more general, often easier to debug and extend and may offer greater accuracy. But GPU restrictions are falling away, shader debugging is becoming easier and as the results show, high numerical accuracy isn't a prerequisite for previews.

When a process can be mapped efficiently onto graphics hardware, the speed increase can be significant, and GPUs are scaling up faster than CPUs in terms of raw performance. The implementation can also be simpler than a comparable software version when the latter needs extra data structures, algorithms and lower-level optimizations to cut processing time.

Future shader versions will clear the way for mapping a larger class of algorithms onto graphics processors and higher-level abstractions such as BrookGPU are another welcome development. Faster communication through PCI-Express will also make hybrid solutions more viable.

Acknowledgements

I would like to thank Simon Brown, Willem de Boer, Heine Gundersen, Peter McNeill, David Pollak, Richard Sim and Neil Wakefield for comments and support; Peter-Pike Sloan and Rune Vendler for detailed feedback, information and ideas; Jaakko Lehtinen and Christian Seger for describing their respective hardware processing schemes; the guys at Media Mobsters for regular testing of unreliable code on a range of GPUs; Simon Green and NVIDIA for the Ogre stills and permission to use the head mesh, modeled by Steve Burke; Microsoft for permission to publish details of upcoming D3DX features.

References

[Brown03] Brown, S, How To Fix The DirectX Rasterisation Rules, 2003.
[Cignoni98] Cignoni, P, Montani, C, Rocchini, C, Scopigno, R, A general method for preserving attribute values on simplified meshes, IEEE Visualization, 1998.
[Coombe03] Coombe, G, Harris, M J, Lastra, A, Radiosity on Graphics Hardware, June 2003.
[Forsyth03] Forsyth, T, Spherical Harmonics in Actual Games, GDC Europe 2003.
[James03] James, G, Rendering Objects as Thick Volumes, ShaderX2 2003.
[Landis02] Landis, H, Production-Ready Global Illumination, Siggraph course notes #16, 2002.
[Lehtinen04] Lehtinen, J, personal communication, 2004.
[Purvis03] Purvis, I, Tokheim, L, Real-Time Ambient Occlusion, 2003.
[Sander00] Sander, P, Gu, X, Gortler, S J, Hoppe, H, Snyder, J, Silhouette Clipping, Siggraph 2000.
[Seger03] Seger, C, personal communication, 2003.
[Sloan02] Sloan, P-P, Kautz, J, Snyder, J, Precomputed Radiance Transfer for Real-Time Rendering in Dynamic, Low-Frequency Lighting Environments, Siggraph 2002.
[Sloan03] Sloan, P-P, Hall, J, Hart, J, Snyder, J, Clustered Principal Components for Precomputed Radiance Transfer, Siggraph 2003.
[Sloan04] Sloan, P-P, personal communication, 2004.
[Wang03] Wang, Y, Fröhlich, B, Göbel, M, Fast Normal Map Generation for Simplified Meshes, Journal of Graphics Tools, Vol. 7 No. 4, 2003.
[Whitehurst03] Whitehurst, A, Depth Map Based Ambient Occlusion Lighting, 2003.

Additional Reading

Advanced Global Illumination. Philip Dutré, Philippe Bekaert, Kavita Bala, AK Peters 2003.

"Spherical Harmonics, The Gritty Details", Robin Green, GDC 2003.

Practical Precomputed Radiance Transfer, Peter-Pike Sloan, ShaderX2, 2003.

Ambient Occlusion, Matt Pharr, Simon Green, GPU Gems: Programming Techniques, Tips, and Tricks for Real-Time Graphics, Addison Wesley 2004.

General-Purpose computation on GPUs, GPGPU.org.


Copyright © 2003 CMP Media LLC

privacy policy
| terms of service