
Hardware Accelerating Art Production
Results
The Direct3D Extensions Library (D3DX) provides a number of functions
for software processing of SH PRT as part of the latest public release
(DirectX SDK Update Summer 2003). Since this implementation is both
robust and readily available, it is an ideal reference for comparison
with the hardware version presented earlier.
While not exhaustive, Table 1 shows a trend in the performance
between D3DX and our hardware setup for 9-component vertex PRT.
There is a clear gulf between the two versions as the number of
surface elements increases--something which bodes well for texture
processing.
Two sets of timings are listed for the hardware (HW1 and HW2),
which differ only in the number of samples taken per vertex. The
former is for direct comparison with D3DX: twice as many samples
are used because the hardware method samples over the whole sphere,
so on average only half of them contribute anything (in contrast
to the hemispherical sampling used in software). Bear in mind
that this still isn't an entirely fair comparison, as D3DX importance
samples via a cosine distribution, which speeds up convergence.
HW2 is supplied for a rough idea of the sort of performance to
expect when previewing using a lower number of samples. The optimum
will vary from scene to scene, however, depending on visibility
variance.
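The factor of two between the D3DX and HW1 sample counts, and the caveat about importance sampling, can be checked with a small numerical experiment. The following is a minimal sketch only: it does not reproduce the D3DX sampler or the hardware path, the function names are my own, and the "wall" occluder (blocking every direction with positive x) is an assumed test case whose exact cosine-weighted visibility is 0.5.

```python
import math
import random

def uniform_sphere(rng):
    """Uniformly distributed unit vector over the whole sphere."""
    z = 1.0 - 2.0 * rng.random()
    r = math.sqrt(max(0.0, 1.0 - z * z))
    phi = 2.0 * math.pi * rng.random()
    return (r * math.cos(phi), r * math.sin(phi), z)

def cosine_hemisphere(rng):
    """Unit vector about a +Z normal with density cos(theta) / pi."""
    u = rng.random()
    r = math.sqrt(u)
    phi = 2.0 * math.pi * rng.random()
    return (r * math.cos(phi), r * math.sin(phi), math.sqrt(1.0 - u))

def estimate_uniform_sphere(n, visible, rng):
    """Cosine-weighted visibility from n uniform sphere samples.

    Returns (estimate, fraction of samples in the upper hemisphere).
    """
    total, used = 0.0, 0
    for _ in range(n):
        d = uniform_sphere(rng)
        if d[2] > 0.0:          # only the hemisphere about the normal counts
            used += 1
            if visible(d):
                total += d[2]   # cosine term for a +Z normal
    # pdf = 1 / (4 pi) and the integral is normalised by 1 / pi,
    # so each contributing sample carries a weight of 4 * cos / n.
    return 4.0 * total / n, used / n

def estimate_cosine(n, visible, rng):
    """Same quantity from n cosine-distributed samples (weight 1 each)."""
    hits = sum(1 for _ in range(n) if visible(cosine_hemisphere(rng)))
    return hits / n

if __name__ == "__main__":
    rng = random.Random(2003)
    wall = lambda d: d[0] <= 0.0   # assumed occluder: blocks all x > 0
    est, frac = estimate_uniform_sphere(4096, wall, rng)
    print("sphere, 4096 samples:", est, "fraction used:", frac)
    print("cosine, 2048 samples:", estimate_cosine(2048, wall, rng))
```

Both estimators target the same value, but the sphere sampler discards roughly half its directions and weights the rest by the cosine, so its per-sample variance is higher even after doubling the count.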
Model    | Vertices | D3DX (2048 samples) | HW1 (4096 samples) | HW2 (1024 samples) | D3DX/HW1
shapes1  | 2814     | 14.27s              | 4.94s              | 1.36s              | 2.9x
head     | 10596    | 151.66s             | 13.73s             | 3.52s              | 11.0x
skullocc | 31076    | 581.95s             | 37.77s             | 9.49s              | 15.4x
Table 1: results
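As a quick check on the table, the sketch below recomputes the D3DX/HW1 column and also expresses each timing as raw sample throughput (vertices x samples / seconds). The throughput figure is my own derived measure rather than something reported with the original results; since HW1 takes twice as many samples as D3DX, its throughput ratio is simply twice the listed speed-up.

```python
# Figures copied from Table 1: (model, vertices, D3DX s, HW1 s, HW2 s)
ROWS = [
    ("shapes1", 2814, 14.27, 4.94, 1.36),
    ("head", 10596, 151.66, 13.73, 3.52),
    ("skullocc", 31076, 581.95, 37.77, 9.49),
]
SAMPLES = {"D3DX": 2048, "HW1": 4096, "HW2": 1024}

def speedup(row):
    """Wall-clock ratio D3DX / HW1, as listed in the last column."""
    return row[2] / row[3]

def samples_per_second(vertices, samples, seconds):
    """Derived measure: total vertex samples processed per second."""
    return vertices * samples / seconds

for name, verts, d3dx, hw1, hw2 in ROWS:
    sw = samples_per_second(verts, SAMPLES["D3DX"], d3dx)
    hw = samples_per_second(verts, SAMPLES["HW1"], hw1)
    print(f"{name}: {d3dx / hw1:.1f}x faster, "
          f"{hw / sw:.1f}x the sample throughput")
```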
These figures were recorded on an Athlon XP 2400+ PC equipped
with a GeForce FX 5800 using a release build and libraries. To put
the hardware version in the best light, meshes were "vertex
cache order" optimized beforehand, although this improves locality
of reference for software as well. The depth resolution for hardware
processing was 512x512x8 bits in all cases.
Visually there is a minor difference in lighting (Figure 8) with
hardware processing--caused by the limited storage precision of
samples--which in my opinion is acceptable for previews. Precision
and read-back can be traded off via the split ramp as required,
but both the number of iterations and the depth map (resolution
and precision) have a greater effect on the quality of the results.
The only major difference is the shadows under the eyelids (Figure
9), which are not captured by the hardware version. This is due
to the limited depth precision, which can be increased as described
earlier. A reduced depth bias combined with nudging vertices outwards
(along the normal) may also improve accuracy. It should also be
noted that the cyan color, present in transition areas between the
red and white lights, is due to the low-order SH lighting approximation.
[Sloan04] reports similar results to these with floating point
hardware. His accelerated PRT implementation--essentially the same
as the one presented in this feature but using version 2.0 shaders
and high-precision buffers--will be available in an upcoming DirectX
SDK Update along with significant optimizations to the software
simulator. The revised API will also be more modular, enabling (among
other things) general CPCA compression of user-created data, as
produced here.
Extensions
It might also be possible to map one or more of the versions presented
earlier onto fixed-function hardware. Whether such cards have the
necessary power to outdo a clever software implementation is unclear,
however, and operations may need to be split over extra passes,
cutting performance further.
One trick that may improve vertex AO performance is adaptive ray
sampling. A simple form of this would be to examine vertices after
some fixed number of iterations, find those that appear to be fully
visible or within some margin of error, then through a dynamic index
buffer (or other method) and extra book-keeping, spare these vertices
further processing. A more general solution could look at differences
between blocks of iterations, terminating at a given error threshold.
It's an interesting idea, but extra communication between host and
GPU may counter any potential speed gain.
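The more general form of adaptive sampling can be sketched on the CPU as follows. This is an illustration under stated assumptions rather than a GPU implementation: the `sample_visibility` callback is hypothetical and stands in for one hardware visibility test, the `active` list plays the role of the dynamic index buffer, and the block size and threshold are arbitrary.

```python
import random

def adaptive_occlusion(num_vertices, sample_visibility, block=64,
                       max_blocks=16, threshold=0.01, seed=0):
    """Estimate per-vertex visibility, retiring vertices that converge.

    sample_visibility(v, rng) is a hypothetical callback returning 1.0
    when a random ray from vertex v escapes and 0.0 when it is blocked.
    Returns (estimates, sample counts per vertex).
    """
    rng = random.Random(seed)
    sums = [0.0] * num_vertices
    counts = [0] * num_vertices
    previous = [None] * num_vertices
    active = list(range(num_vertices))   # stands in for a dynamic index buffer

    for _block_index in range(max_blocks):
        if not active:
            break
        still_active = []
        for v in active:
            for _ in range(block):
                sums[v] += sample_visibility(v, rng)
            counts[v] += block
            estimate = sums[v] / counts[v]
            # Retire the vertex once a whole block barely moves the estimate.
            if previous[v] is not None and abs(estimate - previous[v]) < threshold:
                continue
            previous[v] = estimate
            still_active.append(v)
        active = still_active

    return [s / c for s, c in zip(sums, counts)], counts

if __name__ == "__main__":
    # Vertex 0 is fully visible; vertex 1 is occluded about half the time.
    def toy_scene(v, rng):
        return 1.0 if v == 0 or rng.random() < 0.5 else 0.0

    print(adaptive_occlusion(2, toy_scene))
```

A fully visible vertex is retired after two blocks here, while noisier vertices keep sampling. Note that a block-difference test can stop too early by chance, so in practice the threshold and block size would need tuning per scene.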
Other Applications
The following are a few other examples of hardware-accelerated
preprocesses, some of which have already been employed in game development,
and others that could be used.
Radiosity
Coombe et al. [Coombe03] map a progressive radiosity method almost
completely onto the GPU and, in the process, solve a couple of the
classic problems with hemi-cubes. Firstly, the hemi-cube
faces no longer need to be read back to main memory for software
processing of delta form factors. Rather than iterating over the
faces and randomly updating elements--an infeasible task with current
hardware--surface elements, maintained directly in textures, are
instead inverse-transformed to the ID hemi-cube faces. Secondly,
since IDs now only need to be assigned to patches rather than elements,
aliasing--a problem with hemi-cubes when shooting--is also reduced.
The paper is an enlightening read, with clever tricks devised for
face shooting selection and adaptive subdivision. From the point
of view of acceleration, their system reaches a solution rapidly
for simple scenes and intermediate results are available at any
point due to the progressive nature and texture storage.
Static SH Volume
Max Payne 2 uses a static volume of SH estimates over a
game level, for environmental lighting of models at run-time. Remedy
accelerates the offline computation [Lehtinen04]--in particular
the SH projection of environment maps through ps2.0 shaders and
floating-point storage.
At a given grid point, the pre-lit scene is first rendered to an
HDR cube-map. Face texels are then multiplied by the current basis
function evaluated in the corresponding direction. Naturally, since
these directional terms are fixed for all cubes, they can be pre-calculated
and read from a texture. The resulting values are then accumulated
via reductive summing of groups of four neighboring texels through
multiple passes--a process akin to repeated box filtering, just
without the averaging. The final level is then read back and scaled
appropriately, yielding an SH coefficient. Projection is repeated
for all coefficients and rendering for all grid points.
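The multiply-then-reduce step described above can be mimicked in software to make the data flow concrete. The sketch below is not Remedy's implementation: it handles a single square, power-of-two face held as nested lists, the `basis` grid stands in for the pre-calculated directional texture, and the `scale` parameter is a placeholder for whatever normalization is applied after read-back.

```python
def reduce_sum_pass(tex):
    """One pass: each output texel is the sum of a 2x2 group of inputs."""
    n = len(tex) // 2
    return [[tex[2 * y][2 * x] + tex[2 * y][2 * x + 1] +
             tex[2 * y + 1][2 * x] + tex[2 * y + 1][2 * x + 1]
             for x in range(n)]
            for y in range(n)]

def reduce_to_texel(tex):
    """Repeat the 2x2 summing pass until a single texel remains."""
    while len(tex) > 1:
        tex = reduce_sum_pass(tex)
    return tex[0][0]

def project_coefficient(radiance, basis, scale=1.0):
    """Project one face onto one basis function.

    radiance and basis are square grids of equal power-of-two size;
    basis holds the basis function evaluated per texel direction
    (assumed to be pre-calculated, as it would be in a texture).
    """
    size = len(radiance)
    product = [[radiance[y][x] * basis[y][x] for x in range(size)]
               for y in range(size)]
    return scale * reduce_to_texel(product)

if __name__ == "__main__":
    face = [[1.0] * 4 for _ in range(4)]                     # constant radiance
    basis = [[float(x + y) for x in range(4)] for y in range(4)]
    print(project_coefficient(face, basis))
```

Each pass quarters the texel count, so a face of side N needs log2(N) passes before the single remaining value is read back.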
Normal Mapping
Wang et al. [Wang03] describe an image-space method to accelerate
normal map processing using graphics hardware, which has similarities
to the first AO scheme described earlier. The reference mesh is
rendered from a number of viewpoints and depth comparisons are made
in software to determine the nearest surface point and normal.
The authors work around the problem of hidden surface points in
complex meshes by falling back to interpolating target triangle
normals. The process outlined in the paper relies on reading back
the frame buffer, containing the reference normals, and the depth
buffer. It's quite possible that vertex and fragment programs could
be used, as in the case study, to move more of the work to the GPU,
thereby accelerating the process further.
Christian Seger, author of ORB, takes a different approach to normal
map acceleration [Seger03] that avoids the issue of hidden surface
points. For a given triangle of the LP model, triangles from the
HP reference mesh within a search region are first of all culled.
Rather than simply planar projecting these triangles, they are shrunk
down based on the interpolated normal of the target triangle. In
ray-tracing terms this emulates normal, rather than nearest point,
sampling, which reduces artifacts--see [Sander00]. Using the coordinates
calculated from the shrinking process, hardware then renders these
triangles to the normal map, with stenciling used to clip away any
texels outside of the target triangle in texture-space.
Conclusion
This feature, through the case study and other examples, has hopefully
convinced you that programmable graphics can play a role in
developing more responsive art tools. Furthermore, the potential
for acceleration is not restricted to the very latest floating point
graphics processors.
It's true that software is ultimately more general, often easier
to debug and extend and may offer greater accuracy. But GPU restrictions
are falling away, shader debugging is becoming easier and as the
results show, high numerical accuracy isn't a prerequisite for previews.
When a process can be mapped efficiently onto graphics hardware,
the speed increase can be significant, and GPUs are scaling up faster
than CPUs in terms of raw performance. The implementation can also
be simpler than a comparable software version when the latter needs
extra data structures, algorithms and lower-level optimizations
to cut processing time.
Future shader versions will clear the way for mapping a larger
class of algorithms onto graphics processors and higher-level abstractions
such as BrookGPU are another welcome development. Faster communication
through PCI-Express will also make hybrid solutions more viable.
Acknowledgements
I would like to thank Simon Brown, Willem de Boer, Heine Gundersen,
Peter McNeill, David Pollak, Richard Sim and Neil Wakefield for
comments and support; Peter-Pike Sloan and Rune Vendler for detailed
feedback, information and ideas; Jaakko Lehtinen and Christian Seger
for describing their respective hardware processing schemes; the
guys at Media Mobsters for regular testing of unreliable code on
a range of GPUs; Simon Green and NVIDIA for the Ogre stills and
permission to use the head mesh, modeled by Steve Burke; Microsoft
for permission to publish details of upcoming D3DX features.
References
[Brown03] Brown, S, How To Fix The DirectX Rasterisation Rules,
2003.
[Cignoni98] Cignoni, P, Montani, C, Rocchini, C, Scopigno, R, A
general method for preserving attribute values on simplified meshes,
IEEE Visualization, 1998.
[Coombe03] Coombe, G, Harris, M J, Lastra, A, Radiosity on Graphics
Hardware, June 2003.
[Forsyth03] Forsyth, T, Spherical Harmonics in Actual Games, GDC
Europe 2003.
[James03] James, G, Rendering Objects as Thick Volumes, ShaderX2
2003.
[Landis02] Landis, H, Production-Ready Global Illumination, Siggraph
course notes #16, 2002.
[Lehtinen04] Lehtinen, J, personal communication, 2004.
[Purvis03] Purvis, I, Tokheim, L, Real-Time Ambient Occlusion, 2003.
[Sander00] Sander, P, Gu, X, Gortler, S J, Hoppe, H, Snyder, J,
Silhouette Clipping, Siggraph 2000.
[Seger03] Seger, C, personal communication, 2003.
[Sloan02] Sloan, P-P, Kautz, J, Snyder, J, Precomputed Radiance
Transfer for Real-Time Rendering in Dynamic, Low-Frequency Lighting
Environments, Siggraph 2002.
[Sloan03] Sloan, P-P, Hall, J, Hart, J, Snyder, J, Clustered Principal
Components for Precomputed Radiance Transfer, Siggraph 2003.
[Sloan04] Sloan, P-P, personal communication, 2004.
[Wang03] Wang, Y, Fröhlich, B, Göbel, M, Fast Normal Map
Generation for Simplified Meshes, Journal of Graphics Tools, Vol.
7 No. 4, 2003.
[Whitehurst03] Whitehurst, A, Depth Map Based Ambient Occlusion
Lighting, 2003.
Additional Reading
Advanced Global Illumination. Philip Dutré, Philippe Bekaert,
Kavita Bala, AK Peters 2003.
"Spherical Harmonics, The Gritty Details", Robin Green,
GDC 2003.
Practical Precomputed Radiance Transfer, Peter-Pike Sloan, ShaderX2,
2003.
Ambient Occlusion, Matt Pharr, Simon Green, GPU Gems: Programming
Techniques, Tips, and Tricks for Real-Time Graphics, Addison
Wesley 2004.
General-Purpose computation on GPUs, GPGPU.org.