# Hardware Accelerating Art Production

March 19, 2004 Page 2 of 3

Mapping AO onto the GPU

Reformulation

One method of directly mapping AO's hemispherical sampling onto hardware is the hemi-cube [Purvis03], but this is computationally expensive: the scene must be transformed and rendered multiple times for each surface element. A more efficient alternative comes from tracing many coherent rays together in the opposite direction. An intuitive implementation of this, developed by Weta Digital and described in [Whitehurst03], involves surrounding the scene with a sphere of lights, as Figure 5 shows. The light directions are used for sample weighting, while associated depth maps (our coherent rays) are used for visibility determination.

 Figure 5: Sampling over a sphere. The viewpoints (represented as cones) are processed one at a time. This batches together visibility rays through a shadow depth map for each orientation.

Equation 3 encapsulates Monte Carlo integration in this instance. For each element p, weighted visibility samples (from this point on simply referred to as "samples") are accumulated and averaged via the weight sum w. Because the sample directions si cover the unit sphere, those outside of a point's vision are given a zero weighting by the hemispherical function H. The visibility function V is just a depth comparison between the surface element in depth map space and the corresponding map value.

 Equation 3a: Monte Carlo integration over a sphere

 Equation 3b: Definitions
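
The averaging described above can be sketched in a few lines of software. This is only an illustration under stated assumptions: the hemispherical function is taken to be the cosine weight max(N.s_i, 0) used in the listings later on, and the function names are made up for this example.

```python
def hemi_weight(normal, s):
    """H_n(s): cosine weight, zero for directions behind the surface."""
    return max(sum(n * c for n, c in zip(normal, s)), 0.0)

def ambient_occlusion(normal, directions, visible):
    """Weighted average of visibility samples for one surface element.

    directions: unit vectors s_i covering the sphere.
    visible:    V(p, s_i) as 0/1 flags, one per direction (depth test result).
    """
    weight_sum = 0.0
    sample_sum = 0.0
    for s, v in zip(directions, visible):
        w = hemi_weight(normal, s)
        weight_sum += w        # w in the text: the running weight total
        sample_sum += w * v    # weighted visibility sample
    return sample_sum / weight_sum if weight_sum > 0.0 else 0.0

# Six axis-aligned directions; the element faces +z, so only the
# straight-up direction carries any weight.
dirs = [(1, 0, 0), (-1, 0, 0), (0, 1, 0), (0, -1, 0), (0, 0, 1), (0, 0, -1)]
ao_open = ambient_occlusion((0, 0, 1), dirs, [1, 1, 1, 1, 1, 1])
ao_blocked = ambient_occlusion((0, 0, 1), dirs, [1, 1, 1, 1, 0, 1])
```

Directions behind the surface get a zero weight, so they drop out of both totals without any special casing.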

We can use graphics hardware to quickly render depth maps, but it's possible to go further and gain full benefit from the GPU after overcoming a couple of hurdles. The following work-through covers the stages involved in accomplishing this.

First Try

Before getting more adventurous in the kitchen, let's start baking using plain rasterization hardware and software processing. Here's the recipe:

1. Pick an orientation around scene center
2. Render geometry from this viewpoint
3. Read back the depth buffer
4. Transform each surface element into depth buffer space
5. Perform visibility test via depth comparison
6. Repeat above steps, accumulating element samples and weights
7. Calculate AO for each element from sample and weight totals
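
For illustration, here is a rough software-only rendition of the recipe. It takes liberties: surface elements are splatted as points into an orthographic depth map instead of rasterizing triangles, the scene is assumed to fit inside a sphere of the given radius centered on the origin, and every name and default value (radius, map size, bias) is hypothetical.

```python
import math

def dot(a, b):
    return a[0] * b[0] + a[1] * b[1] + a[2] * b[2]

def cross(a, b):
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])

def normalize(a):
    length = math.sqrt(dot(a, a))
    return (a[0] / length, a[1] / length, a[2] / length)

def to_depth_space(point, s, radius, size):
    """Orthographic transform: depth map pixel plus depth in [0, 1]."""
    # Build two axes perpendicular to the sample direction s.
    helper = (1.0, 0.0, 0.0) if abs(s[0]) < 0.9 else (0.0, 1.0, 0.0)
    u = normalize(cross(helper, s))
    v = cross(s, u)
    px = int((dot(point, u) / radius * 0.5 + 0.5) * (size - 1) + 0.5)
    py = int((dot(point, v) / radius * 0.5 + 0.5) * (size - 1) + 0.5)
    depth = (radius - dot(point, s)) / (2.0 * radius)  # 0 is nearest the viewpoint
    return px, py, depth

def bake_ao(points, normals, directions, radius=2.0, size=32, bias=0.004):
    sample_sum = [0.0] * len(points)
    weight_sum = [0.0] * len(points)
    for s in directions:                       # pick an orientation
        coords = [to_depth_space(p, s, radius, size) for p in points]
        depth_map = {}
        for px, py, d in coords:               # "render" depth (point splats)
            depth_map[(px, py)] = min(d, depth_map.get((px, py), 1.0))
        for i, (px, py, d) in enumerate(coords):
            w = max(dot(normals[i], s), 0.0)   # cosine weight
            visible = 1.0 if d <= depth_map[(px, py)] + bias else 0.0
            sample_sum[i] += w * visible       # accumulate sample and weight
            weight_sum[i] += w
    return [sm / w if w > 0.0 else 0.0 for sm, w in zip(sample_sum, weight_sum)]

# A floor element directly beneath a blocker, both facing up.
points = [(0.0, 0.0, 0.0), (0.0, 0.0, 1.0)]
normals = [(0.0, 0.0, 1.0), (0.0, 0.0, 1.0)]
directions = [(0.0, 0.0, 1.0), (1.0, 0.0, 0.0), (-1.0, 0.0, 0.0)]
result = bake_ao(points, normals, directions)
```

In the real thing the depth map lives on the graphics card, so every orientation costs a transfer back to the host before the comparison can run.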

This is a valid procedure for graphics chips that lack any sort of programmability, but otherwise hardware capabilities are going to waste; per-element transform and triangle rasterization (in the case of texture baking) are tasks better performed by dedicated hardware. The performance of this approach is also hurt by read-back, a big deal given the high number of iterations necessary to avoid noise (or banding because of shared viewpoints).

To boost processing speed substantially, we must make more effective use of the GPU. As the remaining steps show, this leads to reducing read-back as well.

Second Attempt

The time-consuming steps--bar depth transfer--are amenable to stream processing, so let's take advantage of this. Depth rendering will remain largely as before, but we can replace software transform and comparison of surface elements with shaders. A vertex shader computes the weight for a given surface element and orientation. A corresponding pixel shader performs the necessary depth check, with the weight and test result written out. As Figure 6 captures, we now have two accelerated stages: a depth pass and a sampling pass.

 Figure 6: The shader passes in the second attempt at mapping AO onto the GPU.

For vertex AO, points (i.e., D3DPT_POINTLIST) are sent through the new shader pair, each occupying a single pixel--containing the weight and test result--in the render target. With these laid out contiguously in vertex order, the target contents can be simply iterated over in software.
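
A minimal sketch of that layout follows, assuming vertices are packed row by row and addressed at pixel centers. The helper names are hypothetical, and the coordinates would be supplied per vertex as the texture coordinates that the sampling shader remaps for rasterizing.

```python
def vertex_to_pixel(index, target_width):
    """Pixel occupied by a vertex when points are packed row by row."""
    return index % target_width, index // target_width

def vertex_to_uv(index, target_width, target_height):
    """Texture coordinate of that pixel's centre, in [0, 1]."""
    x, y = vertex_to_pixel(index, target_width)
    return (x + 0.5) / target_width, (y + 0.5) / target_height

def read_back(target_rows, vertex_count):
    """Flatten the target rows into per-vertex results, in vertex order."""
    flat = [texel for row in target_rows for texel in row]
    return flat[:vertex_count]

pixel = vertex_to_pixel(5, 4)      # sixth vertex in a 4-pixel-wide target
uv = vertex_to_uv(5, 4, 4)
```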

In the case of texture baking, the process is slightly more involved, as triangles are rasterized instead using non-overlapping UVs specified by the artist or generated automatically. The weight is also calculated per-pixel, with the vertex shader performing setup instead.

Read-back has been reduced somewhat compared to the first try, assuming that surface resolution (number of elements) is lower than depth buffer resolution--under-sampling would occur otherwise. A jump in speed should be expected as well, depending on relative GPU and CPU muscle, plus the amount of time spent optimizing the last version. The new process is also simpler since the dedicated hardware takes care of lower-level computations.

Despite these gains, room for improvement remains, as the pipeline is still stalled by read-back. If the summation stage were moved to the graphics card, the GPU would be kept busy and we would only have to read the totals back at the end. This is possible via ps2.0 shaders and high-precision targets, but cards that support these features are not ubiquitous, and the extra accuracy can come with a speed and storage hit. As the final version reveals, however, with a little craftiness the GPU can perform partial summation without the features just mentioned, cutting read-back by more than a further order of magnitude.

Third Time Lucky

A split ramp, a neat idea borrowed from [James03], is the solution to the problem of accumulating values in an 8-bit-per-component buffer. Consider the case of vertex AO from the previous attempt. Rather than just passing the weight through the pixel shader, the value can be used to index a special texture which splits it across color components. High bits are returned in one channel and the low bits in another.

The issue of precision has been avoided up until now, but as the results will show, 8 bits are plenty for a single weight or sample--for preview purposes, anyway. Accordingly, the split ramp chops the weight into two 4-bit parts over R and G. Samples can be handled in the same way with the weight in B and A, masked by the visibility result. The empty high bits in all components allow values to be accumulated via alpha blending, saturating after 16 passes or so. Figure 7 illustrates this extended process for a single element.

 Figure 7: The process of splitting weights and samples across multiple color components.

After every block of iterations, the host reconstructs sums from the target, adding them to running totals held in main memory for each element. The target is then cleared to zero for the next set of iterations.
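
To make the arithmetic concrete, here is a small software model of the split-and-accumulate cycle for one element. It is a sketch only: the ramp is modeled as integer nibbles, blending as a saturating add, and all names are invented for the example.

```python
def split_ramp(value8):
    """Split an 8-bit value into (high nibble, low nibble)."""
    return value8 >> 4, value8 & 0x0F

def blend_add(dst, src):
    """Additive blend into an 8-bit channel, saturating at 255."""
    return min(dst + src, 255)

def accumulate(weights8, visible):
    """Run one block of passes; returns the RGBA texel for the element.

    Weight goes to R (high) and G (low), sample to B (high) and A (low).
    """
    r = g = b = a = 0
    for w, v in zip(weights8, visible):
        hi, lo = split_ramp(w)
        r, g = blend_add(r, hi), blend_add(g, lo)
        if v:  # the sample is the weight masked by the visibility result
            b, a = blend_add(b, hi), blend_add(a, lo)
    return r, g, b, a

def reconstruct(hi_sum, lo_sum):
    """Host-side recombination of the two accumulated channels."""
    return hi_sum * 16 + lo_sum

weights = [255, 128, 37, 200] * 4           # 16 passes
visible = [True, False, True, True] * 4
r, g, b, a = accumulate(weights, visible)
weight_total = reconstruct(r, g)
sample_total = reconstruct(b, a)
```

Each nibble is at most 15, so 16 passes add up to no more than 240 per channel and nothing saturates before the host reads the block back.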

We have reduced the frequency of the read-back to 1/16th, on top of the earlier improvements, and processing is now blazingly fast for typical game meshes and texture sizes. Timings and analysis are provided later in the article.

Practicalities

There are several practical issues I have glossed over that can affect quality and correctness, some of which are independent of this case study. While perspective projection could be used for depth map rendering, as suggested by earlier talk of a "sphere of lights", orthographic projection suits our needs perfectly.

With the former, extra shader math is required because the eye direction (or sample direction depending on the point of view) varies, and the resolution is biased towards near features. Orthographic rendering, on the other hand, offers uniformity and constructing a tight frustum is effortless, since the dimensions are best based upon the scene's bounding sphere for consistency across all viewpoints.

With Direct3D, as used here, the issue of rasterization rules comes into play. Geometry is sampled at pixel centers--a mismatch with texture lookup, which reads from texel edges. When rendering to a texture with subsequent reading in another pass, it's simplest to adjust the rasterization coordinates and [Brown03] presents a clean way to achieve this. It's a good idea to get this aspect set up and tested with predictable data from the outset, so that you can avoid problems later, subtle or otherwise.
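
As a rough guide, the adjustment amounts to shifting rendering by half a pixel. The sketch below computes that shift as a clip-space offset for an orthographic projection (w = 1). This is my own formulation rather than necessarily the one in [Brown03], and the function names are hypothetical.

```python
def rasterisation_offset(width, height):
    """Clip-space offset that moves geometry half a pixel left and up.

    Clip space spans 2 units across the target, so half a pixel is
    1/width horizontally and 1/height vertically (y points up in clip space).
    """
    return (-1.0 / width, 1.0 / height, 0.0, 0.0)

def texel_centre_uv(x, y, width, height):
    """Texture coordinate that addresses the centre of texel (x, y)."""
    return ((x + 0.5) / width, (y + 0.5) / height)

offset = rasterisation_offset(256, 128)
centre = texel_centre_uv(0, 0, 4, 4)
```

A vertex shader can then add the offset to the projected position, which is one plausible use for the "rasterisation offset" constant that appears in the listings.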

Filtering is another easily forgotten but important issue when using certain lookup tables. With a split ramp, point sampling is required to ensure a correct value is returned in regions where the low bits wrap around.
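
The following toy model shows what goes wrong, assuming a ramp with one texel per 8-bit value and channels that are rounded to integers after filtering. Real hardware filtering precision varies, so treat the exact numbers as illustrative.

```python
import math

def split_ramp_texel(value8):
    """Ramp contents: (high nibble, low nibble) for an 8-bit value."""
    return value8 >> 4, value8 & 0x0F

def quantize(x):
    """Round to the nearest integer channel value."""
    return int(math.floor(x + 0.5))

def sample_point(coord):
    """Point sampling: fetch a single texel and recombine."""
    hi, lo = split_ramp_texel(int(coord))
    return hi * 16 + lo

def sample_linear(coord):
    """Blend the two neighbouring texels per channel, then quantize."""
    lower = int(math.floor(coord))
    t = coord - lower
    hi0, lo0 = split_ramp_texel(lower)
    hi1, lo1 = split_ramp_texel(min(lower + 1, 255))
    hi = quantize(hi0 + (hi1 - hi0) * t)
    lo = quantize(lo0 + (lo1 - lo0) * t)
    return hi * 16 + lo

# Halfway between 15 (high 0, low 15) and 16 (high 1, low 0).
exact = sample_point(15.5)
filtered = sample_linear(15.5)
```

With point sampling the lookup returns 15, while the filtered channels recombine to 24: the low nibble lands in the middle of its range at the same moment the high nibble rounds up.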

All of the variations in the next section make use of an 8-bit pseudo depth buffer. In most cases it's possible to swap in a more accurate 16-bit version (using a splitting scheme) without needing a separate pass for the depth comparison. Hardware shadow mapping can also be used when supported.

Implementations

It has already been noted that AO can be calculated and stored at every mesh vertex or separately as an occlusion map. The former is a good option with constant ambient lighting, provided that surfaces are sufficiently tessellated to capture shadow changes well. This requirement goes away with the latter, and while an AO texture incurs a storage cost, one may be able to pack the channel with another map in order to save a texture stage.

Another option is available when high-poly models are used for normal map generation. The extra information can be used to compute a more detailed AO texture, at the cost of additional processing time.

Also, precomputed radiance transfer can go beyond AO (and bent normals) through extra terms, handling low-frequency lighting situations more accurately--again at the cost of storage, although the data can be compressed. Rendering is affected as well, since SH lighting requires some additional support.

These situations are all dealt with in the remainder of this section, particularly with regard to accelerated preprocessing.

Vertex Baking

There is little more to say on the workings of vertex AO, and the shader code in Listing 1 should be self-explanatory.

Listing 1.

////////////////////////////////////////////////
// depth.vsh
////////////////////////////////////////////////
vs.1.1
// c0       : Rasterisation offset
// c1-4     : World*View*Proj. matrix
dcl_position    v0
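// Output projected coordinates
// (the oPos write is absent from the source listing; adding
// the rasterisation offset in c0 is an assumption)
m4x4 r0, v0, c1
add oPos, r0, c0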
// Output depth via diffuse colour register
mov oD0, r0.z
////////////////////////////////////////////////
// sampling_v.vsh
////////////////////////////////////////////////
vs.1.1
// c0       : Rasterisation offset
// c1-4     : World*View*Proj. matrix
// c5       : Sample direction
def c8,   2.0, -2.0, -1.0,  1.0
def c9,   0.5, -0.5,  0.0,  1.0
def c10,  0.0,  0.0,  0.0,  0.504
dcl_position    v0
dcl_normal      v1
dcl_texcoord    v2
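// Scale and offset texture coordinates
// to [-1, 1] range for render target
// (the mad and add below are absent from the source listing;
// they are reconstructed from the comments and the c8 values)
mad r0.xy, v2, c8.xy, c8.zw
mov r0.zw, c9.zw
// Output coordinates for rasterising
add oPos, r0, c0
// Project vertex coordinates
m4x4 r0, v0, c1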
// Output depth via diffuse colour register
// (for consistency with depth pass)
mov oD0, r0.z
// Output bias for depth test
mov oD1, c10.w
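// Scale and offset projected coordinates
// for depth map lookup:
// x' =  x*0.5 + 0.5*w
// y' = -y*0.5 + 0.5*w
// z' =  0
// w' =  w
mul r0, r0, c9
// (the 0.5*w offset and oT0 write are absent from the source
// listing; this mad is reconstructed from the formulas above)
mad oT0, r0.w, c9.xxzz, r0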
// Cosine weighting: max(N.s_i, 0)
dp3 r0.z, v1, c5
max r0.z, r0.z, c9.z
// Output weight, to be split via ramp lookup
mov oT1.x, r0.z
mov oT1.yzw, c9.zzw
////////////////////////////////////////////////
// sampling_v.psh
////////////////////////////////////////////////
ps.1.1
def c0, 1.0, 1.0, -1.0, -1.0  // Sample mask
tex t0  // Depth
tex t1  // Weight (R & G), sample (B & A)
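// Compute depth difference, with
// a bias added for cnd (0.5 + 1/255)
sub r0.a, t0.a, v0.a
// (bias add is absent from the source listing; assumed to
// use the value passed through oD1)
add r0.a, r0.a, v1.a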
// Output weight and sample (zero if occluded)
mul_sat r1, t1, c0
cnd r0, r0.a, t1, r1


Texture Baking

For texture AO, weights must be calculated per-pixel. To this end, a transform of the normal into world-space replaces previous weight operations in the vertex shader. The interpolated normal is then used to index a cubic split ramp, which performs the necessary normalization per-pixel. The remaining math comes courtesy of the ramp, and we arrive at the split weight as before. This all works because the world z-axis is the sample direction, which points in the opposite direction of the camera.
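
As a sketch of what one texel of such a cubic ramp might hold, assume the sample direction is +z and reuse the 4-bit split from vertex AO. The function name and the rounding are my own choices for illustration.

```python
import math

def cube_ramp_texel(direction):
    """Texel for a cube map indexed by an interpolated, unnormalized normal."""
    x, y, z = direction
    length = math.sqrt(x * x + y * y + z * z)
    weight = max(z / length, 0.0)        # cosine against the z-axis
    value8 = int(weight * 255.0 + 0.5)   # quantize to 8 bits
    hi, lo = value8 >> 4, value8 & 0x0F
    return (hi, lo, hi, lo)              # weight in R and G, sample in B and A

facing = cube_ramp_texel((0.0, 0.0, 2.0))    # unnormalized, straight on
tilted = cube_ramp_texel((0.0, 3.0, 4.0))    # cosine of 0.8
away = cube_ramp_texel((3.0, 0.0, -1.0))     # behind the surface
```

Because a cube map lookup depends only on the direction of its coordinate, the normalization comes for free.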

With a little effort, these two versions can be unified. A cube-map lookup might come at a cost for vertex AO, but with a little fiddling it's possible to set things up so that only the split texture need change between the cases.

Texture Baking II

While a texture offers higher fidelity than vertex occlusion, in light of normal mapping the additional resolution is somewhat under-used. There are a couple of ways to increase texture detail if high-poly (HP) reference geometry is available.

The first approach is to make use of the derived normal map and calculate the weight in the pixel shader, split via a dependent texture look-up--still manageable in ps1.1 without spreading sampling work across passes. This gives the extra detail of the HP model with the coarse geometric occlusion of the low-poly (LP) mesh.

For more faithful reproduction of the extra detail, ATI's Normal Mapper tool has the option of sampling AO per texel during normal map generation. This is rather costly however; a more frugal approach would be to precalculate AO at the vertices of the HP mesh using hardware. Occlusion then becomes just another attribute sampled by the ray caster (as described by [Cignoni99]).

A single occlusion term is extremely compact, is a useful lighting component for games right now, and can be stretched beyond its assumptions, but it has its limits. Precomputed radiance transfer (PRT) using the SH basis functions [Sloan02] generalizes AO (the 0th term) and can capture additional directionality and therefore other effects. With nine or more transfer components, soft shadows noticeably track as lights move. Although this comes at the cost of increased storage, vectors can be efficiently compressed through Clustered Principal Component Analysis (CPCA) [Sloan03].

The assumptions for PRT to work, at least in the form presented here, are as with AO except that incoming radiance is assumed to be distant rather than constant. Lighting, now approximated via basis coefficients, can be factored out with the remaining integral evaluated for all surface points, yielding a transfer vector. Without compression, outgoing radiance Rp is reconstructed as a large inner product between the vector L (expressing environmental lighting) and the transfer vector Tp scaled by the diffuse surface response, as Equation 4 shows.

 Equation 4: Outgoing radiance (from a diffuse surface and distant lighting) approximated via spherical harmonics
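
In code form the uncompressed reconstruction is just a dot product. The sketch below assumes any normalization constants are already folded into the coefficients, and the names and values are purely illustrative.

```python
def outgoing_radiance(albedo, lighting, transfer):
    """Inner product of lighting and transfer vectors, scaled by albedo."""
    return albedo * sum(l * t for l, t in zip(lighting, transfer))

lighting = [1.0, 0.0, 0.5, 0.0]     # L: environment lighting coefficients
transfer = [0.75, 0.25, 0.5, 0.0]   # T_p: transfer vector for one element
radiance = outgoing_radiance(0.5, lighting, transfer)
```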

So our occlusion term is now a vector, computed in much the same way as with AO, except for the presence of basis functions Bi. Indeed, Equation 5 is very similar to Equation 3, but with a summation for each term.

 Equation 5: Transfer vector (Monte Carlo integration)
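
A software sketch of this accumulation, restricted to the first four real SH basis functions, might look as follows. The basis ordering and signs follow one common convention, and the division by the weight sum mirrors the AO case. Both are assumptions on my part.

```python
def sh_basis(s):
    """First four real spherical harmonics (bands 0 and 1) at unit direction s."""
    x, y, z = s
    return (0.282095, 0.488603 * y, 0.488603 * z, 0.488603 * x)

def transfer_vector(normal, directions, visible):
    """Per-term weighted sums of B_i * V * H, divided by the weight total."""
    weight_sum = 0.0
    totals = [0.0] * 4
    for s, v in zip(directions, visible):
        h = max(sum(n * c for n, c in zip(normal, s)), 0.0)  # cosine weight
        weight_sum += h
        for i, b in enumerate(sh_basis(s)):
            totals[i] += b * v * h
    if weight_sum > 0.0:
        return [t / weight_sum for t in totals]
    return totals

dirs = [(1, 0, 0), (-1, 0, 0), (0, 1, 0), (0, -1, 0), (0, 0, 1), (0, 0, -1)]
t_open = transfer_vector((0, 0, 1), dirs, [1] * 6)
t_blocked = transfer_vector((0, 0, 1), dirs, [1, 1, 1, 1, 0, 1])
```

The first component is the AO value scaled by the constant band-0 basis function, which is the sense in which AO is the 0th term.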

For our hardware implementation, each basis function Bi is evaluated in software and uploaded as a constant, since the sampling direction is fixed for a given viewpoint. Scaling and biasing is required with these terms so that they fall within the range [0,1], to prevent clipping during split ramp lookup. Accumulated values are later range-expanded in software.
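
The packing and later range expansion can be sketched as below. It assumes an occluded sample contributes the bias alone (a packed zero), which is how I read the shaders in Listing 2. The names and the scale and bias values are illustrative.

```python
def pack(value, scale, bias):
    """Map a signed term into [0, 1] ahead of the split ramp lookup."""
    return value * scale + bias

def accumulate_packed(terms, visible, scale, bias):
    """Sum packed samples over a block of passes for one element."""
    total = 0.0
    for term, v in zip(terms, visible):
        # Occluded samples still add the bias, so it can be removed uniformly.
        total += pack(term if v else 0.0, scale, bias)
    return total

def range_expand(total, passes, scale, bias):
    """Recover the signed sum from the accumulated packed values."""
    return (total - passes * bias) / scale

terms = [0.5, -1.0, 0.25, 1.0]            # B_i * H per pass, in [-1, 1]
visible = [True, True, False, True]
packed_total = accumulate_packed(terms, visible, scale=0.5, bias=0.5)
signed_total = range_expand(packed_total, len(terms), scale=0.5, bias=0.5)
```

Since every pass adds exactly one bias whether or not the sample is visible, the host only needs the pass count to undo it.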

Assuming a fairly liberal number of PRT components, it makes sense to factor out the parts of the sampling process that are constant over all components: namely the weights, which are currently packed with the samples, and visibility determination.

With the operations moved to separate passes that write their results to additional textures, shader resources are freed up, thus allowing two PRT components to be processed in parallel per sampling pass--at least in the case of vertex baking.

One side benefit of this extra partitioning is that it's now easier to swap in a more accurate set of depth render and depth test passes, perhaps at run-time based on a quality setting. Listing 2 provides just the shaders for packed PRT sampling as the rest (depth, depth compare and weights passes) can be derived from Listing 1.

Listing 2.

////////////////////////////////////////////////
// sampling_prt_v.vsh
////////////////////////////////////////////////
vs.1.1
// c0       : Rasterisation offset
// c1-4     : World*View*Proj. matrix
// c5       : Sample direction
// c6       : Packed SH basis terms:
//          : t0*scale0, t1*scale1, bias0, bias1
def c8,  2.0, -2.0, -1.0,  1.0
def c9,  0.5, -0.5,  0.0,  1.0
dcl_position    v0
dcl_normal      v1
dcl_texcoord    v2
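// Scale and offset texture coordinates
// to [-1, 1] range for render target
// (the mad and add below are absent from the source listing;
// they are reconstructed from the comments and the c8 values)
mad r0.xy, v2, c8.xy, c8.zw
mov r0.zw, c9.zw
// Output coordinates for rasterising
add oPos, r0, c0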
// Output coordinates for depth result lookup
mov oT0, v2
// Cosine weighting: max(N.s_i, 0)
dp3 r0.z, v1, c5
max r0.z, r0.z, c9.z
// Sample = B(s)*[V(s)]*Hn(s), scaled and
// biased to [0, 1] range
// V(s) is evaluated in the pixel shader
//
// The following values are passed through
// a pair of 2D split ramps:
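// Scaled and biased samples in x and y
// (this mad is absent from the source listing; it is assumed
// from the c6 packing: weight * t*scale + bias)
mad oT1.xy, r0.z, c6.xy, c6.zw
mov oT1.zw, c9.zw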
// Biases in x and y (occluded case)
// Note: since these are fixed, a pixel
mov oT2.xy, c6.zw
mov oT2.zw, c9.zw
////////////////////////////////////////////////
// sampling_prt_v.psh
////////////////////////////////////////////////
ps.1.1
tex t0  // Depth test result - V(s)
tex t1  // Sample0, sample1
tex t2  // Bias0, bias1
// Use depth test result to mask packed samples
mov r0.a, t0.a
cnd r0, r0.a, t1, t2

