**Mapping AO onto the GPU**

**Reformulation**

One method of directly mapping AO's hemispherical sampling onto
hardware is the hemi-cube [Purvis03], but this is computationally
expensive: the scene must be transformed and rendered multiple times
for each surface element. A more efficient alternative comes from
tracing many coherent rays together in the opposite direction. An
intuitive implementation of this, developed by Weta Digital and
described in [Whitehurst03], involves surrounding the scene with
a sphere of lights, as Figure 5 shows. The light directions are
used for sample weighting, while associated depth maps (our coherent
rays) are used for visibility determination.

Equation 3 encapsulates Monte Carlo integration in this instance.
For each element *p*, weighted visibility samples (from this
point on simply referred to as "samples") are accumulated
and averaged via the weight sum *w*. Because the sample directions
*s*_{i} cover the unit sphere, those falling outside a point's
visible hemisphere are given a zero weighting by the hemispherical
function *H*. The visibility function *V* is just a depth comparison
between the surface element in depth map space and the corresponding
map value.
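
In symbols, the accumulation just described takes roughly the following form (a reconstruction from the description above, not the article's exact notation, with *n*_{p} the surface normal at *p*):

$$AO_{p} \approx \frac{1}{w} \sum_{i} V(p, s_{i})\, H(n_{p} \cdot s_{i}), \qquad w = \sum_{i} H(n_{p} \cdot s_{i}), \qquad H(x) = \max(x, 0)$$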

We can use graphics hardware to quickly render depth maps, but
it's possible to go further and gain full benefit from the GPU after
overcoming a couple of hurdles. The following work-through covers
the stages involved in accomplishing this.

**First Try**

Before getting more adventurous in the kitchen, let's start baking
using plain rasterization hardware and software processing. Here's
the recipe, with a code sketch following the list:

1. Pick an orientation around scene center

2. Render geometry from this viewpoint

3. Read back depth information

4. Transform each surface element into depth buffer space

5. Perform visibility test via depth comparison

6. Repeat above steps, accumulating element samples and weights

7. Calculate AO for each element from sample and weight totals
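
For concreteness, here is a minimal CPU-side sketch of the recipe in C++. The helpers renderDepthMap and toDepthSpace are hypothetical stand-ins for steps 1 through 4; only the accumulation follows the list literally.

// first_try.cpp : hypothetical sketch of the software recipe
// (not the article's code)
#include &lt;algorithm&gt;
#include &lt;cstddef&gt;
#include &lt;vector&gt;
struct Vec3 { float x, y, z; };
float dot(const Vec3& a, const Vec3& b) { return a.x*b.x + a.y*b.y + a.z*b.z; }
// Assumed helpers: render scene depth from direction s (steps 1-3) and
// transform a point into that depth buffer's space (step 4)
std::vector&lt;float&gt; renderDepthMap(const Vec3& s, int w, int h);
Vec3 toDepthSpace(const Vec3& p, const Vec3& s, int w, int h);
void bakeVertexAO(const std::vector&lt;Vec3&gt;& pos, const std::vector&lt;Vec3&gt;& nrm,
                  const std::vector&lt;Vec3&gt;& dirs, int w, int h,
                  std::vector&lt;float&gt;& ao)
{
    std::vector&lt;float&gt; sum(pos.size(), 0.0f), wsum(pos.size(), 0.0f);
    for (const Vec3& s : dirs)
    {
        const std::vector&lt;float&gt; depth = renderDepthMap(s, w, h);
        for (std::size_t i = 0; i &lt; pos.size(); ++i)
        {
            float weight = std::max(dot(nrm[i], s), 0.0f); // hemispherical H
            if (weight == 0.0f) continue;
            Vec3 q = toDepthSpace(pos[i], s, w, h);
            bool visible = q.z &lt;= depth[(int)q.y * w + (int)q.x]; // step 5
            sum[i]  += visible ? weight : 0.0f;            // step 6
            wsum[i] += weight;
        }
    }
    for (std::size_t i = 0; i &lt; pos.size(); ++i)           // step 7
        ao[i] = wsum[i] &gt; 0.0f ? sum[i] / wsum[i] : 1.0f;
}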

This is a valid procedure for graphics chips that lack any sort
of programmability, but otherwise hardware capabilities are going
to waste; per-element transform and triangle rasterization (in the
case of texture baking) are tasks better performed by dedicated
hardware. The performance of this approach is also hurt by read-back,
a big deal given the high number of iterations needed to avoid
noise (or banding, when elements share the same set of viewpoints).

To boost processing speed substantially, we must make more effective
use of the GPU. As the remaining steps show, this leads to reducing
read-back as well.

**Second Attempt**

The time-consuming steps--bar depth transfer--are amenable to stream
processing, so let's take advantage of this. Depth rendering will
remain largely as before, but we can replace software transform
and comparison of surface elements with shaders. A vertex shader
computes the weight for a given surface element and orientation.
A corresponding pixel shader performs the necessary depth check,
with the weight and test result written out. As Figure 6 captures,
we now have two accelerated stages: a depth pass and a sampling
pass.

For vertex AO, points (i.e., D3DPT_POINTLIST) are sent through
the new shader pair, each occupying a single pixel--containing the
weight and test result--in the render target. With these laid out
contiguously in vertex order, the target contents can be simply
iterated over in software.
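
As a sketch, the sampling pass boils down to a handful of calls on a D3D9-style device (resource creation, the depth pass, and constant setup are omitted; all names here are placeholders):

// Sampling pass for vertex AO: one output pixel per vertex
g_pDevice->SetRenderTarget(0, pSampleSurface);   // weight/test target
g_pDevice->SetVertexShader(pSamplingVS);         // per-element transform/weight
g_pDevice->SetPixelShader(pSamplingPS);          // depth comparison
g_pDevice->SetTexture(0, pDepthMap);             // depth pass output
g_pDevice->DrawPrimitive(D3DPT_POINTLIST, 0, numVertices);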

In the case of texture baking, the process is slightly more involved,
as triangles are rasterized instead using non-overlapping UVs specified
by the artist or generated automatically. The weight is also calculated
per-pixel, with the vertex shader performing setup instead.

Read-back has been reduced somewhat compared to the first try,
assuming that surface resolution (number of elements) is lower than
depth buffer resolution--under-sampling would occur otherwise. A
jump in speed should be expected as well, depending on relative
GPU and CPU muscle, plus the amount of time spent optimizing the
last version. The new process is also simpler since the dedicated
hardware takes care of lower-level computations.

Despite these gains, room for improvement remains, as the pipeline
is still stalled by read-back. If the summation stage were to be
moved to the graphics card, the GPU would be kept busy and we would
only have to read the totals back at the end. This is possible via
ps.2.0 shaders and high-precision targets, but cards that support
these features are not ubiquitous, and the extra accuracy can come
with a speed and storage hit. As the final version reveals, however,
with a little craftiness the GPU can perform partial summation without
these features, cutting read-back traffic by well over an order of
magnitude.

**Third Time Lucky**

A split ramp, a neat idea borrowed from [James03], is the solution
to the problem of accumulating values in an 8-bit-per-component
buffer. Consider the case of vertex AO from the previous attempt.
Rather than just passing the weight through the pixel shader, the
value can be used to index a special texture which splits it across
color components. High bits are returned in one channel and the
low bits in another.

The issue of precision has been avoided up until now, but as the
results will show, 8 bits are plenty for a single weight or sample--for
preview purposes, anyway. Accordingly, the split ramp chops the
weight into two 4-bit parts over R and G. Samples can be handled
in the same way with the weight in B and A, masked by the visibility
result. The empty high bits in all components allow values to be
accumulated via alpha blending, saturating after 16 passes or so.
Figure 7 illustrates this extended process for a single element.
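
As a concrete sketch, the ramp contents and the blend state that drives the accumulation might look like this (the texture format and names are assumptions, not the article's code):

// 256x1 D3DFMT_A8R8G8B8 split ramp: weight nibbles go to R (high) and
// G (low), with copies in B and A for the visibility-masked sample; the
// top four bits of every channel stay clear for accumulation
for (int i = 0; i &lt; 256; ++i)
{
    unsigned hi = i &gt;&gt; 4, lo = i & 15;
    texels[i] = (lo &lt;&lt; 24) | (hi &lt;&lt; 16) | (lo &lt;&lt; 8) | hi; // A, R, G, B
}
// Accumulate the block of iterations with additive alpha blending
g_pDevice->SetRenderState(D3DRS_ALPHABLENDENABLE, TRUE);
g_pDevice->SetRenderState(D3DRS_SRCBLEND, D3DBLEND_ONE);
g_pDevice->SetRenderState(D3DRS_DESTBLEND, D3DBLEND_ONE);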

After every block of iterations, the host reconstructs sums from
the target, adding them to running totals held in main memory for
each element. The target is then cleared to zero for the next set
of iterations.
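
Host-side reconstruction might then look like the following (a sketch that assumes the channel layout above; in memory, little-endian A8R8G8B8 bytes are ordered B, G, R, A):

// Rebuild sums for element i from its locked texel
const unsigned char* p = lockedBits + i * 4;
weightTotal[i] += p[2] * 16 + p[1];   // R, G: high nibbles count 16x
sampleTotal[i] += p[0] * 16 + p[3];   // B, A: visibility-masked samples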

We have reduced the frequency of the read-back to 1/16th, on top
of the earlier improvements, and processing is now blazingly fast
for typical game meshes and texture sizes. Timings and analysis
are provided later in the article.

**Practicalities**

There are several practical issues I have glossed over that can
affect quality and correctness, some of which are independent of
this case study. While perspective projection could be used for
depth map rendering, as suggested by earlier talk of a "sphere
of lights", orthographic projection suits our needs perfectly.

With the former, extra shader math is required because the eye
direction (or sample direction, depending on your point of view)
varies, and resolution is biased towards near features. Orthographic
rendering, on the other hand, offers uniformity, and constructing
a tight frustum is effortless: basing its dimensions on the scene's
bounding sphere gives consistency across all viewpoints.
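
With D3DX, setting up such a frustum might look like this (a sketch; the bounding-sphere radius, and placing the eye one radius back from the scene center along the sample direction, are assumptions):

// Orthographic frustum sized from the scene's bounding sphere, so every
// viewpoint covers the same volume at the same resolution
D3DXMATRIX proj;
float d = sphereRadius * 2.0f;
D3DXMatrixOrthoLH(&proj, d, d, 0.0f, d); // width, height, near, far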

With Direct3D, as used here, the issue of rasterization rules comes
into play. Geometry is sampled at pixel centers--a mismatch with
texture lookup, which reads from texel edges. When rendering to
a texture with subsequent reading in another pass, it's simplest
to adjust the rasterization coordinates, and [Brown03] presents a
clean way to achieve this. It's a good idea to get this aspect set
up and tested with predictable data from the outset, so that you
can avoid problems later, subtle or otherwise.
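
One common fix, consistent with the c0 "Rasterisation offset" constant in the listings, is a half-pixel shift applied in clip space (a sketch; the signs depend on your conventions):

// Half a pixel is 1/width (or 1/height) in clip space; the vertex shader
// applies this as pos.xy += offset.xy * pos.w, before the divide by w
float offset[4] = { -1.0f / targetWidth, 1.0f / targetHeight, 0.0f, 0.0f };
g_pDevice->SetVertexShaderConstantF(0, offset, 1);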

Filtering is another easily forgotten but important issue when
using certain lookup tables. With a split ramp, point sampling is
required to ensure a correct value is returned in regions where
the low bits wrap around.
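
In D3D9 terms, that means forcing point filtering on the ramp's sampler (a sketch, assuming the ramp sits in stage 1 as in Listing 1):

// Any bilinear blend between adjacent ramp entries would corrupt the
// packed nibbles, so disable filtering on the split ramp entirely
g_pDevice->SetSamplerState(1, D3DSAMP_MAGFILTER, D3DTEXF_POINT);
g_pDevice->SetSamplerState(1, D3DSAMP_MINFILTER, D3DTEXF_POINT);
g_pDevice->SetSamplerState(1, D3DSAMP_MIPFILTER, D3DTEXF_NONE);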

All of the variations in the next section make use of an 8-bit
pseudo depth buffer. In most cases it's possible to swap in a more
accurate 16-bit version (using a splitting scheme) without needing
a separate pass for the depth comparison. Hardware shadow mapping
can also be used when supported.

**Implementations**

It has already been noted that AO can be calculated and stored
at every mesh vertex or separately as an occlusion map. The former
is a good option with constant ambient lighting, provided that surfaces
are sufficiently tessellated to capture shadow changes well. This
requirement goes away with the latter, and while an AO texture incurs
a storage cost, one may be able to pack the channel with another
map in order to save a texture stage.

Another option is available when high-poly models are used for
normal map generation. The extra information can be used to compute
a more detailed AO texture, at the cost of additional processing
time.

Also, precomputed radiance transfer can go beyond AO (and bent
normals) through extra terms, handling low-frequency lighting situations
more accurately--again at the cost of storage, although the data
can be compressed. Rendering is affected as well, since SH lighting
requires some additional support.

These situations are all dealt with in the remainder of this section,
particularly with regard to accelerated preprocessing.

**Vertex Baking**

There is very little more to say on the workings of vertex AO, and
the shader code in Listing 1 should be self-explanatory.

**Listing 1.** Depth and sampling shaders for vertex AO.

////////////////////////////////////////////////
// depth.vsh
////////////////////////////////////////////////
vs.1.1
// c0 : Rasterisation offset
// c1-4 : World*View*Proj. matrix
dcl_position v0
// Output projected coordinates
m4x4 r0, v0, c1
mad oPos, r0.w, c0, r0
// Output depth via diffuse colour register
mov oD0, r0.z
////////////////////////////////////////////////
// sampling_v.vsh
////////////////////////////////////////////////
vs.1.1
// c0 : Rasterisation offset
// c1-4 : World*View*Proj. matrix
// c5 : Sample direction
def c8, 2.0, -2.0, -1.0, 1.0
def c9, 0.5, -0.5, 0.0, 1.0
def c10, 0.0, 0.0, 0.0, 0.504
dcl_position v0
dcl_normal v1
dcl_texcoord v2
// Scale and offset texture coordinates
// to [-1, 1] range for render target
mad r0.xy, v2.xy, c8.xy, c8.zw
mov r0.zw, c9.zw
// Output coordinates for rasterising
mad oPos, r0.w, c0, r0
// Project vertex coordinates
m4x4 r0, v0, c1
// Output depth via diffuse colour register
// (for consistency with depth pass)
mov oD0, r0.z
// Output bias for depth test
mov oD1, c10.w
// Scale and offset projected coordinates
// for depth map lookup:
// x' = x*0.5 + 0.5*w
// y' = -y*0.5 + 0.5*w
// z' = 0
// w' = w
mul r0, r0, c9
mad oT0, r0.w, c9.xxzz, r0
// Cosine weighting: max(N.s_i, 0)
dp3 r0.z, v1, c5
max r0.z, r0.z, c9.z
// Output weight, to be split via ramp lookup
mov oT1.x, r0.z
mov oT1.yzw, c9.zzw
////////////////////////////////////////////////
// sampling_v.psh
////////////////////////////////////////////////
ps.1.1
def c0, 1.0, 1.0, -1.0, -1.0 // Sample mask
tex t0 // Depth
tex t1 // Weight (R & G), sample (B & A)
// Compute depth difference, with
// a bias added for cnd (0.5 + 1/255)
sub r0.a, t0.a, v0.a
add r0.a, r0.a, v1.a
// Output weight and sample (zero if occluded)
mul_sat r1, t1, c0
cnd r0, r0.a, t1, r1

**Texture Baking**

For texture AO, weights must be calculated per-pixel. To this end,
a transform of the normal into world-space replaces previous weight
operations in the vertex shader. The interpolated normal is then
used to index a cube-map split ramp, which performs the necessary
normalization per-pixel. The remaining math comes courtesy of the
ramp, and we arrive at the split weight as before. This all works
because the world z-axis is the sample direction, which points
opposite to the camera's view direction.

With a little effort, these two versions can be unified. A cube-map
lookup might come at a cost for vertex AO, but with a little fiddling
it's possible to set things up so that only the split texture need
change between the cases.

**Texture Baking II**

While a texture offers higher fidelity than vertex occlusion, in
light of normal mapping the additional resolution is somewhat under-used.
There are a couple of ways to increase texture detail if high-poly
(HP) reference geometry is available.

The first approach is to make use of the derived normal map and
calculate the weight in the pixel shader, split via a dependent
texture look-up--still manageable in ps.1.1 without spreading sampling
work across passes. This gives the extra detail of the HP model
with the coarse geometric occlusion of the low-poly (LP) mesh.

For more faithful reproduction of the extra detail, ATI's Normal
Mapper tool has the option of sampling AO per texel during normal
map generation. This is rather costly, however; a more frugal approach
would be to precalculate AO at the vertices of the HP mesh using
hardware. Occlusion then becomes just another attribute sampled
by the ray caster (as described by [Cignoni99]).

**Precomputed Radiance Transfer**

While a single occlusion term is extremely compact, a useful lighting
component for games right now, and stretchable beyond its assumptions,
it has its limits. Precomputed radiance transfer
(PRT) using the SH basis functions [Sloan02] generalizes AO (the
0^{th} term) and can capture additional directionality and
therefore other effects. With nine or more transfer components,
soft shadows noticeably track as lights move. Although this comes
at the cost of increased storage, vectors can be efficiently compressed
through Clustered Principal Component Analysis (CPCA) [Sloan03].

The assumptions for PRT to work, at least in the form presented
here, are as with AO except that incoming radiance is assumed to
be distant rather than constant. Lighting, now approximated via
basis coefficients, can be factored out with the remaining integral
evaluated for all surface points, yielding a transfer vector. Without
compression, outgoing radiance *R*_{p} is reconstructed
as a large inner product between the vector *L* (expressing
environmental lighting) and the transfer vector *T*_{p}
scaled by the diffuse surface response, as Equation 4 shows.
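
As a reconstruction from the description (with *ρ*_{p} standing for the diffuse surface response), Equation 4 has roughly the form:

$$R_{p} \approx \rho_{p} \, (L \cdot T_{p}) = \rho_{p} \sum_{i} L_{i} \, T_{p,i}$$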

So our occlusion term is now a vector, computed in much the same
way as with AO, except for the presence of basis functions B_{i}.
Indeed, Equation 5 is very similar to Equation 3, but with a summation
for each term.
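
Again reconstructing from the description, each transfer component would take roughly this form, mirroring Equation 3 with the extra basis factor:

$$T_{p,i} \approx \frac{1}{w} \sum_{j} B_{i}(s_{j}) \, V(p, s_{j}) \, H(n_{p} \cdot s_{j})$$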

For our hardware implementation, each basis function B_{i}
is evaluated in software and uploaded as a constant, since the sampling
direction is fixed for a given viewpoint. Scaling and biasing is
required with these terms so that they fall within the range [0,1],
to prevent clipping during split ramp lookup. Accumulated values
are later range-expanded in software.
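
Per viewpoint, the host-side packing might look like this (a sketch: EvalBasis is a hypothetical helper, and the scale and bias values are chosen per term so the scaled-and-biased result stays within [0,1], matching the c6 comment in Listing 2):

// Two basis terms evaluated in software for the current sample direction
float B0 = EvalBasis(0, s), B1 = EvalBasis(1, s);
float c6[4] = { B0 * scale0, B1 * scale1, bias0, bias1 };
g_pDevice->SetVertexShaderConstantF(6, c6, 1);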

Assuming a fairly liberal number of PRT components, it makes sense
to factor out parts of the sampling process that are constant over
all components: namely the weights, currently packed with the samples,
and visibility determination.

With the operations moved to separate passes that write their results
to additional textures, shader resources are freed up, thus allowing
two PRT components to be processed in parallel per sampling pass--at
least in the case of vertex baking.

One side benefit of this extra partitioning is that it's now easier
to swap in a more accurate set of depth render and depth test passes,
perhaps at run-time based on a quality setting. Listing 2 provides
just the shaders for packed PRT sampling as the rest (depth, depth
compare and weights passes) can be derived from Listing 1.

**Listing 2.** Shaders for packed PRT sampling.

////////////////////////////////////////////////
// sampling_prt_v.vsh
////////////////////////////////////////////////
vs.1.1
// c0 : Rasterisation offset
// c1-4 : World*View*Proj. matrix
// c5 : Sample direction
// c6 : Packed SH basis terms:
// : t0*scale0, t1*scale1, bias0, bias1
def c8, 2.0, -2.0, -1.0, 1.0
def c9, 0.5, -0.5, 0.0, 1.0
dcl_position v0
dcl_normal v1
dcl_texcoord v2
// Scale and offset texture coordinates
// to [-1, 1] range for render target
mad r0.xy, v2.xy, c8.xy, c8.zw
mov r0.zw, c9.zw
// Output coordinates for rasterising
mad oPos, r0.w, c0, r0
// Output coordinates for depth result lookup
mov oT0, v2
// Cosine weighting: max(N.s_i, 0)
dp3 r0.z, v1, c5
max r0.z, r0.z, c9.z
// Sample = B(s)*[V(s)]*Hn(s), scaled and
// biased to [0, 1] range
// V(s) is evaluated in the pixel shader
//
// The following values are passed through
// a pair of 2D split ramps:
// Scaled and biased samples in x and y
mad oT1.xy, r0.zz, c6.xy, c6.zw
mov oT1.zw, c9.zw
// Biases in x and y (occluded case)
// Note: since these are fixed, a pixel
// shader constant could be used instead
mov oT2.xy, c6.zw
mov oT2.zw, c9.zw
////////////////////////////////////////////////
// sampling_prt_v.psh
////////////////////////////////////////////////
ps.1.1
tex t0 // Depth test result - V(s)
tex t1 // Sample0, sample1
tex t2 // Bias0, bias1
// Use depth test result to mask packed samples
mov r0.a, t0.a
cnd r0, r0.a, t1, t2