Sponsored Feature: Onloaded Shadows: Moving Shadow Map Generation from the GPU to the CPU

By Zane Mankowski

[In this Intel-sponsored Gamasutra feature, a special game-related "onloading" technique called Onloaded Shadows is explored, examining notable performance ramifications and future improvement possibilities.]

With the recent introduction of 2nd Generation Intel® Core™ processors (formerly code-named "Sandy Bridge"), graphics functionality is becoming more tightly integrated with the CPU.

There are many interesting opportunities and techniques to increase the cooperation of the CPU and GPU, including "onloading" graphics techniques, which several of my colleagues are working on.

This article explores an "onloading" technique called Onloaded Shadows, developed by Zane Mankowski with support from Josh Doss, Steve Smith, and Doug Binks. In addition to explaining the technique itself, Zane and team also include interesting performance numbers on processor graphics and discrete graphics cards.

Once you have read through the details, download the source code and give it a try.

-- Orion Granatir


Many games have outdoor scenes where the sun is often the primary light and changes direction slowly over time. Generating shadow maps for these outdoor scenes and for static objects isn't required every frame. They can be generated asynchronously to frame rendering, at a cadence of only a few times a second or even once every few seconds.

When the GPU generates these shadow maps synchronously, the workload can be split up and distributed across several frames. Alternatively, the CPU can perform this workload asynchronously with Microsoft's Windows Advanced Rasterization Platform (WARP) software rasterizer.

The Onloaded Shadows technique uses WARP to asynchronously generate shadow maps. Copying the data from the CPU to the GPU is the only synchronous work required. The overhead of the copy operation is distributed across several frames to reduce the impact.

Figure 1: Screenshot of the application with Onloaded Shadows technique.

This technique uses WARP for CPU-side rasterization to generate the shadow map on the CPU. By default, WARP uses all available cores on a system, resulting in stalls on the main thread due to thread contention. The WARP device also supports running on a single core; we've chosen this approach for Onloaded Shadows, so only two threads are in use: the main thread and the WARP thread.


Shadow Map Algorithm

The original Cascaded Shadow Maps algorithm isn't suitable for Onloaded Shadows because it's not view-invariant. A significant advantage of Cascaded Shadow Maps is that it renders a shadow map only for the areas directly intersected by the view frustum.

Because the view frustum may rotate and move quickly, the frustum can intersect areas not covered by the cascades before a new shadow map is generated by the onloaded pass.

The solution used in Onloaded Shadows is to center the cascades on the view camera, yielding lower quality shadow maps while keeping many of the advantages of the cascaded shadow map technique.

Camera movement must be slow enough to avoid the viewpoint entering the next level of the cascade prior to the generation of a new shadow map.
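As a rough illustration of the camera-centered cascades and the speed constraint described above, the following sketch builds nested square cascades around the camera and estimates how fast the camera may move before the viewpoint leaves a given cascade. The struct layout, the doubling of extent per cascade level, and all function names are assumptions made for this example; they are not taken from the sample source.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Axis-aligned square cascade in the light's projection plane (hypothetical layout).
struct Cascade {
    float centerX;
    float centerY;
    float halfExtent;  // half the side length, in world units
};

// Build `count` nested cascades, all centered on the camera position.
// Assumption: each level covers twice the extent of the previous one.
std::vector<Cascade> buildCameraCenteredCascades(float camX, float camY,
                                                 float nearestHalfExtent,
                                                 std::size_t count) {
    std::vector<Cascade> cascades;
    cascades.reserve(count);
    float halfExtent = nearestHalfExtent;
    for (std::size_t i = 0; i < count; ++i) {
        cascades.push_back({camX, camY, halfExtent});
        halfExtent *= 2.0f;
    }
    return cascades;
}

// Index of the finest cascade containing the point, or -1 if none does.
int selectCascade(const std::vector<Cascade>& cascades, float x, float y) {
    for (std::size_t i = 0; i < cascades.size(); ++i) {
        const Cascade& c = cascades[i];
        if (std::fabs(x - c.centerX) <= c.halfExtent &&
            std::fabs(y - c.centerY) <= c.halfExtent) {
            return static_cast<int>(i);
        }
    }
    return -1;
}

// Conservative speed limit (world units per second): the camera needs at
// least `halfExtent` of travel to leave the cascade, whatever its direction.
float maxSafeCameraSpeed(const Cascade& cascade, float updateIntervalSeconds) {
    assert(updateIntervalSeconds > 0.0f);
    return cascade.halfExtent / updateIntervalSeconds;
}
```

With a 2-second update interval and a cascade half-extent of 10 units, this estimate gives a limit of 5 units per second. A real implementation would also need to account for the time the copy itself takes.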

Figure 2: Screenshot of the sample implementation from the light's view with cascades visualized.

Shadow maps for the nearest cascade are generated every frame on the GPU in order to allow dynamic shadows for nearby objects. The division of cascades across GPU/CPU boundary can be adjusted depending on a performance heuristic.

Technique Overview

The main thread renders the scene using shadow map data stored on the GPU, while the WARP thread generates the shadow map asynchronously. The WARP thread copies the shadow map to a staging buffer and maps it to a subresource. The GPU then updates its shadow buffer with the mapped subresource. The new camera data is utilized once a copy is complete, and then the WARP thread is signaled to once again begin shadow map generation.
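The handshake between the two threads can be sketched with standard C++ threading primitives. This is a simplified model under stated assumptions, not the sample's code: the class and member names are hypothetical, the "rasterization" is a placeholder loop, and plain vectors stand in for the staging buffer and the GPU shadow buffer. No Direct3D or WARP calls are shown.

```cpp
#include <cassert>
#include <condition_variable>
#include <cstddef>
#include <mutex>
#include <thread>
#include <vector>

// Camera data captured when a generation pass is requested (hypothetical layout).
struct CameraSnapshot {
    float x;
    float y;
    float z;
};

class ShadowWorker {
public:
    explicit ShadowWorker(std::size_t texelCount)
        : staging_(texelCount, 0.0f), thread_([this] { run(); }) {}

    ~ShadowWorker() {
        {
            std::lock_guard<std::mutex> lock(mutex_);
            quit_ = true;
        }
        wake_.notify_all();
        thread_.join();
    }

    // Main thread: hand over new camera data and signal the worker to start.
    void requestGeneration(const CameraSnapshot& camera) {
        {
            std::lock_guard<std::mutex> lock(mutex_);
            assert(state_ == State::Idle);
            camera_ = camera;
            state_ = State::Requested;
        }
        wake_.notify_all();
    }

    // Main thread, polled once per frame: is a finished map waiting in staging?
    bool resultReady() {
        std::lock_guard<std::mutex> lock(mutex_);
        return state_ == State::Ready;
    }

    // Blocking variant, convenient for tests; a renderer would poll instead.
    void waitForResult() {
        std::unique_lock<std::mutex> lock(mutex_);
        wake_.wait(lock, [this] { return state_ == State::Ready; });
    }

    // Main thread: the only synchronous work, a copy out of the staging buffer.
    void copyResultTo(std::vector<float>& gpuShadowBuffer) {
        std::lock_guard<std::mutex> lock(mutex_);
        assert(state_ == State::Ready);
        gpuShadowBuffer = staging_;
        state_ = State::Idle;
    }

private:
    enum class State { Idle, Requested, Generating, Ready };

    void run() {
        std::unique_lock<std::mutex> lock(mutex_);
        for (;;) {
            wake_.wait(lock, [this] { return quit_ || state_ == State::Requested; });
            if (quit_) return;
            const CameraSnapshot camera = camera_;
            state_ = State::Generating;
            lock.unlock();
            // Placeholder for the software-rasterized shadow pass.
            for (std::size_t i = 0; i < staging_.size(); ++i) {
                staging_[i] = camera.z + static_cast<float>(i);
            }
            lock.lock();
            state_ = State::Ready;
            wake_.notify_all();
        }
    }

    std::mutex mutex_;
    std::condition_variable wake_;
    State state_ = State::Idle;
    bool quit_ = false;
    CameraSnapshot camera_{0.0f, 0.0f, 0.0f};
    std::vector<float> staging_;
    std::thread thread_;  // declared last so every other member exists before it starts
};
```

In a render loop, the main thread would call `resultReady()` each frame, perform the copy when it returns true, and then call `requestGeneration()` with the current camera data to start the next pass.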

Alternatively, asynchronous shadows can be naively implemented by generating shadows synchronously on the GPU every set number of frames.

In this way, a GPU technique which generates the same results can be used to compare with the Onloaded Shadows technique, and performance can be compared by looking at how much of a spike in frame time occurs during either the subresource copy (for Onloaded Shadows) or during the synchronous shadow processing (for the GPU technique).

A significant frame time spike occurring every few seconds would cause a noticeable stall and would be disadvantageous to any game or product that uses this technique. The work done during the synchronous frame is broken down into small enough pieces to cause as little impact as possible. This is called the Distributed Stall optimization.

For the Onloaded Shadows technique, the synchronous copy can be easily subdivided as far as a single byte. For the GPU technique, because the work is not homogeneous, breaking apart the shadow processing work becomes significantly more complicated.
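Because the copy is homogeneous, spreading it over frames amounts to advancing an offset by a fixed number of bytes each frame. The sketch below shows one way this could look; the class name, the per-frame byte budget, and the use of plain byte vectors in place of a mapped staging resource and a GPU buffer are all assumptions for illustration.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <vector>

// Copies `source` into `destination` a slice at a time, one slice per frame.
class DistributedCopy {
public:
    DistributedCopy(const std::vector<std::uint8_t>& source,
                    std::vector<std::uint8_t>& destination,
                    std::size_t bytesPerFrame)
        : source_(source), destination_(destination), bytesPerFrame_(bytesPerFrame) {
        assert(bytesPerFrame_ > 0);
        assert(destination_.size() >= source_.size());
    }

    // Copy the next slice; returns true once the whole buffer has been copied.
    bool copyNextSlice() {
        const std::size_t remaining = source_.size() - offset_;
        const std::size_t count = std::min(bytesPerFrame_, remaining);
        if (count > 0) {
            std::memcpy(destination_.data() + offset_, source_.data() + offset_, count);
            offset_ += count;
        }
        return offset_ == source_.size();
    }

private:
    const std::vector<std::uint8_t>& source_;
    std::vector<std::uint8_t>& destination_;
    std::size_t bytesPerFrame_;
    std::size_t offset_ = 0;
};

// Slice size needed to finish a copy of `totalBytes` within `frames` frames.
std::size_t sliceSizeForFrames(std::size_t totalBytes, std::size_t frames) {
    assert(frames > 0);
    return (totalBytes + frames - 1) / frames;  // round up
}
```

Calling `copyNextSlice()` once per frame keeps the per-frame cost bounded by the chosen slice size, which is what makes the stall easy to distribute evenly.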

For our sample implementation, the work was divided per cascade, and further divided between the original shadow draw and the various post processing passes on the shadow buffer.

Two shadow buffers are used on the GPU because the updates are performed across many frames, with one shadow buffer being copied to while the other's data is used. (If only a single shadow buffer were used, noticeable artifacts would appear while the copy was occurring.) Once the copy completes, the roles of the two shadow buffers are swapped.
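A minimal sketch of the two-buffer arrangement follows. Plain vectors stand in for the GPU shadow buffers, and the class and method names are hypothetical; the point is only that rendering reads the front buffer while the incremental copy writes the back buffer, and the swap happens after the copy has finished.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

class DoubleBufferedShadowMap {
public:
    explicit DoubleBufferedShadowMap(std::size_t texelCount) {
        for (auto& buffer : buffers_) {
            buffer.assign(texelCount, 1.0f);  // 1.0 = farthest depth, i.e. unshadowed
        }
    }

    // Buffer sampled by the renderer this frame.
    const std::vector<float>& front() const { return buffers_[frontIndex_]; }

    // Buffer receiving the in-progress copy; never sampled while being written.
    std::vector<float>& back() { return buffers_[1 - frontIndex_]; }

    // Call only after the copy into back() has completed.
    void swap() { frontIndex_ = 1 - frontIndex_; }

private:
    std::vector<float> buffers_[2];
    std::size_t frontIndex_ = 0;
};
```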



A resolution of 1024x768 was used in our performance evaluation. In all scenarios, a 2000ms asynchronous update time was used, with the differing variables being the buffer sizes, the number of cascades, and the number of synchronous cascades on the GPU (for example, 1408x4+1 means that 4 cascades at 1408x1408 resolution were used and 1 cascade was processed on the GPU).

Four machines were used in testing this technique to get a comparison between the Onloaded Technique and naive asynchronous GPU Technique. The machines labeled 'SNB GT1' and 'SNB GT2' were 2nd Generation Intel Core processor-based machines with 2.2 GHz processors and 4 GB of RAM.

The machine labeled 'FX 770M' had an NVidia Quadro FX 770M and dual-core 2.8 GHz processors with 4 GB of RAM. The machine labeled 'HD 5870' had an ATI Radeon HD 5870 discrete graphics card and dual-core 3.20 GHz processors with 3 GB of RAM. All machines ran Microsoft Windows 7.

The data collected is the frame time spikes for the stalls when the synchronous work is done. In the case of the Distributed Stall, the frame time spike is spread out across a number of frames, so the selected frame time is the maximum of these spikes.

The frame times do not include the standard frame time without the spikes; this data is equivalent for all data points, so it is subtracted off all data points.
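The measurement described above can be expressed as a small helper: subtract the baseline frame time from every frame and keep the maximum. The function name is hypothetical, and the numbers in the usage note are invented for illustration; they are not measurements from the article.

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// Largest frame-time increase over the baseline, in milliseconds.
// Returns 0 when no frame exceeds the baseline.
double maxFrameSpikeMs(const std::vector<double>& frameTimesMs, double baselineMs) {
    double spike = 0.0;
    for (double frameTime : frameTimesMs) {
        spike = std::max(spike, frameTime - baselineMs);
    }
    return spike;
}
```

For instance, with a 16 ms baseline, a single 24 ms frame counts as an 8 ms spike, while the same work spread over frames of 18, 18.5, and 17.5 ms is reported as 2.5 ms.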

On the 2nd Generation Intel Core processor-based machines, the primary target of the Onloaded Shadows algorithm, the frame spike with Onloaded Shadows was 2 to 4 times lower than with the GPU technique. When utilizing the distributed stall optimization, the GPU technique's frame spike was significantly lower but still noticeable, while with the Onloaded technique the overhead is effectively eliminated.

On Intel® Processor Graphics, the Onloaded Shadows technique is the fastest technique for handling asynchronously calculated shadows, with or without the distributed stall optimization.

On the NVidia Quadro FX 770M, the Onloaded Shadows technique incurs more overhead than the GPU technique. This is expected because the data must be transferred from the CPU to a discrete graphics card, which is significantly slower than when they are on the same die. When using the distributed stall technique, Onloaded Shadows is faster in every scenario, and once again approaches negligible frame times.

On the ATI Radeon HD 5870, the GPU technique is much faster than the Onloaded Shadows in every aspect. The GPU is much faster at processing the shadow maps, and the overhead of copying data from the CPU to a discrete graphics card remains. This demonstrates that onloading is not a viable technique for high-end discrete graphics cards.

Note that Onloaded Shadows has another advantage not apparent from the data above: it is more consistent than the GPU technique. Because the synchronous work done is simply a copy operation, the overhead of that copy operation will never change as long as the buffer sizes and cascade counts stay the same.

However, with the GPU technique, the GPU is drawing to a shadow buffer and applying various post-processing to that buffer, which means that the speed is heavily dependent upon the camera position and the current scene complexity. As this could vary widely over the course of a video game, the frame time spike could also vary widely and be much harder to distribute evenly when using the Distributed Stall optimization.

Should the architecture of WARP change at any point in the future to allow multi-threading without stalling the main thread, this technique could take advantage of that functionality. Testing WARP with all of the threads enabled (ignoring the stalls) on an Intel Core i7 processor demonstrates that shadow map generation on the CPU could potentially run about three times faster, going from roughly 600ms to 200ms.

Future Work

There are various additions that could improve Onloaded Shadows. For example, WARP is not a software rasterizer designed to be as fast as possible. A properly optimized software rasterizer could be used instead of WARP for performance benefits.

Another potential issue is fast-moving view frustums. When the view frustum moves fast enough, the viewer may briefly see a low-quality cascade up close. Frustum path prediction can solve this for certain scenarios such as pre-computed camera paths.

Additionally, instead of generating the shadows as soon as possible, shadows can be generated only when the light has moved a certain amount or if the camera has moved enough to warrant it. This reduces the workload on the CPU, freeing up resources for additional tasks.
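One possible form of such a trigger is sketched below: regenerate only when the light direction has rotated past an angle threshold or the camera has moved past a distance threshold since the last update. The types, names, and threshold values are assumptions for this example, and the light directions are assumed to be unit length.

```cpp
#include <cassert>
#include <cmath>

struct Vec3 {
    float x;
    float y;
    float z;
};

float dot(const Vec3& a, const Vec3& b) {
    return a.x * b.x + a.y * b.y + a.z * b.z;
}

// Hypothetical thresholds controlling when a new shadow map is worth generating.
struct UpdatePolicy {
    float maxLightAngleRadians;
    float maxCameraDistance;
};

// True when the light or the camera has changed enough since the last update.
// Light directions are assumed to be normalized.
bool shouldRegenerate(const Vec3& lastLightDir, const Vec3& lightDir,
                      const Vec3& lastCameraPos, const Vec3& cameraPos,
                      const UpdatePolicy& policy) {
    // For unit vectors, a smaller dot product means a larger angle between them.
    const bool lightMoved =
        dot(lastLightDir, lightDir) < std::cos(policy.maxLightAngleRadians);

    const Vec3 delta{cameraPos.x - lastCameraPos.x,
                     cameraPos.y - lastCameraPos.y,
                     cameraPos.z - lastCameraPos.z};
    // Compare squared distances to avoid a square root.
    const bool cameraMoved =
        dot(delta, delta) > policy.maxCameraDistance * policy.maxCameraDistance;

    return lightMoved || cameraMoved;
}
```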

Finally, the Onloading technique could be expanded to other areas where graphics work does not have to be done per frame and, instead, only has to be done every few seconds. Some potential graphics techniques resulting from this could be Onloaded Environment Maps, Onloaded Lightmaps, and Onloaded Global Illumination.


Shadow map generation can have a significant effect upon the speed at which a GPU is able to render a scene. In certain scenarios, shadow map generation can be done asynchronously instead of every frame, enhancing performance. Onloaded Shadows uses WARP to run the shadow map generation asynchronously on the CPU. Testing shows that onloading shadow map generation onto the CPU is significantly faster than any comparable GPU technique on machines with integrated graphics. This technique is also viable for certain kinds of discrete graphics cards when properly optimized.

About the Authors

Zane Mankowski was a software engineer intern in the Intel® Visual Computing Software Division. He is currently pursuing a Bachelor's degree in Computer Science at Rochester Institute of Technology.

Josh Doss, Steve Smith, and Doug Binks also contributed to this article.

References / Resources

Glaister, Andy. Windows Advanced Rasterization Platform (WARP) In-Depth Guide. MSDN. 11/08

Tuft, David. Cascaded Shadow Maps. MSDN. 06/10

Copyright © UBM Tech, All rights reserved