Gamasutra: The Art & Business of Making Games
Sponsored Feature: Onloaded Shadows: Moving Shadow Map Generation from the GPU to the CPU

January 26, 2011


A resolution of 1024x768 was used in our performance evaluation. Every scenario used a 2000 ms asynchronous update time; the variables were the buffer size, the number of cascades, and the number of cascades processed synchronously on the GPU. For example, 1408x4+1 means that 4 cascades at 1408x1408 resolution were used and 1 cascade was processed on the GPU.
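To make the scenario labels concrete, here is a minimal Python sketch that decodes a label such as 1408x4+1 into its three numbers. The names ShadowConfig and parse_label are hypothetical and are not part of the original sample; the sketch only mirrors the notation described above and does not decide whether the GPU cascade is counted among the listed cascades.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class ShadowConfig:
    cascade_size: int   # width and height of each cascade buffer, in texels
    cascades: int       # number of cascades listed in the label
    gpu_cascades: int   # cascades processed synchronously on the GPU


def parse_label(label: str) -> ShadowConfig:
    """Decode a label of the form '<size>x<cascades>+<gpu_cascades>'."""
    match = re.fullmatch(r"(\d+)x(\d+)\+(\d+)", label)
    if match is None:
        raise ValueError(f"unrecognized scenario label: {label!r}")
    size, cascades, gpu_cascades = (int(g) for g in match.groups())
    return ShadowConfig(size, cascades, gpu_cascades)


if __name__ == "__main__":
    print(parse_label("1408x4+1"))
```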

Four machines were used to compare the Onloaded technique with the naive asynchronous GPU technique. The machines labeled ‘SNB GT1' and ‘SNB GT2' were 2nd Generation Intel Core processor-based machines with 2.2 GHz processors and 4 GB of RAM.

The machine labeled ‘FX 770M' had an NVidia Quadro FX 770M and dual-core 2.8 GHz processors with 4 GB of RAM. The machine labeled ‘HD 5870' had an ATI Radeon HD 5870 discrete graphics card and dual-core 3.20 GHz processors with 3 GB of RAM. All machines ran Microsoft Windows 7.

The data collected are the frame-time spikes caused by the stalls that occur when the synchronous work is done. In the case of the Distributed Stall, the frame-time spike is spread out across a number of frames, so the reported value is the maximum of these spikes.

The reported values do not include the standard frame time without the spikes; this baseline is the same for all data points, so it is subtracted from each of them.
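The measurement described above can be sketched in a few lines of Python. The function name spike_overhead_ms and all frame times below are made up for illustration; they are not the measured results from the article.

```python
def spike_overhead_ms(frame_times_ms, baseline_ms):
    """Return the largest frame-time spike above the baseline, in milliseconds.

    frame_times_ms: frame times recorded around one asynchronous update.
    baseline_ms:    the standard frame time without any stall.
    """
    if not frame_times_ms:
        raise ValueError("at least one frame time is required")
    return max(t - baseline_ms for t in frame_times_ms)


if __name__ == "__main__":
    baseline = 16.0  # hypothetical steady-state frame time

    # Single stall: all synchronous work lands in one frame.
    single = [16.0, 16.0, 48.0, 16.0]
    # Distributed stall: the same work spread over four frames.
    distributed = [16.0, 24.0, 24.0, 24.0, 24.0, 16.0]

    print(spike_overhead_ms(single, baseline))       # 32.0
    print(spike_overhead_ms(distributed, baseline))  # 8.0
```

With these invented numbers the total extra time is the same in both cases, but the reported spike drops because only the worst single frame is counted.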

On the 2nd Generation Intel Core processor-based machines, the primary target of the Onloaded Shadows algorithm, the frame spike was 2 to 4 times lower than with the GPU technique. With the distributed stall optimization, the GPU technique's frame spike was significantly lower but still noticeable, while with the Onloaded technique the overhead was effectively eliminated.

On Intel® Processor Graphics, the Onloaded Shadows technique is the fastest technique for handling asynchronously calculated shadows, with or without the distributed stall optimization.

On the NVidia Quadro FX 770M, the Onloaded Shadows technique incurs more overhead than the GPU technique. This is expected because the data must be transferred from the CPU to a discrete graphics card, which is significantly slower than a transfer between a CPU and GPU on the same die. When using the distributed stall technique, Onloaded Shadows is faster in every scenario, and its frame-time spikes once again become nearly negligible.

On the ATI Radeon HD 5870, the GPU technique is much faster than the Onloaded Shadows in every aspect. The GPU is much faster at processing the shadow maps, and the overhead of copying data from the CPU to a discrete graphics card remains. This demonstrates that onloading is not a viable technique for high-end discrete graphics cards.

Note that Onloaded Shadows has another advantage not apparent with the data above: Onloaded Shadows is a more consistent technique than is the GPU technique. Because the synchronous work done is simply a copy operation, the overhead of that copy operation will never change as long as the buffer sizes and cascade counts stay the same.
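As a rough illustration of why the copy cost is fixed, the amount of data copied depends only on the buffer size and the cascade count. The sketch below assumes 4 bytes per texel, which is an assumption for illustration only; the article does not state the shadow buffer format, and copy_bytes is a hypothetical helper.

```python
def copy_bytes(cascade_size, cascades, bytes_per_texel=4):
    """Bytes transferred per update for square cascade buffers.

    bytes_per_texel defaults to 4 as an illustrative assumption;
    the source does not specify the buffer format.
    """
    return cascade_size * cascade_size * cascades * bytes_per_texel


if __name__ == "__main__":
    # Four 1408x1408 cascades, independent of camera position or scene content.
    print(copy_bytes(1408, 4))  # 31719424
```

Nothing about the scene or the camera appears in the calculation, which is the consistency property the paragraph above describes.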

However, with the GPU technique, the GPU is drawing to a shadow buffer and applying various post-processing to that buffer, which means that the speed is heavily dependent upon the camera position and the current scene complexity. As this could vary widely over the course of a video game, the frame time spike could also vary widely and be much harder to distribute evenly when using the Distributed Stall optimization.
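Because a copy has a known, fixed size, it can be divided into near-equal pieces ahead of time. The following Python sketch shows one way to do that by splitting the rows of a cascade buffer across a chosen number of frames. Splitting by rows and the name split_rows are assumptions made for illustration; the article does not say how the sample divides the copy.

```python
def split_rows(total_rows, frames):
    """Split total_rows into `frames` contiguous (start, stop) row ranges.

    Range sizes differ by at most one row, so each frame carries
    nearly the same share of the copy.
    """
    if frames <= 0:
        raise ValueError("frames must be positive")
    base, extra = divmod(total_rows, frames)
    ranges = []
    start = 0
    for i in range(frames):
        # The first `extra` frames take one additional row each.
        stop = start + base + (1 if i < extra else 0)
        ranges.append((start, stop))
        start = stop
    return ranges


if __name__ == "__main__":
    for frame, (start, stop) in enumerate(split_rows(1408, 5)):
        print(f"frame {frame}: copy rows {start}..{stop - 1}")
```

A GPU render pass has no equivalent precomputed split, since its cost per region depends on what is visible in that region.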

Should the architecture of WARP change at any point in the future to allow multi-threading without stalling the main thread, this technique could take advantage of that functionality. Testing WARP with all of the threads enabled (ignoring the stalls) on an Intel Core i7 processor demonstrates that shadow map generation on the CPU could potentially become about three times faster, going from roughly 600 ms to 200 ms.

Future Work

There are various additions that could improve Onloaded Shadows. For example, WARP is not a software rasterizer designed to be as fast as possible. A properly optimized software rasterizer could be used instead of WARP for performance benefits.

Another potential issue is fast-moving view frustums. When the view frustum moves fast enough, the viewer may briefly see a low-quality cascade up close. Frustum path prediction can solve this for certain scenarios such as pre-computed camera paths.

Additionally, instead of generating the shadows as soon as possible, shadows can be generated only when the light has moved a certain amount or if the camera has moved enough to warrant it. This reduces the workload on the CPU, freeing up resources for additional tasks.
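A minimal sketch of such an update test is shown below. The function name needs_shadow_update and both threshold values are hypothetical; suitable thresholds would depend on the scene and are not given in the article. Light directions are assumed to be unit vectors.

```python
import math


def needs_shadow_update(light_dir, last_light_dir, cam_pos, last_cam_pos,
                        max_light_angle_deg=2.0, max_cam_distance=5.0):
    """Return True if the light or camera has moved enough to regenerate shadows.

    light_dir, last_light_dir: unit-length 3D direction tuples.
    cam_pos, last_cam_pos:     3D camera positions in world units.
    Both thresholds are illustrative placeholders.
    """
    # Angle between the current light direction and the one used last time.
    dot = sum(a * b for a, b in zip(light_dir, last_light_dir))
    dot = max(-1.0, min(1.0, dot))  # guard acos against rounding error
    light_angle = math.degrees(math.acos(dot))

    # Distance the camera has travelled since the last shadow update.
    cam_moved = math.dist(cam_pos, last_cam_pos)

    return light_angle > max_light_angle_deg or cam_moved > max_cam_distance


if __name__ == "__main__":
    down = (0.0, -1.0, 0.0)
    print(needs_shadow_update(down, down, (0, 0, 0), (0, 0, 0)))   # False
    print(needs_shadow_update(down, down, (10, 0, 0), (0, 0, 0)))  # True
```

The asynchronous worker would call a check like this before queuing a new shadow map job and skip the job when it returns False.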

Finally, the Onloading technique could be expanded to other areas where graphics work does not have to be done per frame and, instead, only has to be done every few seconds. Some potential graphics techniques resulting from this could be Onloaded Environment Maps, Onloaded Lightmaps, and Onloaded Global Illumination.


Conclusion

Shadow map generation can significantly affect the speed at which a GPU can render a scene. In certain scenarios, shadow map generation can be done asynchronously instead of every frame, enhancing performance. Onloaded Shadows uses WARP to run the shadow map generation asynchronously on the CPU. Testing shows that onloading shadow map generation onto the CPU is significantly faster than the comparable GPU technique on machines with integrated graphics. This technique is also viable for certain kinds of discrete graphics cards when properly optimized.

About the Authors

Zane Mankowski was a software engineer intern in the Intel® Visual Computing Software Division. He is currently pursuing a Bachelor's degree in Computer Science at Rochester Institute of Technology.

Josh Doss, Steve Smith, and Doug Binks also contributed to this article.

References / Resources

Glaister, Andy. Windows Advanced Rasterization Platform (WARP) In-Depth Guide. MSDN. 11/08

Tuft, David. Cascaded Shadow Maps. MSDN. 06/10
