Gamasutra is part of the Informa Tech Division of Informa PLC

This site is operated by a business or businesses owned by Informa PLC and all copyright resides with them. Informa PLC's registered office is 5 Howick Place, London SW1P 1WG. Registered in England and Wales. Number 8860726.

Gamasutra: The Art & Business of Making Gamesspacer
Sponsored Feature: CPU Onloading - Leveraging the PC Platform
View All     RSS
November 27, 2020
arrowPress Releases
November 27, 2020
Games Press
View All     RSS

If you enjoy reading this site, you might also want to check out these UBM Tech sites:


Sponsored Feature: CPU Onloading - Leveraging the PC Platform

April 27, 2011 Article Start Previous Page 3 of 3

Onloading in action: HDR Post Processing

As an example of Screen Space CPU Onloading, we can look at HDR Post Processing. We're able to get the performance of this implementation to around 2ms per frame on today's high end CPUs with an asynchronous copy operation done that limits our minimum total frame cost to 3.5ms. Latency hiding is accomplished by pipelining the work across the CPU and GPU so the EUs are working on rendering the next frame while the CPU is performing the post effects. [Figure 3]

Image 2: Tiling

Previously the importance of tasking and vectorizing your code for maximal performance on the CPU was addressed. Applying these concepts to screen space effects is fairly straightforward. In keeping the workload fairly comparable in terms of performance measurement across both devices we'll use a float16 render target.

This render target is copied to the CPU where the processing and filtering occurs. Tasking enables us to easily break up the work of a full screen filtering pass by breaking up the buffer into tiles [Image 2]. Each tile can be worked on independently and each stage of the algorithm can be set as a dependency.

Figure 4

Leveraging the SIMD capabilities of the CPU can maximize our performance. By using the SIMD units we can perform an operation on multiple pieces of data with a single instruction. We can identify opportunities in this particular algorithm beginning with the down-sample step.

We take a 4x4 block of 16 pixels and begin down-sampling the pixels per component using vector addition [Figure 4]. SIMD data is processed “vertically” [Tonteri10] so we need to take a set of four of these vertical (partial) sums and perform a transpose to maximize SIMD utilization.

One additional sum after the transpose gives us our final 4 sums of 16 pixels each (64 original pixels). To complete the down-sampling, multiply by 1/16th for the average.

Figure 5

A similar approach can be taken to calculate the luminance using SIMD as well on four (SSE) or eight (Intel® AVX) averages. The CPU's flexibility is an advantage when calculating the luminance. The classic pixel shader technique requires rendering a full-screen quad into a “luminance” render target which is then down-sampled; potentially needing several passes. Using the CPU, you simply compute the luminance and accumulate to a per-tile global.

Compute the remainder of your post processing (i.e. blur, tone-map and bloom) using similar methods. Once the post processing is complete copy the result to the back buffer. You've now Onloaded your post processing and have freed up that time for the graphics hardware to perform more graphics work!

Now Go Experiment

Now that you've read through a CPU Onloaded workload and have a grasp of the basic techniques, we'd like to suggest you do some experimentation on your workloads and explore these techniques further.

While this is a reasonable introduction to CPU Onloading, the possibilities are vast and maximum potential can be realized when using workloads that have been carefully evaluated for their implementation efficiency on the CPU versus your graphics target.


Doug McNabb was a significant reviewer and contributor to this article in terms of illustrations and core technical content. Additional thanks to Zane Mankowski, Steve Smith, Doug Binks and Artem Brizitsky for their work on developing the Onloaded Shadows sample.

Author Bio:

Josh Doss began his career at 3Dlabs enabling some of the first applications to leverage high-level shaders. Josh joined Intel in 2006 to write some of the first software targeting the Intel® architecture (codenamed Larrabee) and is now working on a team that creates developer samples targeting upcoming Intel platforms.


Doss, J & Mcnabb, D. Increase your FPS with CPU Onload [PDF Slides]. Slides presented at GDC 2011, San Francisco CA.  Retrieved from

Harris, M. GPU Physics [PDF Slides]. Slides presented at SIGGRAPH 2006, Boston MA. Retrieved from

Steam Hardware and Software Survey. (March 2011). Retrieved from

Tonteri, T. (2010). A Practical Guide to using SSE SIMD with C++. Retrieved February 22nd 2011 from

Article Start Previous Page 3 of 3

Related Jobs

innogames — Hamburg, Germany

Concept Artist - New Mobile Game
Remedy Entertainment
Remedy Entertainment — Espoo, Finland

Development Manager (Xdev Team)
Lost Boys Interactive
Lost Boys Interactive — austin, Texas, United States

FX/Tech Artist
The Gearbox Entertainment Co.
The Gearbox Entertainment Co. — Frisco, Texas, United States

Enemy Designer

Loading Comments

loader image