This site is operated by a business or businesses owned by Informa PLC and all copyright resides with them. Informa PLC's registered office is 5 Howick Place, London SW1P 1WG. Registered in England and Wales. Number 8860726.
As an example of Screen Space CPU Onloading, we can look at HDR Post Processing. We're able to get the performance of this implementation to around 2ms per frame on today's high end CPUs with an asynchronous copy operation done that limits our minimum total frame cost to 3.5ms. Latency hiding is accomplished by pipelining the work across the CPU and GPU so the EUs are working on rendering the next frame while the CPU is performing the post effects. [Figure 3]
Image 2: Tiling
Previously the importance of tasking and vectorizing your code for maximal performance on the CPU was addressed. Applying these concepts to screen space effects is fairly straightforward. In keeping the workload fairly comparable in terms of performance measurement across both devices we'll use a float16 render target.
This render target is copied to the CPU where the processing and filtering occurs. Tasking enables us to easily break up the work of a full screen filtering pass by breaking up the buffer into tiles [Image 2]. Each tile can be worked on independently and each stage of the algorithm can be set as a dependency.
Leveraging the SIMD capabilities of the CPU can maximize our performance. By using the SIMD units we can perform an operation on multiple pieces of data with a single instruction. We can identify opportunities in this particular algorithm beginning with the down-sample step.
We take a 4x4 block of 16 pixels and begin down-sampling the pixels per component using vector addition [Figure 4]. SIMD data is processed “vertically” [Tonteri10] so we need to take a set of four of these vertical (partial) sums and perform a transpose to maximize SIMD utilization.
One additional sum after the transpose gives us our final 4 sums of 16 pixels each (64 original pixels). To complete the down-sampling, multiply by 1/16th for the average.
A similar approach can be taken to calculate the luminance using SIMD as well on four (SSE) or eight (Intel® AVX) averages. The CPU's flexibility is an advantage when calculating the luminance. The classic pixel shader technique requires rendering a full-screen quad into a “luminance” render target which is then down-sampled; potentially needing several passes. Using the CPU, you simply compute the luminance and accumulate to a per-tile global.
Compute the remainder of your post processing (i.e. blur, tone-map and bloom) using similar methods. Once the post processing is complete copy the result to the back buffer. You've now Onloaded your post processing and have freed up that time for the graphics hardware to perform more graphics work!
Now that you've read through a CPU Onloaded workload and have a grasp of the basic techniques, we'd like to suggest you do some experimentation on your workloads and explore these techniques further.
While this is a reasonable introduction to CPU Onloading, the possibilities are vast and maximum potential can be realized when using workloads that have been carefully evaluated for their implementation efficiency on the CPU versus your graphics target.
Doug McNabb was a significant reviewer and contributor to this article in terms of illustrations and core technical content. Additional thanks to Zane Mankowski, Steve Smith, Doug Binks and Artem Brizitsky for their work on developing the Onloaded Shadows sample.
Josh Doss began his career at 3Dlabs enabling some of the first applications to leverage high-level shaders. Josh joined Intel in 2006 to write some of the first software targeting the Intel® architecture (codenamed Larrabee) and is now working on a team that creates developer samples targeting upcoming Intel platforms.
Doss, J & Mcnabb, D. Increase your FPS with CPU Onload [PDF Slides]. Slides presented at GDC 2011, San Francisco CA. Retrieved from http://software.intel.com/en-us/articles/intelgdc2011/
Harris, M. GPU Physics [PDF Slides]. Slides presented at SIGGRAPH 2006, Boston MA. Retrieved from http://developer.download.nvidia.com/presentations/2006/siggraph/gpu_physics-siggraph-06.pdf
Steam Hardware and Software Survey. (March 2011). Retrieved from http://store.steampowered.com/hwsurvey
Tonteri, T. (2010). A Practical Guide to using SSE SIMD with C++. Retrieved February 22nd 2011 from http://sci.tuomastonteri.fi/programming/sse/printable