We currently see four primary modes of CPU Onloading and are working on identifying workloads for each one.
Intra-frame CPU Onloading is arguably one of the most useful approaches: the CPU generates a component of a frame, such as the UI or particles, and the GPU then consumes that component as an uploaded buffer. This approach requires buffered rendering because current-generation graphics APIs provide no mechanism for cooperative rendering within a frame.
With intra-frame CPU Onloading, a developer can easily tune for platform utilization by using performance heuristics to determine the cost of these intra-frame workloads and adjusting them for a specific CPU/GPU configuration. A good example of intra-frame CPU Onloading is moving the particle system to the CPU.
It's possible to get good performance in a particle system with a direct quad rasterizer, using tiling and binning to generate multiple tasks for parallel processing and leveraging the Single Instruction Multiple Data (SIMD) units to process and render up to 8 particles at once.
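The binning step can be illustrated with a minimal sketch. It is not the rasterizer described here; the `Particle` layout, the `TileBins` structure and the `binParticles` function are hypothetical names chosen for illustration. The sketch assigns each particle quad to every screen tile its bounding box overlaps, so that each bin could later be handed to a separate task.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical screen-space particle: centre and half edge length, in pixels.
struct Particle {
    float x, y;
    float halfSize;
};

// One bin per screen tile. Each bin holds the indices of the particles whose
// quad overlaps that tile, so every bin can be rasterized as its own task.
struct TileBins {
    int tilesX = 0;
    int tilesY = 0;
    std::vector<std::vector<std::size_t>> bins;  // row-major, tilesX * tilesY
};

TileBins binParticles(const std::vector<Particle>& particles,
                      int width, int height, int tileSize) {
    assert(width > 0 && height > 0 && tileSize > 0);

    TileBins out;
    out.tilesX = (width + tileSize - 1) / tileSize;   // round up
    out.tilesY = (height + tileSize - 1) / tileSize;  // round up
    out.bins.resize(static_cast<std::size_t>(out.tilesX) * out.tilesY);

    for (std::size_t i = 0; i < particles.size(); ++i) {
        const Particle& p = particles[i];
        const float minX = p.x - p.halfSize, maxX = p.x + p.halfSize;
        const float minY = p.y - p.halfSize, maxY = p.y + p.halfSize;

        // Cull quads that lie completely outside the render target.
        if (maxX < 0.0f || maxY < 0.0f || minX >= width || minY >= height) continue;

        // Clamp the bounding box to the target, then convert pixels to tiles.
        const int tx0 = static_cast<int>(std::max(minX, 0.0f)) / tileSize;
        const int ty0 = static_cast<int>(std::max(minY, 0.0f)) / tileSize;
        const int tx1 = static_cast<int>(std::min(maxX, static_cast<float>(width - 1))) / tileSize;
        const int ty1 = static_cast<int>(std::min(maxY, static_cast<float>(height - 1))) / tileSize;

        for (int ty = ty0; ty <= ty1; ++ty)
            for (int tx = tx0; tx <= tx1; ++tx)
                out.bins[static_cast<std::size_t>(ty) * out.tilesX + tx].push_back(i);
    }
    return out;
}
```

Because no two bins share output pixels, the tiles can be rasterized in parallel without locking. Within a bin, particles could then be processed in groups sized to the SIMD width (for example, eight 32-bit floats fit in a 256-bit register), which is one way to arrive at the figure of 8 particles at once.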
Image 1: The Onloaded Shadows sample illustrates a Full Pipeline CPU Onloading technique
Screen space CPU Onloading is also a potentially interesting approach because it can be easily pipelined n-1 frames deep, and post-processing and other full-screen effects have a high GPU cost. In general, the approach when performing screen space Onloading is to render the scene to a render target, which is then read by the CPU. The surface is separated into tiles so that the work is split into small chunks that can be processed in parallel.
We use tasking to process the data asynchronously. To get the most CPU performance out of this workload, it's important to vectorize the data so that the SIMD lanes stay busy and each instruction performs multiple operations. The CPU's flexibility and cache layout also allow further optimization, for example performing tone mapping via an accumulator.
CPU Onloaded data generation is an area we haven't yet done much work in, but we believe there is potential for custom rasterizers for shadow map generation on the CPU, along with heat generation, etc.
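One way to picture pipelining n-1 frames deep is a ring of n surfaces: the GPU result for the current frame goes into one slot while the CPU works on the slot that is n-1 frames old. The sketch below is an assumption-laden illustration rather than the mechanism used here; `FrameRing` and `submit` are hypothetical names, and in a real renderer the slots would be CPU-readable copies of the render target rather than plain values.

```cpp
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

// Hypothetical ring of `depth` surfaces. Each call to submit() stores the
// surface produced for the current frame and returns the surface that is
// depth-1 frames old, which is the one the CPU should process next.
template <typename Surface>
class FrameRing {
public:
    explicit FrameRing(std::size_t depth) : slots_(depth) { assert(depth >= 1); }

    // Returns nullptr while the pipeline is still filling. The returned
    // pointer refers to the slot that the next submit() will overwrite, so
    // the CPU work on it must finish before the next frame is submitted.
    const Surface* submit(Surface surface) {
        slots_[frame_ % slots_.size()] = std::move(surface);
        ++frame_;
        if (frame_ < slots_.size()) return nullptr;  // warm-up frames
        return &slots_[frame_ % slots_.size()];      // oldest surface in the ring
    }

private:
    std::vector<Surface> slots_;
    std::size_t frame_ = 0;  // number of frames submitted so far
};
```

With a depth of 3, the first two calls return nothing and the third returns the surface from two frames earlier. The added latency is the price for not stalling the GPU while the CPU reads the surface back.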
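As a rough sketch of tiled tone mapping with an accumulator, the code below sums log-luminance per tile, reduces the per-tile sums to a log-average, and then applies a simple global operator. The choice of a Reinhard-style operator, the `1e-4` offset, the `0.18` key value and all function names are assumptions made for illustration; the text does not specify which operator is used. The tile loop is written serially, but each tile call is independent and could be submitted as its own task.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Accumulates log-luminance over one tile [x0, x1) x [y0, y1). Each call
// reads only its own tile, so calls can run as independent tasks.
double accumulateTileLogLuminance(const std::vector<float>& luminance, int width,
                                  int x0, int y0, int x1, int y1) {
    double sum = 0.0;
    for (int y = y0; y < y1; ++y)
        for (int x = x0; x < x1; ++x)
            // The small offset (an assumption) avoids log(0) on black pixels.
            sum += std::log(1e-4 + luminance[static_cast<std::size_t>(y) * width + x]);
    return sum;
}

// Splits the surface into tiles, accumulates each tile, and reduces the
// partial sums to the log-average luminance of the whole surface.
double logAverageLuminance(const std::vector<float>& luminance,
                           int width, int height, int tileSize) {
    assert(width > 0 && height > 0 && tileSize > 0);
    assert(luminance.size() == static_cast<std::size_t>(width) * height);

    double total = 0.0;
    for (int ty = 0; ty < height; ty += tileSize)
        for (int tx = 0; tx < width; tx += tileSize)
            total += accumulateTileLogLuminance(luminance, width, tx, ty,
                                                std::min(tx + tileSize, width),
                                                std::min(ty + tileSize, height));
    return std::exp(total / (static_cast<double>(width) * height));
}

// Reinhard-style global operator (an illustrative choice): scale by
// key / logAverage, then compress into the range [0, 1).
void toneMap(std::vector<float>& luminance, double logAverage, float key = 0.18f) {
    assert(logAverage > 0.0);
    const float scale = static_cast<float>(key / logAverage);
    for (float& l : luminance) {
        const float scaled = scale * l;
        l = scaled / (1.0f + scaled);
    }
}
```

The per-pixel loops operate on contiguous floats with no branches, which is the kind of layout a compiler or hand-written intrinsics can vectorize across SIMD lanes.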
Full pipeline CPU Onloading is the most naïve approach to Onloading and entails using a CPU software rasterizer to render asynchronous components of a scene. The sample “Onloaded Shadows” performs this type of Onloading by using the Microsoft* WARP 10 rasterizer on a single core [Image 1].
Due to the poor mapping of graphics-specific APIs like Direct3D* 10 to CPU architecture, this approach is the least efficient, and care should be taken when evaluating this type of usage to ensure that it is a net win for platform performance.
Figure 3: Pipelining