By Jason Mitchell
Gamasutra
December 5, 1997

Optimizing Direct3D Applications For Hardware Acceleration
An increasing number of 3D application developers, mainly game developers, are
writing applications that take advantage of 3D graphics acceleration via
Microsoft's Direct3D API. While writing to the API itself is relatively
straightforward, getting optimal performance out of the underlying hardware
can prove elusive. This article presents a variety of optimization techniques
as well as insight into how a 3D application interacts with Direct3D and,
ultimately, the 3D graphics accelerator. This knowledge has been gleaned
from my efforts in assisting developers with Direct3D optimization as
well as porting 3D applications to ATI's proprietary API, often from Direct3D
versions. I will also present some hard data gathered from a simple Direct3D
application run on a variety of 3D accelerators.
Optimizing Direct3D code for 3D hardware means minimizing the communication
with the 3D graphics accelerator. This translates to:
- minimizing render state changes
- separating 2D and 3D operations
- batching/stripping/fanning vertices
There is a theoretical minimum number of DrawPrimitive(),
SetRenderState(), and other
calls necessary to render a given scene. An application should minimize
communication with the 3D card in an attempt to approach this theoretical
minimum. Naturally, there are limitations which can prevent a given game
from moving as far as possible toward the theoretical minimum communication
between the API and card (e.g. the game is sort-dependent, it carries architectural
baggage from being converted from another platform, or it is poorly architected
and it is too late to change). Usually, however, there are still changes that
can be made to a Direct3D application to improve its performance on a
variety of hardware.
Minimizing SetRenderState() Calls
The single biggest optimization that can be made is minimization of SetRenderState()
calls. One way to design an application or engine to minimize these calls
is to use a material abstraction where each material is an associated
set of render states. For each frame, an application would render all
of the polygons of a given material (set of render states) and then move
on to the next material without revisiting any materials. The engine could
even traverse the materials such that the minimum number of SetRenderState()
calls is made, though this might be going a bit overboard. Of course,
it wouldn't be quite this simple with non-z-buffered applications, which
must draw polygons in depth order regardless of material. Additionally,
alpha-blended polygons should be deferred to the end of the frame and
depth-sorted. Such an architecture will go a long way toward eliminating
redundant SetRenderState()
calls.
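The ordering described above can be sketched as follows. This is a minimal illustration, not code from the test application: the Poly record, its fields, and both function names are hypothetical, and a real engine would bind actual render states where this sketch only counts material switches. It assumes material ids are non-negative integers.

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// Hypothetical polygon record (not a Direct3D type).
struct Poly {
    int   material;      // index into a hypothetical material table (>= 0)
    bool  alphaBlended;  // true if the polygon needs alpha blending
    float depth;         // distance from the viewer, larger = farther
};

// Order polygons so that opaque ones are grouped by material and
// alpha-blended ones come last, sorted back to front.
inline void SortForRendering(std::vector<Poly>& polys) {
    std::stable_sort(polys.begin(), polys.end(),
        [](const Poly& a, const Poly& b) {
            if (a.alphaBlended != b.alphaBlended)
                return !a.alphaBlended;        // opaque before blended
            if (a.alphaBlended)
                return a.depth > b.depth;      // blended: far to near
            return a.material < b.material;    // opaque: group by material
        });
}

// Count how many material switches a given draw order would cause.
inline int CountMaterialSwitches(const std::vector<Poly>& polys) {
    int switches = 0;
    int current = -1;  // nothing bound yet
    for (const Poly& p : polys) {
        if (p.material != current) {
            ++switches;
            current = p.material;
        }
    }
    return switches;
}
```

Running CountMaterialSwitches() before and after SortForRendering() on a frame's polygon list gives a rough measure of how many state changes the sort saves.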
Many graphics engines were not originally architected with any kind of
material abstraction or even hardware acceleration in mind. Such applications
can still cut out redundant SetRenderState()
calls by maintaining the current state of all of the Direct3D render states
and only changing a given render state when the hardware is not already
in the right state. ATI has taken this approach when converting a variety
of Direct3D applications to its proprietary API, which is very Direct3D-like,
and we have seen around a 20% speed boost over the Direct3D versions (even
in cases where we have to copy and convert data from D3DTLVERTEX
structures to our ATI-specific structures). Much of this speed boost is
due to the elimination of redundant SetRenderState()
calls, but some of it is attributable to being "closer to the metal."
In most cases, the only render state that will change on a transition
from one material to another is the current texture. That is, materials
will, in most cases, map to textures.
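One way to track the current state is a small shadow cache in front of the device. The sketch below is an assumption-laden illustration: FakeDevice is a stand-in that only counts calls (it is not the Direct3D device interface), and RenderStateCache, Set(), and Invalidate() are hypothetical names. In a real application the forwarded call would be the device's SetRenderState().

```cpp
#include <cassert>
#include <map>

// Stand-in for the real device: counts calls that would reach the driver.
struct FakeDevice {
    int callsIssued = 0;
    void SetRenderState(int /*state*/, unsigned long /*value*/) { ++callsIssued; }
};

// Remembers the last value written for each render state and forwards
// only the writes that actually change something.
class RenderStateCache {
public:
    explicit RenderStateCache(FakeDevice& dev) : dev_(dev) {}

    void Set(int state, unsigned long value) {
        std::map<int, unsigned long>::iterator it = shadow_.find(state);
        if (it != shadow_.end() && it->second == value)
            return;                        // already in this state: skip the call
        shadow_[state] = value;
        dev_.SetRenderState(state, value);
    }

    // Forget all cached values, e.g. after the device is recreated.
    void Invalidate() { shadow_.clear(); }

private:
    FakeDevice& dev_;
    std::map<int, unsigned long> shadow_;
};
```

The comparison costs a lookup on the CPU, which is the trade the text argues for: the application pays a cheap check to avoid a call into the driver.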
Batching polygons of a common texture has many performance benefits. It
reduces call overhead, minimizes PCI bus traffic, and, perhaps most
importantly, minimizes texel cache thrashing. Newer graphics accelerators
come with several kilobytes of
texel cache on chip. In order to keep costs low, these texel caches do
not snoop the PCI bus. That is, without PCI bus snooping these caches
may contain data that is out of sync with the actual video memory addresses
that are cached. As a result, switching textures may result in a complete
flush of the texture cache, while rendering polygons of a common texture
in a batch will dramatically increase intra-texture texel cache hits.
Many developers either ignore the redundant render state issue or assume
that the driver or hardware will check for redundancy. Games must not
rely on drivers or accelerators to check for SetRenderState()
redundancy. The game has the information to best optimize away any redundant
SetRenderState() calls.
Pushing this responsibility downward would be far less efficient than
keeping it at the application level.
In order to illustrate the performance falloff due to the addition of
SetRenderState() calls
to a render loop, I have created a simple Direct3D application which renders
and profiles five different scenes. The code for this application was
modified from the Microsoft flip3D sample application and is available
as a download from Gamasutra. The sample application
renders two quads on the screen for each scene. Each quad is rendered
as a D3DPT_TRIANGLESTRIP
of four vertices. One of the quads is screen aligned, while the other
is somewhat oblique to the screen. Figure 1 shows the two quads rendered
in the test scenes.
Figure 1 - The two quads rendered in test scenes. Scene 4 is shown.
Scene 1 consists of these two quads rendered without texture maps and with
a Gouraud shaded gradation of color from top to bottom. The function used
to render Scene 1 is shown in Listing 1.
Listing 1 - RenderTimedScene1()
DWORD RenderTimedScene1(int times_to_render)
{
    HRESULT result;
    DWORD begin_time, end_time;
    int i;

    // warm the data cache
    InitializeTestTriangle(gTestTriangle);

    // Scene 1 is untextured, so clear the current texture handle
    result = d3dDevice->SetRenderState(D3DRENDERSTATE_TEXTUREHANDLE, NULL);

    // grab timestamp
    begin_time = timeGetTime();

    for (i = 0; i < times_to_render; i++)
    {
        // render the first strip (vertices 0-3)
        d3dDevice->DrawPrimitive(D3DPT_TRIANGLESTRIP,
            D3DVT_TLVERTEX, gTestTriangle, 4, NULL);
        // render the second strip (vertices 4-7)
        d3dDevice->DrawPrimitive(D3DPT_TRIANGLESTRIP,
            D3DVT_TLVERTEX, gTestTriangle+4, 4, NULL);
    }

    // grab another timestamp
    end_time = timeGetTime();

    // elapsed time in milliseconds
    return (end_time - begin_time);
}
Scene 2 texture maps both quads with a 256x256 texture while Scene 3 texture
maps both quads with a 128x128 texture. Perspective correction and D3DTBLEND_MODULATE
texture blending are on. As you would expect, the render state for the
current texture is set once and there are no SetRenderState()
calls within the loop. Scene 4 texture maps the two quads with two different
textures as shown in Figure 1 above. Naturally, there are two SetRenderState()
calls in the loop. Scene 5 renders the same image as Scene 4 but there
are redundant SetRenderState()
calls introduced into the loop as shown in Listing 2 to show performance
degradation due to redundant SetRenderState()
calls.
Listing 2. Redundant Calls Introduced.
for (i = 0; i < times_to_render; i++)
{
    // two redundant texture changes before the one that matters
    result = d3dDevice->SetRenderState(D3DRENDERSTATE_TEXTUREHANDLE,
        gTextureOneHandle);
    result = d3dDevice->SetRenderState(D3DRENDERSTATE_TEXTUREHANDLE,
        gTextureTwoHandle);
    result = d3dDevice->SetRenderState(D3DRENDERSTATE_TEXTUREHANDLE,
        gTextureOneHandle);
    result = d3dDevice->DrawPrimitive(D3DPT_TRIANGLESTRIP, D3DVT_TLVERTEX,
        gTestTriangle, 4, NULL);

    // and again for the second quad
    result = d3dDevice->SetRenderState(D3DRENDERSTATE_TEXTUREHANDLE,
        gTextureTwoHandle);
    result = d3dDevice->SetRenderState(D3DRENDERSTATE_TEXTUREHANDLE,
        gTextureOneHandle);
    result = d3dDevice->SetRenderState(D3DRENDERSTATE_TEXTUREHANDLE,
        gTextureTwoHandle);
    result = d3dDevice->DrawPrimitive(D3DPT_TRIANGLESTRIP, D3DVT_TLVERTEX,
        gTestTriangle+4, 4, NULL);
}
I have measured the performance of a variety of current 3D graphics cards
using the test program with times_to_render set to 2500, resulting in
10,000 triangles per scene. The results for three typical cards are shown
in Figure 2 below. For each card, the time to render the 10,000 triangles
was measured in milliseconds. This number was then converted to triangles
per millisecond and normalized (divided by the Scene 1 score for the given
card). This normalization was done so that the fall-off in performance
is made clear without obscuring the issue with absolute performance comparisons.
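The conversion and normalization just described amount to a few lines of arithmetic. The sketch below is illustrative only: the function names are hypothetical, and it assumes every measured time is greater than zero. It does not reproduce any measured data from the test runs.

```cpp
#include <cassert>
#include <cmath>
#include <vector>

// Triangles rendered per scene: each loop iteration draws two quads,
// and each quad is a four-vertex strip, i.e. two triangles.
inline int TrianglesPerScene(int timesToRender) {
    return timesToRender * 2 * 2;
}

// Convert elapsed milliseconds per scene into triangles per millisecond,
// then divide by the Scene 1 figure so that Scene 1 scores 1.0.
// Assumes all timings are positive; returns an empty vector otherwise
// for the Scene 1 entry.
inline std::vector<double> NormalizedThroughput(
        const std::vector<double>& elapsedMs, int triangles) {
    std::vector<double> out;
    if (elapsedMs.empty() || elapsedMs[0] <= 0.0)
        return out;
    const double scene1 = triangles / elapsedMs[0];
    for (double ms : elapsedMs)
        out.push_back((triangles / ms) / scene1);
    return out;
}
```

For example, with made-up timings of 100 ms for Scene 1 and 200 ms for another scene, the second scene would score 0.5 regardless of how fast the card is in absolute terms.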
Figure 2. Triangles per millisecond fall-off for Scenes 1 through 5 for
a variety of 3D accelerators. The data was gathered with the application
developed for this article on a 300MHz Pentium II.
Both Card 1 and Card 2 take a minor hit for turning on texture mapping (Scene
1 to Scene 2), while Card 3 takes a significant performance hit. Changing
the size of the texture used in this limited test scenario (Scene 2
to Scene 3) does not affect performance on any of the cards. Adding
SetRenderState() calls
to each iteration of the loop to change between two textures (Scene
3 to Scene 4) incurs a performance penalty on all three cards, particularly
Card 2. Adding the redundant SetRenderState()
calls as shown in Listing 2 degrades performance further still.
I encourage developers interested in this issue to download the source
for this test application and experiment with it. I think it's a good
idea to do this kind of profiling of Direct3D performance and SetRenderState()
tracking in a developer's application as well. Intel is also devoting
time and resources to this issue and the Graphics Toolkit in their recently
released IPEAK family
of platform performance and integration tools is intended to help developers
with just this sort of workload and scene analysis.
It should be pointed out that SetRenderState()
calls are effectively 3D operations that cause communication with the
hardware in 3D mode. As a result, SetRenderState()
calls should be made in 3D mode (within a BeginScene()
- EndScene() block) to prevent the hardware from having to switch
from 2D to 3D and back again when the SetRenderState()
is executed. In the next section, 2D and 3D modes will be discussed
further.
2D and 3D Modes
Another big optimization you can make to Direct3D games is minimizing the transitions
between 2D and 3D modes via BeginScene()
and EndScene() calls. On
combination 2D-3D cards (the vast majority of 3D graphics accelerators),
the hardware incurs overhead when switching between 2D and 3D modes. Applications
should attempt to use one render block (a BeginScene()
- EndScene() pair) per frame to cut this overhead to its minimum.
Additionally, as mentioned above, all SetRenderState() calls should be
made while in 3D mode (i.e. in a render block) since they require the
hardware to switch to and from 3D mode if done while in 2D mode.
Operations such as DirectDraw
Lock(), Unlock() and Blt()
calls are 2D operations and can fail if performed within a render block.
Many applications use both 2D and 3D operations to compose a frame. Blts
are often used for heads-up displays (HUDs) and other screen-aligned overlay
primitives. If possible, these 2D blts should be deferred until the end
of the frame, after the EndScene()
and before the Flip(),
since the chip is in 2D mode at this point. Some 3D-only cards do not
support Blts. As a result, many developers will use 3D polygons for inherently
2D primitives. This is fine for 3D-only cards, but on 2D/3D cards it can
be more efficient to use Blts
for large 2D primitives such as sky/scenery backdrops. Applications
should therefore detect whether the hardware can do Blts (checked via the
various DDCAPS_BLTx flags
returned from a call to GetCaps())
and use Blt operations on
hardware that supports them. Some applications also use a least-recently-used
(LRU) scheme (or other texture management method) in the event that the
application's texture footprint is larger than the amount of video memory
available for textures. In this situation, a game may not realize that
a texture needs to be swapped into video memory until the middle of a
frame. This can result in a series of 3D-2D-3D mode switches as texture
data is moved from system to video RAM mid-frame via 2D operations. This
should be avoided, and with the right design it can be. Additionally, the
greater amount of texture memory provided by AGP can reduce this potential
performance penalty.
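The effect of ordering on mode switches can be sketched by treating a frame as a list of operations. Everything here is hypothetical: the Op categories, CountModeSwitches(), and PlanFrame() are illustrative names, and the sketch assumes the chip starts and ends each frame in 2D mode, that texture uploads and backdrop Blts can be done before the render block, and that overlay Blts can wait until after it.

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// Hypothetical per-frame operations. Uploads and Blts are 2D work;
// draws are 3D work that belongs inside the render block.
enum class Op { TextureUpload, BackdropBlit, Draw3D, OverlayBlit };

inline bool Is3D(Op op) { return op == Op::Draw3D; }

// Count 2D<->3D transitions for a frame that starts and ends in 2D mode.
inline int CountModeSwitches(const std::vector<Op>& frame) {
    bool in3D = false;
    int switches = 0;
    for (Op op : frame) {
        if (Is3D(op) != in3D) {
            in3D = !in3D;
            ++switches;
        }
    }
    if (in3D) ++switches;  // return to 2D before the flip
    return switches;
}

// Reorder a frame: 2D setup first, then all 3D draws, then overlays.
// The relative order inside each group is preserved.
inline std::vector<Op> PlanFrame(std::vector<Op> frame) {
    auto rank = [](Op op) -> int {
        switch (op) {
            case Op::TextureUpload:
            case Op::BackdropBlit: return 0;  // before the render block
            case Op::Draw3D:       return 1;  // inside the render block
            default:               return 2;  // OverlayBlit, after it
        }
    };
    std::stable_sort(frame.begin(), frame.end(),
        [&rank](Op a, Op b) { return rank(a) < rank(b); });
    return frame;
}
```

Moving a texture upload ahead of the render block requires knowing which textures the frame needs before drawing starts, for instance from a visibility pass; how that is determined is outside this sketch.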
Stripping and Fanning Vertices
In Direct3D, as in any 3D API, more than one primitive can be sent to the rendering
hardware via a single call to the API. This amortizes the call overhead
across all of the primitives rendered due to a given call. Do not make
one DrawPrimitive() call
per polygon. At the very least, primitives should be sent to the hardware
via Direct3D as a D3DPT_TRIANGLELIST
(specified using the first parameter to DrawPrimitive()).
Applications may want to experiment with the number of vertices sent per
DrawPrimitive() call since
this will affect concurrency on 3D hardware that has PCI bus-mastering
capabilities.
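A simple way to experiment with batch size is to accumulate triangles and submit them in groups. In the sketch below, TriangleBatcher and Vertex are hypothetical stand-ins (Vertex is a simplified placeholder, not D3DTLVERTEX), and Flush() only counts where a real application would issue one DrawPrimitive() call with D3DPT_TRIANGLELIST.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

struct Vertex { float x, y, z; };  // simplified placeholder vertex

// Collects triangles and submits them in groups. Each flush stands in
// for a single triangle-list draw call.
class TriangleBatcher {
public:
    // maxVertsPerCall is rounded down to a multiple of 3 (minimum 3).
    explicit TriangleBatcher(std::size_t maxVertsPerCall)
        : maxVerts_(maxVertsPerCall < 3 ? 3
                    : maxVertsPerCall - maxVertsPerCall % 3) {}

    void AddTriangle(const Vertex& a, const Vertex& b, const Vertex& c) {
        if (pending_.size() + 3 > maxVerts_)
            Flush();                       // batch is full: submit it
        pending_.push_back(a);
        pending_.push_back(b);
        pending_.push_back(c);
    }

    // Submit whatever is pending. Call this at the end of the frame
    // and whenever the render state is about to change.
    void Flush() {
        if (pending_.empty()) return;
        ++drawCalls_;                      // one draw call would go here
        vertsSubmitted_ += pending_.size();
        pending_.clear();
    }

    int drawCalls() const { return drawCalls_; }
    std::size_t vertsSubmitted() const { return vertsSubmitted_; }

private:
    std::size_t maxVerts_;
    std::vector<Vertex> pending_;
    int drawCalls_ = 0;
    std::size_t vertsSubmitted_ = 0;
};
```

Varying the constructor argument and profiling each setting is one way to find the batch size that suits a given card.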
Polygons which share vertices (including texture coordinates) and which
should be rendered with identical render states can be organized for more
concise and efficient communication with Direct3D and thus the underlying
hardware. Such groups of vertices can be rendered as a D3DPT_TRIANGLESTRIP
or D3DPT_TRIANGLEFAN. This
can be very efficient if the application's 3D models are structured accordingly,
but can be wasteful on the CPU side if the 3D structures to be rendered
are not already stripped or fanned.
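The saving from stripping is easy to quantify: n connected triangles need 3n vertices as a list but only n + 2 as a strip. The sketch below uses hypothetical helper names and plain vertex indices; StripToList() expands a strip into the equivalent list, flipping every second triangle so that all triangles keep the same winding order.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Vertices needed to send n triangles in each form.
inline std::size_t ListVertexCount(std::size_t n)  { return 3 * n; }
inline std::size_t StripVertexCount(std::size_t n) { return n == 0 ? 0 : n + 2; }

// Expand a strip of vertex indices into an equivalent triangle list.
// Odd-numbered triangles swap their last two indices to preserve winding.
inline std::vector<int> StripToList(const std::vector<int>& strip) {
    std::vector<int> list;
    for (std::size_t i = 0; i + 2 < strip.size(); ++i) {
        list.push_back(strip[i]);
        if (i % 2 == 0) {
            list.push_back(strip[i + 1]);
            list.push_back(strip[i + 2]);
        } else {
            list.push_back(strip[i + 2]);
            list.push_back(strip[i + 1]);
        }
    }
    return list;
}
```

Each quad in the test scenes is the smallest useful case: a four-vertex strip replaces a six-vertex list.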
Strive to Optimize
There are a variety of basic techniques which can be used to achieve optimal
performance in Direct3D applications, but the main idea behind them is
minimizing the communication with the 3D graphics accelerator. This can
be done by minimizing render state changes, separating 2D and 3D operations
and batching/stripping/fanning vertices. These techniques can have varying
degrees of effectiveness depending on the overall architecture of a 3D
graphics engine, but keeping these basic principles in mind will take
you a long way toward optimal performance on 3D graphics hardware.
Jason L. Mitchell is a
Software Engineer at ATI Research Inc.
(Marlborough, MA). He can be reached at JasonM@atitech.com.