By Matt
Pritchard
Published in Game Developer Magazine, February
1995
|
Once again, we are off on our quest
for the fastest graphics performance possible. This time, we are going to
take a look at image drawing routines and the factors that affect performance.
In the process, we'll examine many factors that negatively affect performance
and the techniques you can use to minimize them. While I am going to use
Mode X image drawing as our test example, most of these factors apply to
all graphics modes from EGA 16 color to Super VGA True Color.
For the techniques described in this article, let's suppose that we are writing
a game that needs to redraw a tile-based background, like Ultima VI or Gauntlet.
We are using 256-color Mode X at 320-by-200 pixels of resolution, and our
background is 18-by-12 tiles in size. This makes for an update area of 288
pixels by 192 pixels or 55,296 total pixels. Tiles are stored in memory line
by line from left to right, top to bottom.
For testing, I will time each routine on five different CPU and video card
combinations to get a good cross- section of the machines a game designer
would expect his or her programs to run on today. Representing the lower
end of most systems is a 40MHz 386 with a slow ISA Trident VGA card. To contrast
the variations between VGA cards is another 40MHz 386 with a fast ISA ATI
Graphics Ultra Plus. Next, a 33MHz Hewlett Packard 486SX with an S3-based
VGA on the motherboard and a 66MHz 486DX2 computer with a Diamond Stealth
32 on a VLB local bus. Finally, we will test on a 90MHz Pentium computer
with a Diamond Stealth 64 PCI Bus video card. To reduce the impact of memory
caches in our test results, each routine will be called 10,000 times in a
loop using the same data buffers. The final results are summarized in Table
1.
We will start by using the Mode X drawing code from the article "Mode X Revealed"
(Dec. 1994) to write a straightforward tile-drawing routine in Borland C,
as shown in Listing 1.
Our tile-drawing routine is pretty simple and straightforward, wouldn't you
agree? It's small, flexible, uses only integers, and has just three executing
statements. It's also horribly slow and a perfect candidate for our first
speedup technique.
Speedup Technique #1
Do not call a separate pixel-plotting routine in image-drawing code. Use
inline code or functions instead. Think about how many calls it takes to
draw one screen or image.
The crux of this technique can be summed up in one word: overhead. Every
time you call a function or procedure, your program incurs the overhead of
pushing parameters, executing a far call, setting up a stack frame, decoding
parameters, cleaning up the stack, and executing a RETurn instruction. For
something simple, like plotting a single pixel, the time spent handling the
call can approach the time spent executing the core of the function.
Let's think about this: for each screen we redraw, 55,296 calls are made.
Intel's timings show a perfect world time of 24 CPU cycles on a 486 for only
the call, stack frame assignment, and return. In reality, that time can be
more like 35 cycles per call. Using a call for each pixel can eat up nearly
1.3 to 2 million CPU cycles per screen! These are CPU cycles we want to use
for other things.
Now we will rewrite our tile-draw routine, putting the set_point routine
inline using Borland C's inline assembly language command, shown in Listing
2.
Although we added code to preserve registers, this version showed considerable
improvement on our test machines. We recorded speed increases of 22%, 31%,
39%, 38%, and 10% for about a 30% average increase. Put another way, call
overhead made up the difference between 16 frames per second and 20 frames
per second! Call overhead is only the tip of the iceberg. Lurking in both
examples is one of the biggest obstacles to graphics performance: the OUT
instruction.
Speedup Technique #2
Minimize the number of OUT instructions to the video card. Switch video planes
or memory banks as infrequently as possible.
OUT instructions are the only way to access the various control registers
on a graphics card, so they can't be eliminated. You need them to set bit
planes, write masks, and select banks in just about every mode except mode
13h and CGA modes. They are necessary, and they are slow.
According to Intel, an OUT instruction takes 10 CPU cycles on a 386 and 16
cycles on a 486. While it's not the fastest instruction, 16 CPU cycles doesn't
seem that serious. This wouldn't be a problem if the timings told the whole
story, but they don't. The 16 CPU cycles reflect only the CPU processing
overhead. No mention is made of the I/O bus, or the VGA card itself. Even
on a local bus video card, OUTs still have to go through the 8MHz bus protocol,
which was designed for the original IBM PC-AT and PC/XT. In addition to the
ISA bus, the VGA card itself has to respond to the OUT and signal the computer
that it has properly processed it. Finally, as if this wasn't bad enough,
under most memory managers or operating systems, the CPU has to check the
OUT instruction for validity against the I/O permission bitmap, a process
that can add 10 to 20 more cycles.
Given all these factors, the OUT instruction actually takes anywhere from
30 to more than 70 CPU cycles, with a wide variance due to differences in
CPU, motherboard, I/O bus, and video card. With this in mind, how many OUTs
do we really need to draw one of our tiles? Four. One OUT to select each
of the four Mode X planes. If we rewrite our tile-drawing code to plot all
pixels on one plane before going on to the next, it should be much faster
since we will have cut the number of OUTs per time from 256 to 4, about a
98% reduction.
Listing 3 shows our tile-drawing code, rewritten to minimize OUTs. Timing
this version, we see speed increases of 43%, 19%, 51%, 50%, and 127% over
the last version. The scores on the 386 machines illustrate how big the variation
can be due to the video card alone. The speedy Pentium showed how factors
outside the CPU can slow it down. Compared to the routine we started with,
we are averaging about a 100% improvement.
The next couple of speedup techniques are less obvious, but share the same
basic philosophy: reduce overhead by removing unnecessary or redundant portions
of code.
Speedup Technique #3
For frequently drawn images, use specific routines with hard-coded values
instead of general-purpose routines. You can choose the variables you encode
as constants and remove functionality that you don't need.
You should think about the routines you use, especially if they come from
third-party libraries. Ask these questions about your routines:
* Do they have inputs that will always be the same?
* Do they have features that will never be needed?
* Are they more flexible than my program will ever be?
If you answer yes to these questions, then you should consider writing custom
versions of those routines for the specific task at hand.
Imagine a library routine that has neat features like a transparent color,
or clipping the image to a rectangle. But what if we know that certain images
are always solid or are never going to need to be clipped? In this case,
it means every time we use them, the computer spends part of its time checking
for things that will never happen. With a little up-front planning and awareness,
a good game designer will avoid overhead by implementing alternate versions
of those routines which don't have features that will go unused.
In our tile-drawing routine we pass in two variables, Xwide and Ywide, that
tell us how big the tile image is. But, according to our specifications,
the only size tile we will ever draw is 16-by-16. We can make a version
specifically for 16-by-16 tiles and save the overhead of passing those variables
every call. If we need to draw tiles in other sizes, we still have our
general-purpose routine, or we could write another specific-sized routine.
While looking over the tile-drawing code with this in mind, we see another
situation where the code is doing something that may be unnecessary. Why
are we doing a MULtiply every time we plot a pixel? This leads us to yet
another speedup technique.
Speedup Technique #4
Precalculate frequently needed information whenever possible. Use lookup
tables instead of calculations. Image drawing should be about transferring
data, not calculating it.
Taking a closer look at our code, we see that for every pixel plotted, one
MULtiply instruction is performed when calculating the display address. That
is not terrible, but the multiply takes anywhere from 10 to 26 CPU cycles,
and we do it 55,296 times for every screen. If we stop and think about what
is being multiplied, we are hit with this realization: we are multiplying
the same 200 numbers over and over again. Because we know how wide the screen
is going to be, and we know what Y values the pixels will be, why not do
all the multiplying once and save the results in a table? This method is
known as using a lookup table.
By using a lookup table in our tile-drawing routine, we can replace the MUL
instruction that takes 10 to 26 cycles with a lookup sequence that takes
two to four cycles. Even with timings that depend on the exact numbers
multiplied, we should easily see savings of at least a half million cycles
per screen drawn.
Our tile-drawing routine doesn't actually do justice to the benefits a lookup
table can provide. More complex calculations such as square roots, sines
and cosines, vectors, rotation, and scaling factors, can take hundreds of
CPU cycles. But these, too, can be replaced with lookups that take only a
couple of cycles. It should be no surprise that graphics-intensive games
like Wolfenstein 3-D, Doom, TIE Fighter, Wing Commander, and many more make
extensive use of lookup tables.
Now, let's redo our tile-drawing routine and apply speedup techniques 3 and
4, shown in Listing 4.
Timing this version again shows healthy speed increases of 20%, 22%, 33%,
37%, and 37%. Overall, we have achieved about a 100% speed increase on the
386 machines, 180% on the 486, and 240% on the Pentium. It seems like we
are running out of techniques that involve only modifying the code. It's
time to turn our attention toward the actual display data and the VGA card
itself. Understanding in detail how the CPU, memory, and VGA card work and
interact with each other will allow us to uncover facts that we can exploit
to further improve our routines.
Memory, it turns out, is the key. We know that video memory is slower than
normal RAM, so what's the best way to access it? To study this, I've turned
to a tool that you can find on your favorite bulletin board or online service:
Michael Abrash's Zen Timer. I've used the Zen Timer to time how fast video
memory can be written using various methods. The results are summarized in
Table 2. You will notice that 16-bit writes take the same amount of time
as 8-bit writes.
Looking at our tile-drawing code, we are using 8-bit writes to draw one pixel
at a time when we could be using 16-bit writes to draw two pixels in the
same amount of time. This brings us to our next speedup technique.
Speedup Technique #5
When writing data to the VGA card, use 16-bit writes whenever possible. They
take the same amount of time as 8-bit writes, but transfer twice as much
data.
You'll notice that I didn't list timings for 32-bit writes. Thirty-two bit
writes bring up some other issues, such as bus type and REP MOVSD interfering
with sound DMA on some 386 systems. We are going to save this issue for another
time, when we can examine 32-bit graphics in detail.
In Mode X, writing a 16-bit value will plot two pixels that are four pixels
apart instead of right next to each other. But our image data is stored linearly.
We can either gather the two pixels together before each write or use this
next speedup technique.
Speedup Technique #6
Arrange your image data in the form it will be drawn to video memory. For
Mode X or 16-color modes, separate and store image data by video planes.
The way we store our image data is up to us, the game designers. Those decisions
dictate how our graphic routines must work. This gives us another area where
we can design our routines to work as efficiently as possible. Looking at
the video memory timings some more, we come across a possible hitch with
16-bit writes. When a 16-bit value is written to an odd video memory address,
it takes twice as long as an even address. The reason for that lies in how
the VGA card and the bus work together. When the data written spans a 16-bit
boundary, the bus splits it into two separate writes. Further testing shows
that on the local bus video cards tested, 16-bit writes slowed down only
on odd addresses that cross double word boundaries. Aligning write data to
the size of the video bus (16 or 32 bits) gives us another speedup technique.
Speedup Technique #7
When writing images, align the data on word and double word boundaries whenever
possible. It may be advantageous to have separate routines for even and odd
image destinations.
This creates a problem for routines that can draw an image at any position
on the screen. Some positions will start on even addresses, and some will
start on odd addresses. The images drawn 16 bits at a time on odd addresses
will be slower than those on even addresses. Depending on the circumstances,
it may be advantageous to have two separate drawing routines, one for even
destinations and one for odd destinations. In the case of our tile drawing,
the positions we chose for the tiles are at even addresses only, so we can
optimize our code for it.
While system RAM is faster than video RAM, data alignment works the same
way. Most CPUs read 32 bits of aligned data at a time, even if only 8 bits
are needed. A 16- or 32-bit read that spans two 32-bit blocks will be broken
into two separate reads. This gives us another speedup technique that exploits
memory alignment.
Speedup Technique #8
Try to store image data so that it will be aligned on 32-bit boundaries in
system memory. Pad structures and tables so the data after it will be aligned
in system memory.
This may seem like a repeat of Technique #6, but it's not. We could reorder
our image data and store it in system memory on odd memory addresses. We
have to look at where the image data is coming from as well as to where it
is going. When our images are an odd width, it may be advantageous to add
"padding" bytes in between each line of data, so the routine that draws a
line can be assured of the fastest possible reads from system memory.
Taking all of this knowledge about how memory and the VGA card works, we
can go back and rewrite our tile-drawing code again. Just because we did
general purpose optimizations before, doesn't mean we shouldn't look at them
again. Now that we have examined our needs in detail, we have more information
to work with. For example, because we know the size of our images, we can
apply a technique called "loop unrolling." This gives us another speedup
technique.
Speedup Technique #9
Don't forget about normal assembly language optimizations. After applying
other techniques, your code may have changed to where you can use standard
techniques such as loop unrolling, branch elimination, and instruction
substitution. Gear your optimizations toward the 486 and Pentium systems;
286 systems are all but dead now, but many optimizations are still oriented
toward them.
Taking all we know into consideration, we can write our 16-by-16 tile-drawing
code, shown in Listing 5. The code is rewritten completely in assembly language,
with unrolled loops, image data stored by planes, and aligned 16-bit writes.
After testing, we see that image data and alignment techniques really do
work. This time our results are even better, as shown in Table 1. The speed
improvement over our original routine is amazing! We have achieved total
increases of 500% and 1,000% on the 386 machines, 2,000% on the 486 machines,
and 1,500% on the Pentium. But in terms of results, we are still doing exactly
the same thing. At this point, some of our results are looking almost suspect.
The test routines are somewhat idealized because of the attempts to nullify
the cache's impact, but there is no denying that we can get huge performance
increases on every system by applying all of our speedup techniques.
Are there more ways to improve our tile-drawing code? I am sure there are.
The fact that we have not discussed them here doesn't mean they don't exist.
With that thought, I give you a final speedup technique.
Speedup Technique #10
Never assume your routines are as fast as possible. Always keep your mind
open for new ways to improve your code.
By keeping an open mind we learn more and discover more. Perhaps you have
a speedup technique that I've overlooked. If so, why not drop a line to Game
Developer and share it with us. Until next time, happy hacking!
Matt Pritchard is a software developer for Lacerte Software in Dallas,
Texas, and the author of MODEX110, a comprehensive freeware Mode X library.
You can reach him via e-mail at matthewp@netcom.com or through Game
Developer.
|