nce again, we are off on our quest for the fastest graphics performance possible. This time, we are going to take a look at image drawing routines and the factors that affect performance. In the process, we'll examine many factors that negatively affect performance and the techniques you can use to minimize them. While I am going to use Mode X image drawing as our test example, most of these factors apply to all graphics modes from EGA 16 color to Super VGA True Color.
For the techniques described in this article, let's suppose that we are writing a game that needs to redraw a tile-based background, like Ultima VI or Gauntlet. We are using 256-color Mode X at 320-by-200 pixels of resolution, and our background is 18-by-12 tiles in size. This makes for an update area of 288 pixels by 192 pixels or 55,296 total pixels. Tiles are stored in memory line by line from left to right, top to bottom.
For testing, I will time each routine on five different CPU and video card combinations to get a good cross- section of the machines a game designer would expect his or her programs to run on today. Representing the lower end of most systems is a 40MHz 386 with a slow ISA Trident VGA card. To contrast the variations between VGA cards is another 40MHz 386 with a fast ISA ATI Graphics Ultra Plus. Next, a 33MHz Hewlett Packard 486SX with an S3-based VGA on the motherboard and a 66MHz 486DX2 computer with a Diamond Stealth 32 on a VLB local bus. Finally, we will test on a 90MHz Pentium computer with a Diamond Stealth 64 PCI Bus video card. To reduce the impact of memory caches in our test results, each routine will be called 10,000 times in a loop using the same data buffers. The final results are summarized in Table 1.
We will start by using the Mode X drawing code from the article "Mode X Revealed" (Dec. 1994) to write a straightforward tile-drawing routine in Borland C, as shown in Listing 1.
Our tile-drawing routine is pretty simple and straightforward, wouldn't you agree? It's small, flexible, uses only integers, and has just three executing statements. It's also horribly slow and a perfect candidate for our first speedup technique. Speedup Technique #1
Do not call a separate pixel-plotting routine in image-drawing code. Use inline code or functions instead. Think about how many calls it takes to draw one screen or image.
The crux of this technique can be summed up in one word: overhead. Every time you call a function or procedure, your program incurs the overhead of pushing parameters, executing a far call, setting up a stack frame, decoding parameters, cleaning up the stack, and executing a RETurn instruction. For something simple, like plotting a single pixel, the time spent handling the call can approach the time spent executing the core of the function.
Let's think about this: for each screen we redraw, 55,296 calls are made. Intel's timings show a perfect world time of 24 CPU cycles on a 486 for only the call, stack frame assignment, and return. In reality, that time can be more like 35 cycles per call. Using a call for each pixel can eat up nearly 1.3 to 2 million CPU cycles per screen! These are CPU cycles we want to use for other things.
Now we will rewrite our tile-draw routine, putting the set_point routine inline using Borland C's inline assembly language command, shown in Listing 2.
Although we added code to preserve registers, this version showed considerable improvement on our test machines. We recorded speed increases of 22%, 31%, 39%, 38%, and 10% for about a 30% average increase. Put another way, call overhead made up the difference between 16 frames per second and 20 frames per second! Call overhead is only the tip of the iceberg. Lurking in both examples is one of the biggest obstacles to graphics performance: the OUT instruction. Speedup Technique #2
Minimize the number of OUT instructions to the video card. Switch video planes or memory banks as infrequently as possible.
OUT instructions are the only way to access the various control registers on a graphics card, so they can't be eliminated. You need them to set bit planes, write masks, and select banks in just about every mode except mode 13h and CGA modes. They are necessary, and they are slow.
According to Intel, an OUT instruction takes 10 CPU cycles on a 386 and 16 cycles on a 486. While it's not the fastest instruction, 16 CPU cycles doesn't seem that serious. This wouldn't be a problem if the timings told the whole story, but they don't. The 16 CPU cycles reflect only the CPU processing overhead. No mention is made of the I/O bus, or the VGA card itself. Even on a local bus video card, OUTs still have to go through the 8MHz bus protocol, which was designed for the original IBM PC-AT and PC/XT. In addition to the ISA bus, the VGA card itself has to respond to the OUT and signal the computer that it has properly processed it. Finally, as if this wasn't bad enough, under most memory managers or operating systems, the CPU has to check the OUT instruction for validity against the I/O permission bitmap, a process that can add 10 to 20 more cycles.
Given all these factors, the OUT instruction actually takes anywhere from 30 to more than 70 CPU cycles, with a wide variance due to differences in CPU, motherboard, I/O bus, and video card. With this in mind, how many OUTs do we really need to draw one of our tiles? Four. One OUT to select each of the four Mode X planes. If we rewrite our tile-drawing code to plot all pixels on one plane before going on to the next, it should be much faster since we will have cut the number of OUTs per time from 256 to 4, about a 98% reduction.
Listing 3 shows our tile-drawing code, rewritten to minimize OUTs. Timing this version, we see speed increases of 43%, 19%, 51%, 50%, and 127% over the last version. The scores on the 386 machines illustrate how big the variation can be due to the video card alone. The speedy Pentium showed how factors outside the CPU can slow it down. Compared to the routine we started with, we are averaging about a 100% improvement.
The next couple of speedup techniques are less obvious, but share the same basic philosophy: reduce overhead by removing unnecessary or redundant portions of code. Speedup Technique #3
For frequently drawn images, use specific routines with hard-coded values instead of general-purpose routines. You can choose the variables you encode as constants and remove functionality that you don't need.
You should think about the routines you use, especially if they come from third-party libraries. Ask these questions about your routines:
* Do they have inputs that will always be the same?
* Do they have features that will never be needed?
* Are they more flexible than my program will ever be?
If you answer yes to these questions, then you should consider writing custom versions of those routines for the specific task at hand.
Imagine a library routine that has neat features like a transparent color, or clipping the image to a rectangle. But what if we know that certain images are always solid or are never going to need to be clipped? In this case, it means every time we use them, the computer spends part of its time checking for things that will never happen. With a little up-front planning and awareness, a good game designer will avoid overhead by implementing alternate versions of those routines which don't have features that will go unused.
In our tile-drawing routine we pass in two variables, Xwide and Ywide, that tell us how big the tile image is. But, according to our specifications, the only size tile we will ever draw is 16-by-16. We can make a version specifically for 16-by-16 tiles and save the overhead of passing those variables every call. If we need to draw tiles in other sizes, we still have our general-purpose routine, or we could write another specific-sized routine.
While looking over the tile-drawing code with this in mind, we see another situation where the code is doing something that may be unnecessary. Why are we doing a MULtiply every time we plot a pixel? This leads us to yet another speedup technique. Speedup Technique #4
Precalculate frequently needed information whenever possible. Use lookup tables instead of calculations. Image drawing should be about transferring data, not calculating it.
Taking a closer look at our code, we see that for every pixel plotted, one MULtiply instruction is performed when calculating the display address. That is not terrible, but the multiply takes anywhere from 10 to 26 CPU cycles, and we do it 55,296 times for every screen. If we stop and think about what is being multiplied, we are hit with this realization: we are multiplying the same 200 numbers over and over again. Because we know how wide the screen is going to be, and we know what Y values the pixels will be, why not do all the multiplying once and save the results in a table? This method is known as using a lookup table.
By using a lookup table in our tile-drawing routine, we can replace the MUL instruction that takes 10 to 26 cycles with a lookup sequence that takes two to four cycles. Even with timings that depend on the exact numbers multiplied, we should easily see savings of at least a half million cycles per screen drawn.
Our tile-drawing routine doesn't actually do justice to the benefits a lookup table can provide. More complex calculations such as square roots, sines and cosines, vectors, rotation, and scaling factors, can take hundreds of CPU cycles. But these, too, can be replaced with lookups that take only a couple of cycles. It should be no surprise that graphics-intensive games like Wolfenstein 3-D, Doom, TIE Fighter, Wing Commander, and many more make extensive use of lookup tables.
Now, let's redo our tile-drawing routine and apply speedup techniques 3 and 4, shown in Listing 4.
Timing this version again shows healthy speed increases of 20%, 22%, 33%, 37%, and 37%. Overall, we have achieved about a 100% speed increase on the 386 machines, 180% on the 486, and 240% on the Pentium. It seems like we are running out of techniques that involve only modifying the code. It's time to turn our attention toward the actual display data and the VGA card itself. Understanding in detail how the CPU, memory, and VGA card work and interact with each other will allow us to uncover facts that we can exploit to further improve our routines.
Memory, it turns out, is the key. We know that video memory is slower than normal RAM, so what's the best way to access it? To study this, I've turned to a tool that you can find on your favorite bulletin board or online service: Michael Abrash's Zen Timer. I've used the Zen Timer to time how fast video memory can be written using various methods. The results are summarized in Table 2. You will notice that 16-bit writes take the same amount of time as 8-bit writes.
Looking at our tile-drawing code, we are using 8-bit writes to draw one pixel at a time when we could be using 16-bit writes to draw two pixels in the same amount of time. This brings us to our next speedup technique. Speedup Technique #5
When writing data to the VGA card, use 16-bit writes whenever possible. They take the same amount of time as 8-bit writes, but transfer twice as much data.
You'll notice that I didn't list timings for 32-bit writes. Thirty-two bit writes bring up some other issues, such as bus type and REP MOVSD interfering with sound DMA on some 386 systems. We are going to save this issue for another time, when we can examine 32-bit graphics in detail.
In Mode X, writing a 16-bit value will plot two pixels that are four pixels apart instead of right next to each other. But our image data is stored linearly. We can either gather the two pixels together before each write or use this next speedup technique. Speedup Technique #6
Arrange your image data in the form it will be drawn to video memory. For Mode X or 16-color modes, separate and store image data by video planes.
The way we store our image data is up to us, the game designers. Those decisions dictate how our graphic routines must work. This gives us another area where we can design our routines to work as efficiently as possible. Looking at the video memory timings some more, we come across a possible hitch with 16-bit writes. When a 16-bit value is written to an odd video memory address, it takes twice as long as an even address. The reason for that lies in how the VGA card and the bus work together. When the data written spans a 16-bit boundary, the bus splits it into two separate writes. Further testing shows that on the local bus video cards tested, 16-bit writes slowed down only on odd addresses that cross double word boundaries. Aligning write data to the size of the video bus (16 or 32 bits) gives us another speedup technique. Speedup Technique #7
When writing images, align the data on word and double word boundaries whenever possible. It may be advantageous to have separate routines for even and odd image destinations.
This creates a problem for routines that can draw an image at any position on the screen. Some positions will start on even addresses, and some will start on odd addresses. The images drawn 16 bits at a time on odd addresses will be slower than those on even addresses. Depending on the circumstances, it may be advantageous to have two separate drawing routines, one for even destinations and one for odd destinations. In the case of our tile drawing, the positions we chose for the tiles are at even addresses only, so we can optimize our code for it.
While system RAM is faster than video RAM, data alignment works the same way. Most CPUs read 32 bits of aligned data at a time, even if only 8 bits are needed. A 16- or 32-bit read that spans two 32-bit blocks will be broken into two separate reads. This gives us another speedup technique that exploits memory alignment. Speedup Technique #8
Try to store image data so that it will be aligned on 32-bit boundaries in system memory. Pad structures and tables so the data after it will be aligned in system memory.
This may seem like a repeat of Technique #6, but it's not. We could reorder our image data and store it in system memory on odd memory addresses. We have to look at where the image data is coming from as well as to where it is going. When our images are an odd width, it may be advantageous to add "padding" bytes in between each line of data, so the routine that draws a line can be assured of the fastest possible reads from system memory.
Taking all of this knowledge about how memory and the VGA card works, we can go back and rewrite our tile-drawing code again. Just because we did general purpose optimizations before, doesn't mean we shouldn't look at them again. Now that we have examined our needs in detail, we have more information to work with. For example, because we know the size of our images, we can apply a technique called "loop unrolling." This gives us another speedup technique. Speedup Technique #9
Don't forget about normal assembly language optimizations. After applying other techniques, your code may have changed to where you can use standard techniques such as loop unrolling, branch elimination, and instruction substitution. Gear your optimizations toward the 486 and Pentium systems; 286 systems are all but dead now, but many optimizations are still oriented toward them.
Taking all we know into consideration, we can write our 16-by-16 tile-drawing code, shown in Listing 5. The code is rewritten completely in assembly language, with unrolled loops, image data stored by planes, and aligned 16-bit writes.
After testing, we see that image data and alignment techniques really do work. This time our results are even better, as shown in Table 1. The speed improvement over our original routine is amazing! We have achieved total increases of 500% and 1,000% on the 386 machines, 2,000% on the 486 machines, and 1,500% on the Pentium. But in terms of results, we are still doing exactly the same thing. At this point, some of our results are looking almost suspect. The test routines are somewhat idealized because of the attempts to nullify the cache's impact, but there is no denying that we can get huge performance increases on every system by applying all of our speedup techniques.
Are there more ways to improve our tile-drawing code? I am sure there are. The fact that we have not discussed them here doesn't mean they don't exist. With that thought, I give you a final speedup technique. Speedup Technique #10
Never assume your routines are as fast as possible. Always keep your mind open for new ways to improve your code.
By keeping an open mind we learn more and discover more. Perhaps you have a speedup technique that I've overlooked. If so, why not drop a line to Game Developer and share it with us. Until next time, happy hacking!
Matt Pritchard is a software developer at Ensemble Studios.