Some performance
issues from AoE needed to be resolved while we were working on AoK,
the biggest of which was AoE's 2D graphics pipeline. The graphics for
AoK are created through a combination of software rendering and hardware
composition. This pipeline had been highly optimized for AoE by hand-coding
most of the system in Assembly, so there was not much additional need
to optimize it for AoK.
But there
were new features to integrate into the 2D pipeline. For one thing,
AoK had more detailed terrain. Also, units that were visually obscured
behind buildings and other obstructions appeared as outlines so players
could see where they were. Both of these systems were implemented as
a mixture of C/C++ and hand-coded Assembly during implementation.
The biggest
challenge in keeping the performance up for the graphics system was
making sure that the sprites used for graphics in the game were properly
tagged as belonging in system memory or video memory. If a sprite was
in the wrong memory type a significant performance hit or even an error
could occur, but it was usually hard to identify these graphics memory
location problems. They were usually marked by a drawing problem on-screen,
such as a shadow drawing on top of a unit instead of under it.
Sprites
used by the software rendering engine needed to be in system memory
so that they could be read and processed. If they resided in video memory
instead, the limited throughput from video memory caused a significant
performance hit to the game. Conversely, sprites bltted by the hardware
that accidentally ended up in system memory would render slowly and
could fail to render at all if the hardware bltter didn't support blts
from system memory.
Pathfinding
problems from AoE also had to be fixed. In AoE, there was a single unit-pathing
system, which was known as "tile pathing" because it broke
the game map down into individual tiles and classified them as "passable"
or "nonpassable." This tile-pathing system was fairly good
at moving units short distances, but it often took too long to find
paths (if it could find one at all), so we created two additional pathing
systems for AoK.
The first
of these two systems, "MIP-map pathing," quickly approximated
distant paths across the map. The basis for MIP-map pathing was the
construction of compact bit vectors that described the passability of
each tile on the map. This system allowed the game to determine quickly
whether it was even possible for a unit to get from its current location
to the general target area. The only way to determine whether the general
area could be reached was through the resolution of the bit vectors.
Once a
unit was within a short distance of its target area, another new pathing
system called "low-level pathing" was used. Low-level pathing
allowed very accurate pathing over short distances. When low-level pathing
failed, the pathing system fell back and used the original tile pathing
from AoE.
Changing
the pathing system from a single, general-purpose system to three special-purpose
systems improved the performance of AoK and also significantly improved
game play since it virtually eliminated the problem of stuck and stopped
units caused by pathing failures.
While
we were able to improve the pathing system for AoK, enhancing the unit-class
hierarchy system was a much more onerous task. The unit-class hierarchy
system from AoE couldn't be changed easily since so many game systems
and so much functionality relied on the old implementation. At its heart,
the game's unit-class system is a hierarchy of derived classes and each
derived class is more specialized than its parent. The special functions
of each derived class are supported by virtual functions exposed by
the classes in the hierarchy. A simplified version of the hierarchy
is shown in Figure 4.
 |
Figure
4. AoK unit class hierarchy
|
From a
programming standpoint, calling a virtual function consumes no more
overhead than a regular class function.
If each
class could implement only its own version of the virtual functions,
then this hierarchy wouldn't cause any function overhead problems. However,
since each level of the hierarchy implements its own special code, it
must also call its parent's version of the derived function to perform
its work. In a hierarchy four classes deep, that means calling three
additional functions. This may not sound like much, but it can add up
when code is executed hundreds of thousands or millions of times.
Some performance
improvement could have be gained by circumventing the hierarchy using
"special case" units. For example, walls are a type of building
unit that do not attack other units and only need to scan their vicinity
for enemy units every few game updates unless they are under attack.
To handle this special case, we could specifically check whether the
current unit being processed is a wall, and if so, skip the code that
is only executed for other buildings. Unfortunately, coding in too many
special cases can also lead to performance losses, because you end up
checking to see whether a unit is one of your many special cases. In
the end, we left unit-class hierarchy in place, and made specific changes
to shortcut to functionality that didn't apply to specific units.
Commercial
Profiling Tools: The Good, the Bad, and the Ugly
Performance
analysis extends beyond evaluating the execution speed of program functions
and subsystems. It also includes measuring memory usage and evaluating
the way the program interacts with other programs and the operating
system. In order to determine where the performance problems were in
AoK, four separate tools were used: Intel's VTune, NuMega's TrueTime,
the Windows NT performance counters, and our own profiling and memory
instrumentation code.
Although
we used Microsoft Visual C++, we did not use the bundled Microsoft Profiler.
There were two reasons for this: we found it difficult to get the Microsoft
product to work correctly (or at all) and the data format from their
profiler was either inadequate or needed post-processing in a spreadsheet
to be minimally useful. Using VTune, TrueTime, and the NT performance
counters we were able to collect, analyze, and present data in a reasonable
fashion.
VTune
is a sampling profiler, which means it has a component that wakes up
every few milliseconds (or whatever amount of time you specify) and
looks at what processes are executing on the CPU(s). When you decide
enough time has elapsed, you can stop VTune and look at the statistics
it produces for each process executed during that time. If you've compiled
your program with debug information, VTune can display which lines of
code were called and what percentage of the elapsed time was consumed
by the executing code.
VTune
is great because you don't need to compile a special version of your
program, it doesn't slow your program down while it runs, and it lets
you see the amount of time the CPU spent executing processes besides
your own. The only major drawback is that you can end up with spurious
data due to this sampling. This can be caused by other processes that
are running in the system, or by running VTune for too long a period.
To improve VTune's accuracy on your own program, it comes with an API
to turn VTune on and off programmatically. This is a very useful feature,
especially when drilling down into the performance of specific subsystems
and smaller sections of code.
We found
that VTune's call-graph functionality couldn't be used with a program
that linked either explicitly or implicitly with DirectDraw. Also, some
applications (including AoK) were too large in terms of code and debug
information in order for VTune to resolve its data back correctly to
a line of code. It seems that some of these problems have been fixed
in VTune 4.5, however.
Another
commercial product that we used was NuMega's TrueTime, which is an instrumenting
profiler. To use this product, you have to make a special TrueTime compilation
of your program that inserts timing code into each module. This can
sometimes be a slow build process, but it's worth it. As the TrueTime
build of your program runs, TrueTime logs which functions are entered,
when they are entered, and when they are exited. This process can be
significantly slower than VTune's effectively real-time performance
but it's a useful second opinion nonetheless. The only big drawback
(and it can be very severe) is that TrueTime can slow down your program
so much that it's impossible to use it for profiling network code. This
problem can also skew profiling statistics for time-based game actions
such as AI or updates that are scheduled to occur at a certain interval
of time.
This performance
hit from TrueTime also made it impractical to use it to analyze the
performance of the graphics subsystem. When system performance relies
on two independent processors (such as the main CPU and the graphics
card), efficient cooperation between both processors is critical so
that they run concurrently and perform operations in parallel. When
TrueTime slowed the CPU (and consequently the AoK rendering load which
the CPU governed), it made the graphics card appear to give better performance
than it actually did.
There
were four drawbacks to both programs. First, neither program can be
run in batch mode, so the programmer has to baby-sit the programs while
they run through each performance test case. Even though we worked on
performance test cases one at a time, it would have been convenient
to run each program in batch mode overnight to gather results from other
test cases. VTune has since added a batch interface in version 4.5 but
support is still lacking in TrueTime.
Second,
performance numbers gathered during the execution of a program need
to be taken with a grain of salt. Due to the multi-threaded nature of
the Windows operating system, other programs (including the performance
tool itself) are effectively running at the same time, and that can
skew performance. Fortunately, multiple performance runs with the same
tool or with different tools can help identify specific problem areas
by correlating all of the results, and analyzing performance over smaller
sections of code can improve accuracy and reduce the time required by
some performance tools.
The third
drawback to these profilers is that it's difficult to use both TrueTime
and VTune together when using Visual C++ as your development environment.
TrueTime cannot instrument code from Visual C++ with VTune installed
because VTune renames certain underlying compile and link programs.
Finally,
although both tools display call graphs, we found it difficult at times
to ascribe performance timings to specific subsystems. For instance,
pathing was sometimes called from both movement and retargeting code,
but we were not able to determine which subsystem was using more of
the pathing code. TrueTime was generally accurate about this, but in
some cases, the numbers it reported just didn't seem to add up. In this
type of case, we had to place our own timing code directly into AoK
to get complete results.
Regardless
of how good today's profiling tools are, they have no understanding
of or insight into the underlying program they profile; profiling and
analysis tools would be significantly more useful if they had some understanding
of what the application was attempting to accomplish. With that kind
of additional functionality, these tools could provide performance statistics
that would greatly enhance the programmer's ability to improve the application
performance. Until that day arrives, you'll have to add profiling and
analysis code to your application for even the most mundane performance
information beyond simple timings and call graphs.
Performance
on the Minimum System
Since
performance statistics can change based on the platform on which the
application is running, it was especially critical to get computer systems
that matched the minimum system specification. To demonstrate this performance
differential and the scalability of AoK, two test cases were run on
the minimum system configuration and one was run on a regular development
workstation (Figure 5). To contrast the data as much as possible in
this example, the first test case uses the maximum AoK settings for
players (eight) and map size (giant). The second test case conforms
to the game settings for the minimum system configuration: four players
on a four-player-sized game map.
Figure
5. Performance Analysis.
|
Test
PC 1
|
Test
Case 1
|
Test
PC 2
|
Test
Case 2
|
166
MHz Pentium |
60
seconds of game play |
Dual
450 MHz Pentium III |
60
seconds of game play |
32 MB RAM |
Eight-player game |
128 MB RAM |
Four-player game |
S3
Virge GX |
Giant
map, largest map available |
Nvidia
TNT2 Ultra |
Four-player
map size |
Windows 98 |
One civ from each of
the four civ art setse |
Windows 2000 |
All civs share same
div art set |
Using
VTune, the percentage of CPU clock cycles spent in each process during
an AoK game was calculated for a 60-
second
period at 30-minute intervals. This was done on the 166MHz Pentium minimum
system (Figure 6), and on a dual 450MHz Pentium III development workstation
(Figure 7).
 |
Figure
6 (left). Four-player and eight-player game CPU process utilizatin
(Pentium-166).
Figure 7 (right). Four-player and eight-player game CPU process
utilization (dual Pentium III-450).
|
As you
can see, the four-player game performs well on the 166MHz Pentium. The
AoK process starts at approximately 60 percent of the CPU and increases
to about 75 percent after 30 minutes. The additional time devoted to
the virtual memory manager (VMM) process at startup is caused by AoK
data as it is paged in and accessed for the first time. In contrast,
the amount of CPU time used by AoK in the eight-player game degrades
over time. This is due to the additional memory requirements to support
so many players and such a large game map. The CPU reaches the point
where it's spending almost as much time managing the virtual memory
system as actually executing the game itself.
Since
the development workstation (Test PC 2) is a dual-processor system and
AoK is single-threaded, the second CPU is idle as the kernel runs. This
is why the NTOSKRNL is shown as approximately 50 percent of the CPU.
As both
the four- and eight-player games progress, the AoK process continues
to use more and more of the CPU. There is no downward pressure being
applied from other processes as there was for the 166MHz Pentium for
eight players.
If it
had not already been established that four players was the number of
players to support on the minimum system, these same statistics could
have been collected for a varying number of players. Then we could have
set the maximum number of players based on how many players could fit
within the memory footprint of 32MB.