Some performance issues from AoE needed to be resolved while we were working on AoK, the biggest of which was AoE's 2D graphics pipeline. The graphics for AoK are created through a combination of software rendering and hardware composition. This pipeline had been highly optimized for AoE by hand-coding most of the system in Assembly, so there was not much additional need to optimize it for AoK.
But there were new features to integrate into the 2D pipeline. For one thing, AoK had more detailed terrain. Also, units that were visually obscured behind buildings and other obstructions appeared as outlines so players could see where they were. Both of these systems were implemented as a mixture of C/C++ and hand-coded Assembly.
The biggest challenge in keeping the performance up for the graphics system was making sure that the sprites used for graphics in the game were properly tagged as belonging in system memory or video memory. If a sprite was in the wrong memory type, a significant performance hit or even an error could occur, and these graphics memory location problems were hard to identify. They were often betrayed only by a drawing problem on-screen, such as a shadow drawing on top of a unit instead of under it.
Sprites used by the software rendering engine needed to be in system memory so that they could be read and processed. If they resided in video memory instead, the limited throughput from video memory caused a significant performance hit to the game. Conversely, sprites bltted by the hardware that accidentally ended up in system memory would render slowly and could fail to render at all if the hardware bltter didn't support blts from system memory.
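The placement rule described above can be made mechanical with a debug-time tag on each sprite. This is only a minimal sketch: the enum names, the `Sprite` structure, and the check function are assumptions for illustration, not AoK's actual data structures (which were never published).

```cpp
// Hypothetical tags -- AoK's real sprite structures are not public.
enum class MemoryPool { System, Video };
enum class Renderer   { Software, Hardware };

struct Sprite {
    MemoryPool pool;     // where the surface actually lives
    Renderer   drawnBy;  // which path blts this sprite
};

// Debug-time check: software rendering wants system memory (fast CPU
// reads); hardware bltting wants video memory (fast card-side blts).
bool PoolMatchesRenderer(const Sprite& s) {
    return (s.drawnBy == Renderer::Software && s.pool == MemoryPool::System)
        || (s.drawnBy == Renderer::Hardware && s.pool == MemoryPool::Video);
}
```

A check like this, asserted whenever a sprite is handed to a renderer, turns a silent performance bug into an immediate, diagnosable failure in debug builds.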
Pathfinding problems from AoE also had to be fixed. In AoE, there was a single unit-pathing system, which was known as "tile pathing" because it broke the game map down into individual tiles and classified them as "passable" or "nonpassable." This tile-pathing system was fairly good at moving units short distances, but it often took too long to find paths (if it could find one at all), so we created two additional pathing systems for AoK.
The first of these two systems, "MIP-map pathing," quickly approximated distant paths across the map. The basis for MIP-map pathing was the construction of compact bit vectors that described the passability of each tile on the map. This system allowed the game to determine quickly whether it was even possible for a unit to get from its current location to the general target area, though the precision of that answer was limited by the resolution of the bit vectors.
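The bit-vector idea can be sketched as a small image pyramid over the passability map. The code below is only an illustration of the general technique under assumed details (a square, power-of-two map, one byte per bit, OR-reduction of 2×2 blocks); AoK's actual representation was not published.

```cpp
#include <vector>
#include <cstdint>

// One entry per tile; 1 = possibly passable, 0 = known impassable.
using BitGrid = std::vector<uint8_t>;

// OR-reduce each 2x2 block: a coarse cell is marked passable if any of
// its four sub-tiles is, so a 0 at a coarse level rules a region out
// without ever examining individual tiles.
BitGrid Downsample(const BitGrid& fine, int n) {
    int half = n / 2;
    BitGrid coarse(half * half, 0);
    for (int y = 0; y < half; ++y)
        for (int x = 0; x < half; ++x)
            coarse[y * half + x] =
                fine[(2*y)   * n + 2*x] | fine[(2*y)   * n + 2*x + 1] |
                fine[(2*y+1) * n + 2*x] | fine[(2*y+1) * n + 2*x + 1];
    return coarse;
}

// Cheap feasibility test at a chosen level: if the target's coarse cell
// is impassable, the expensive path search can be skipped entirely.
bool MaybeReachable(const BitGrid& level, int n, int tx, int ty, int shift) {
    int side = n >> shift;
    return level[(ty >> shift) * side + (tx >> shift)] != 0;
}
```

The key property is asymmetry: a negative answer at a coarse level is definitive, while a positive answer only means a full search is worth attempting.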
Once a unit was within a short distance of its target area, another new pathing system called "low-level pathing" was used. Low-level pathing allowed very accurate pathing over short distances. When low-level pathing failed, the pathing system fell back and used the original tile pathing from AoE.
Changing the pathing system from a single, general-purpose system to three special-purpose systems improved the performance of AoK and also significantly improved game play since it virtually eliminated the problem of stuck and stopped units caused by pathing failures.
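The division of labor among the three systems might be glued together roughly as follows. Everything here is an assumption for illustration: the function names, the Manhattan-distance test, and the eight-tile cutoff are invented stand-ins, not AoK's actual logic.

```cpp
#include <cstdlib>

struct Point { int x, y; };

// Trivial stand-ins so the sketch compiles; the real systems are far
// richer. Here, any target with negative coordinates is "unreachable."
bool MipMapSaysReachable(Point, Point to) { return to.x >= 0 && to.y >= 0; }
bool LowLevelPath(Point, Point)           { return true; }
bool TilePath(Point, Point)               { return true; }

const int kShortRange = 8;  // tiles; assumed cutoff for low-level pathing

bool FindPath(Point from, Point to) {
    // 1. Cheap coarse feasibility test: bail out before any real search.
    if (!MipMapSaysReachable(from, to))
        return false;
    // 2. Accurate short-range pather for nearby targets.
    int dist = std::abs(to.x - from.x) + std::abs(to.y - from.y);
    if (dist <= kShortRange && LowLevelPath(from, to))
        return true;
    // 3. Fall back to the original AoE tile pathing.
    return TilePath(from, to);
}
```

The ordering is the point: the cheapest test runs first and the general-purpose system runs last, only when the specialized ones cannot answer.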
While we were able to improve the pathing system for AoK, enhancing the unit-class hierarchy system was a much more onerous task. The unit-class hierarchy system from AoE couldn't be changed easily since so many game systems and so much functionality relied on the old implementation. At its heart, the game's unit-class system is a hierarchy of derived classes and each derived class is more specialized than its parent. The special functions of each derived class are supported by virtual functions exposed by the classes in the hierarchy. A simplified version of the hierarchy is shown in Figure 4.
Figure 4. AoK unit class hierarchy
From a programming standpoint, calling a virtual function consumes only slightly more overhead than calling a regular class member function (one extra indirection through the vtable).
If each class could implement only its own version of the virtual functions, then this hierarchy wouldn't cause any function overhead problems. However, since each level of the hierarchy implements its own special code, it must also call its parent's version of the derived function to perform its work. In a hierarchy four classes deep, that means calling three additional functions. This may not sound like much, but it can add up when code is executed hundreds of thousands or millions of times.
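The chained-call pattern can be demonstrated with a counting sketch. The class names below are illustrative placeholders for a four-deep hierarchy like the one in Figure 4, not AoK's actual classes: one virtual dispatch on the leaf ends up executing four function bodies.

```cpp
// Each override does its own specialized work (modeled here as a
// counter increment), then explicitly calls its parent's version.
static int g_callCount = 0;

struct Object {
    virtual ~Object() = default;
    virtual void Update() { ++g_callCount; }           // base work
};
struct ActionObject : Object {
    void Update() override { ++g_callCount; Object::Update(); }
};
struct CombatObject : ActionObject {
    void Update() override { ++g_callCount; ActionObject::Update(); }
};
struct BuildingObject : CombatObject {
    void Update() override { ++g_callCount; CombatObject::Update(); }
};
```

One `Update()` on a `BuildingObject` thus costs four calls; multiplied across hundreds of thousands of unit updates per game, the chain itself becomes measurable.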
Some performance improvement could have been gained by circumventing the hierarchy using "special case" units. For example, walls are a type of building unit that do not attack other units and only need to scan their vicinity for enemy units every few game updates unless they are under attack. To handle this special case, we could specifically check whether the current unit being processed is a wall, and if so, skip the code that is only executed for other buildings. Unfortunately, coding in too many special cases can also lead to performance losses, because you end up checking every unit against each of your many special cases. In the end, we left the unit-class hierarchy in place and made targeted changes to shortcut functionality that didn't apply to specific units.
Commercial Profiling Tools: The Good, the Bad, and the Ugly
Performance analysis extends beyond evaluating the execution speed of program functions and subsystems. It also includes measuring memory usage and evaluating the way the program interacts with other programs and the operating system. In order to determine where the performance problems were in AoK, four separate tools were used: Intel's VTune, NuMega's TrueTime, the Windows NT performance counters, and our own profiling and memory instrumentation code.
Although we used Microsoft Visual C++, we did not use the bundled Microsoft Profiler. There were two reasons for this: we found it difficult to get the Microsoft product to work correctly (or at all) and the data format from their profiler was either inadequate or needed post-processing in a spreadsheet to be minimally useful. Using VTune, TrueTime, and the NT performance counters we were able to collect, analyze, and present data in a reasonable fashion.
VTune is a sampling profiler, which means it has a component that wakes up every few milliseconds (or whatever amount of time you specify) and looks at what processes are executing on the CPU(s). When you decide enough time has elapsed, you can stop VTune and look at the statistics it produces for each process executed during that time. If you've compiled your program with debug information, VTune can display which lines of code were called and what percentage of the elapsed time was consumed by the executing code.
VTune is great because you don't need to compile a special version of your program, it doesn't slow your program down while it runs, and it lets you see the amount of time the CPU spent executing processes besides your own. The only major drawback is that you can end up with spurious data due to this sampling, whether from other processes running in the system or from running VTune for too long a period. To improve accuracy on your own program, VTune comes with an API to turn sampling on and off programmatically. This is a very useful feature, especially when drilling down into the performance of specific subsystems and smaller sections of code.
We found that VTune's call-graph functionality couldn't be used with a program that linked either explicitly or implicitly with DirectDraw. Also, some applications (including AoK) were too large, in terms of code and debug information, for VTune to resolve its data back to lines of code correctly. It seems that some of these problems have been fixed in VTune 4.5, however.
Another commercial product that we used was NuMega's TrueTime, which is an instrumenting profiler. To use this product, you have to make a special TrueTime compilation of your program that inserts timing code into each module. This can sometimes be a slow build process, but it's worth it. As the TrueTime build of your program runs, TrueTime logs which functions are entered, when they are entered, and when they are exited. This process can be significantly slower than VTune's effectively real-time performance but it's a useful second opinion nonetheless. The only big drawback (and it can be very severe) is that TrueTime can slow down your program so much that it's impossible to use it for profiling network code. This problem can also skew profiling statistics for time-based game actions such as AI or updates that are scheduled to occur at a certain interval of time.
This performance hit from TrueTime also made it impractical to use it to analyze the performance of the graphics subsystem. When system performance relies on two independent processors (such as the main CPU and the graphics card), efficient cooperation between both processors is critical so that they run concurrently and perform operations in parallel. When TrueTime slowed the CPU (and consequently the AoK rendering load which the CPU governed), it made the graphics card appear to give better performance than it actually did.
There were four drawbacks to both programs. First, neither program can be run in batch mode, so the programmer has to baby-sit the programs while they run through each performance test case. Even though we worked on performance test cases one at a time, it would have been convenient to run each program in batch mode overnight to gather results from other test cases. VTune has since added a batch interface in version 4.5 but support is still lacking in TrueTime.
Second, performance numbers gathered during the execution of a program need to be taken with a grain of salt. Due to the multi-threaded nature of the Windows operating system, other programs (including the performance tool itself) are effectively running at the same time, and that can skew performance. Fortunately, multiple performance runs with the same tool or with different tools can help identify specific problem areas by correlating all of the results, and analyzing performance over smaller sections of code can improve accuracy and reduce the time required by some performance tools.
The third drawback to these profilers is that it's difficult to use both TrueTime and VTune together when using Visual C++ as your development environment. TrueTime cannot instrument code from Visual C++ with VTune installed because VTune renames certain underlying compile and link programs.
Finally, although both tools display call graphs, we found it difficult at times to ascribe performance timings to specific subsystems. For instance, pathing was sometimes called from both movement and retargeting code, but we were not able to determine which subsystem was using more of the pathing code. TrueTime was generally accurate about this, but in some cases, the numbers it reported just didn't seem to add up. In this type of case, we had to place our own timing code directly into AoK to get complete results.
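Hand-inserted timing code of the kind described above is usually just a scoped timer that accumulates elapsed time into a named bucket per call site. The sketch below is an assumption-laden modern rendition (AoK's actual instrumentation was never published, and `std::chrono` postdates it); the subsystem names are invented for illustration.

```cpp
#include <chrono>
#include <map>
#include <string>

// Accumulated time per named bucket, keyed by call site.
static std::map<std::string, std::chrono::nanoseconds> g_buckets;

// RAII timer: construction stamps the start time, destruction adds the
// elapsed time to the bucket, so one line instruments a whole scope.
class ScopedTimer {
    std::string name_;
    std::chrono::steady_clock::time_point start_;
public:
    explicit ScopedTimer(std::string name)
        : name_(std::move(name)), start_(std::chrono::steady_clock::now()) {}
    ~ScopedTimer() {
        g_buckets[name_] += std::chrono::steady_clock::now() - start_;
    }
};

// Giving each caller of the pathing code its own bucket answers exactly
// the question the off-the-shelf tools couldn't: which subsystem is
// spending more time in pathing.
void PathFromMovement() { ScopedTimer t("path/movement"); /* ...pathing... */ }
void PathFromRetarget() { ScopedTimer t("path/retarget"); /* ...pathing... */ }
```

Because each bucket is tied to a call site rather than a function, the totals attribute shared code (like pathing) to the subsystems that invoked it.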
Regardless of how good today's profiling tools are, they have no understanding of or insight into the underlying program they profile; profiling and analysis tools would be significantly more useful if they had some understanding of what the application was attempting to accomplish. With that kind of additional functionality, these tools could provide performance statistics that would greatly enhance the programmer's ability to improve the application performance. Until that day arrives, you'll have to add profiling and analysis code to your application for even the most mundane performance information beyond simple timings and call graphs.
Performance on the Minimum System
Since performance statistics can change based on the platform on which the application is running, it was especially critical to get computer systems that matched the minimum system specification. To demonstrate this performance differential and the scalability of AoK, two test cases were run on the minimum system configuration and one was run on a regular development workstation (Figure 5). To contrast the data as much as possible in this example, the first test case uses the maximum AoK settings for players (eight) and map size (giant). The second test case conforms to the game settings for the minimum system configuration: four players on a four-player-sized game map.
Figure 5. Performance Analysis.
| Test PC 1 | Test Case 1 | Test PC 2 | Test Case 2 |
| --- | --- | --- | --- |
| 166 MHz Pentium | 60 seconds of game play | Dual 450 MHz Pentium III | 60 seconds of game play |
| 32 MB RAM | Eight-player game | 128 MB RAM | Four-player game |
| S3 Virge GX | Giant map, largest map available | Nvidia TNT2 Ultra | Four-player map size |
| Windows 98 | One civ from each of the four civ art sets | Windows 2000 | All civs share the same civ art set |
Using VTune, the percentage of CPU clock cycles spent in each process during an AoK game was calculated for a 60-second period at 30-minute intervals. This was done on the 166MHz Pentium minimum system (Figure 6) and on a dual 450MHz Pentium III development workstation (Figure 7).
Figures 6 (left) and 7 (right). Four-player and eight-player game CPU process utilization
As you can see, the four-player game performs well on the 166MHz Pentium. The AoK process starts at approximately 60 percent of the CPU and increases to about 75 percent after 30 minutes. The additional time devoted to the virtual memory manager (VMM) process at startup is caused by AoK data as it is paged in and accessed for the first time. In contrast, the amount of CPU time used by AoK in the eight-player game degrades over time. This is due to the additional memory requirements to support so many players and such a large game map. The CPU reaches the point where it's spending almost as much time managing the virtual memory system as actually executing the game itself.
Since the development workstation (Test PC 2) is a dual-processor system and AoK is single-threaded, the second CPU sits idle in the kernel's idle loop. This is why NTOSKRNL is shown consuming approximately 50 percent of the CPU.
As both the four- and eight-player games progress, the AoK process continues to use more and more of the CPU. There is no downward pressure applied by other processes, as there was in the eight-player game on the 166MHz Pentium.
If it had not already been established that four players was the number of players to support on the minimum system, these same statistics could have been collected for a varying number of players. Then we could have set the maximum number of players based on how many players could fit within the memory footprint of 32MB.