To complement and augment the results from the commercial profilers we were using, we developed an in-house profiling tool as well. This tool logged the execution time of high-level game systems and functions (telling us how much time was spent by each one) and told us the absolute amount of time a section of code took to execute - a sanity check for performance optimizations that we sorely needed. Our profiling system consisted of four simple functions that were easily inserted and removed for profiling purposes and relied on a simple preprocessor directive, _PROFILE, that compiled the profiling code in or out of the executable. This let us keep our profiling calls in the code, instead of forcing us to add and remove them to create nonprofiled builds. You can download an abbreviated example of the profiling code from the Game Developer web site (http://www.gdmag.com).
While VTune told us how much of the CPU AoK was using (Figure 6), our custom profiler told us how much time was being spent on each of AoK's major subsystems (Figure 8). This additional information told us interesting things about the performance of AoK and where we might be able to improve performance. You can see in Figure 8 that the amount of time devoted to game world simulation and unit AI increases from approximately 33 percent to approximately 57 percent of the AoK process over the course of three samples at 30-minute intervals during an eight-player game.
Looking back at the process statistics from VTune in Figure 6, you see that the amount of time spent in the VMM increases while the time spent on AoK decreases. Since AoK spends more time in simulation/AI and the operating system spends more time manipulating virtual memory, we can propose some theories to explain this:
More data would be required to determine the cause of this problem. It would also be good to break the "simulation/AI" group down into more discrete components for timing.
Figure 8. An eight-player breakdown of internal AoK performance (Pentium-166).
Our timing code relies on the Assembly instruction ReadTimestampCounter (RDTSC), but it could also have used the Win32 QueryPerformanceCounter, or another fine-grained counter or timer. We chose RDTSC because it was simple to use, it works on all Pentium (and later) processors (except some very early Cyrix Pentium-class parts), and these profiling functions were based on extending existing code.
Finally, although both tools display call graphs, pathing was sometimes called from both movement and retargeting code but we were not able to determine which subsystem was using more of the pathing code. TrueTime was generally accurate about this, but in some cases the numbers it reported just didn't seem to add up. In this type of case, we had to place our own timing code directly into AoK to get complete results.
As I stated earlier, it was difficult to assign performance timings to specific subsystems based on the results of the commercial profilers that we used. To remedy this, we built functionality into our custom profiler to determine how much of each system's time was spent in, say, pathing. Here's how our profiler works. During profiler initialization (_ProfInit), the static array of profiling information (groupList) is initialized to zero, and the CPU frequency is calculated. The size of the groupList array matches the number of profile group entries in the ProfileGroup enum list in the prof header file. The CPU frequency is calculated with a simple, albeit slow, function called GetFrequency. (Alternately, this could have used one of the specific CPU identification libraries available from Intel or AMD, but this code works transparently on Windows 95/98 and NT and across processors without problems.)
The final part of initialization seeds each groupList array entry with its parent group. Since the groupList array entries match the ProfileGroup enums in order, the ProfileGroup enum can be used as an index into the groupList array to set the parent group value. Using the SetMajorSection macro significantly simplifies this code by encapsulating the array offset and assignment. More importantly, it allows us to use the stringizing operator (#) to record the parent group's ProfileGroup declaration as a string (const char *) for use when formatting our output.
The second requirement for our custom profiler was that its profiling code had to be smart enough to make sure that the profiling start (_ProfStart) and stop (_ProfStop) statements were inserted around a function or group of functions in correct pairings. The _ProfStop function first makes sure that profiling was started, and at that point the current time is recorded. This is then used to calculate and store the elapsed time. The number of calls made is incremented, and the starting time is reset to zero. We wanted to avoid the situation where profiling starts multiple times on the same group, or where a _ProfStop appears before its corresponding _ProfStart. To ensure the correct pairing of profiling statements, in _ProfStart a check is made to ensure that the function has not already been called by examining the starting timing value mqwStart. The current time is then recorded into mqwStart using the GetTimeStamp function, a wrapper for RDTSC.
In GetTimeStamp, it should be noted that the eax and edx registers are used for returning the current 64-bit timing value as two 32-bit values, which are subsequently shifted and combined. In this case, there is no need to push and pop the scratch registers since the compiler is smart enough to recognize the inline Assembly use. However, if this timing code was encapsulated in a macro, there's the chance that the compiler might not recognize it and it would be necessary to push and pop the registers.
Another issue we confronted with our custom profiling system was the accuracy and resolution of timing available from a system that uses two function calls from the calling code (first to _ProfStart and then to GetTimeStamp). Since we use this timing code to profile larger subsystems and functions, there will be timing variations due to system factors, such as the execution of other processes by the operating system. If we time significantly smaller portions of code, down to a few lines, it's preferable to inline the RDTSC call or use it within a macro.
Using the RDTSC as a high-resolution timer can present another problem, too. Note that RDTSC is not an instruction that will serialize the execution inside the CPU. In other words, RDTSC can be rescheduled just like any other CPU instruction and may actually be executed before, during, or after the block of code you're attempting to time. Using a fencing (serializing) instruction like CPUID can solve this.
At the end of the program, the _ProfSave function saves the recorded profiling information out to a file. The name of the group, the number of calls, the elapsed time spent in the group, the average time per call, its percentage of its parent group, and the parent group name are listed for each profile group. This output is formatted out using the complicated proftrace macro, which once again uses the stringizing operator (#) to print out the character version of the profile group followed by its information.
Next month we'll wrap up talking about our profiling tools by discussing the memory instrumentation we created for AoK. Then, we'll take an in-depth look at a number of performance issues facing AoK, including unit movement and pathing, and see how they were addressed.
For More Information
|Intel VTune and C/C++ Complier||http://developer.intel.com/vtune|
|MicroQuill HeapAgent and SmartHeap||http://www.microquill.com|
|Performance Analysis and Tuning||
Baecker, Ron, Chris DiGiano, and Aaron Marcus. "Software Visualization for Debugging." Communications of the ACM (Vol. 40, No. 4): April 1997
Marselas, Herb. "Advanced Direct3D Performance Analysis" Microsoft Meltdown Proceedings, 1998
Marselas, Herb. "Don't Starve that CPU! Making the Most of Memory Bandwidth." Game Developers Conference Proceedings, 1999.
Pottinger, Dave. "Coordinated Unit Movement." Game Developer (January and February 1999).
Shanley, Tom. Pentium Pro and Pentium II System Architecture, 2nd ed. Colorado Springs, Colo.: Mindshare Inc., 1997
Acknowledgements: Creating and optimizing AoK was a team effort. I'd like to thank the AoK team, and specifically the other AoK programmers, for help in getting the details of some of that effort into this article. I'd also like to thank everyone at Ensemble Studios for reviewing this article.
Herb Marselas currently works at Ensemble Studios. He helped out on Age of Empires II: The Age of Kings. Shhhh! Please don't tell anyone he's working on a secret 3D-engine project called [deleted]. Previously, he worked at the Intel Platform Architecture Lab where he created the IPEAK Graphics Performance Toolkit. You can reach him at [email protected].