Implementation
The previous articles already covered many implementation details for
PS2 and Xbox, most of which still apply to this version. The articles
cover:
- How to intercept all allocations.
- How to retrieve callstack information.
- How to read symbol information.
In this part we will revisit each of
these subjects for PC. Some solutions may also apply to Xbox
development, but they have only been tested on PC.
We will start off with a very powerful way to intercept all
allocations. Then, an improved version of the StoreCallStack function
from the previous articles is presented. Finally, we will again look
at symbol loading, this time taking it a step further.
The tool is built with the .NET
framework using C#. When applicable, C# details are presented.
Intercepting all allocations
To get a good view of your memory usage, you need to intercept as many
allocations as possible. A good place to start is by overloading
operator new and operator delete.
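As a starting point, the replacement operators could look roughly like the sketch below. It is a minimal illustration, not the tool's actual code: the counters g_newCalls, g_deleteCalls and g_bytesRequested are hypothetical stand-ins for whatever metadata the tracker records, and the new-handler loop that a fully conforming operator new performs is left out for brevity.

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>
#include <cstdlib>
#include <new>

// Hypothetical counters; a real tracker would record address, size and call stack.
static std::atomic<std::size_t> g_newCalls{0};
static std::atomic<std::size_t> g_deleteCalls{0};
static std::atomic<std::size_t> g_bytesRequested{0};

// Replacement for the global scalar operator new.
void* operator new(std::size_t size) {
    if (size == 0) size = 1;            // a zero-sized request must still yield a valid pointer
    void* p = std::malloc(size);
    if (!p) throw std::bad_alloc();     // simplified: no new-handler loop
    g_newCalls.fetch_add(1, std::memory_order_relaxed);
    g_bytesRequested.fetch_add(size, std::memory_order_relaxed);
    return p;
}

// Replacement for the global scalar operator delete.
void operator delete(void* p) noexcept {
    if (!p) return;                     // deleting a null pointer is a no-op
    g_deleteCalls.fetch_add(1, std::memory_order_relaxed);
    std::free(p);
}

// The array forms simply forward to the scalar forms.
void* operator new[](std::size_t size) { return ::operator new(size); }
void operator delete[](void* p) noexcept { ::operator delete(p); }
```

Because the counters are plain atomics, updating them does not allocate, so the replacement operators cannot re-enter themselves.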
While this does intercept all of the new and delete allocations in
your game, it does not cover allocations performed by direct calls to
malloc, LocalAlloc, GlobalAlloc or HeapAlloc. A large chunk of your
game’s memory footprint may be allocated by DirectX, which uses these
lower-level functions to allocate its resources. As you may know, the
PC does not provide functions to easily intercept those lower-level
allocations. Xbox has XbMemAlloc, which simplifies allocation tracking
a great deal. On PC, it seems as if the best you can do to intercept
allocations is to overload the new and delete operators.
There are actually ways of intercepting all of the allocations, but
the solution is far from obvious. We can hook into the DLLs containing
the low-level allocation functions. The great thing about this
approach is that, because all modules of the running process use the
DLL that is being patched, literally all of the allocations of the
running process are intercepted.
The methodology of hooking into DLLs is quite complex, but thankfully
there are people who have already encapsulated this functionality in a
library. The first one is Detours, by Microsoft Research [3]. To get
up and running quickly, take a look at the tracemem sample, which
hooks into the HeapAlloc function. This library is free for
non-commercial use.
A second solution is to use a library from Matt Conover [4]. At the
time of writing, this library seems a bit more difficult to get
started with, but it has two advantages: first, it is free, even for
commercial use, and second, Matt proved to be very helpful in getting
his software to run properly. If you want to get up and running with
this library, look at the HeapHook sample included in the package.
Unfortunately, that is not all there is to it. We are treading on
dangerous ground with this approach. We have to be very much aware
that all of the allocations being performed will now pass through our
allocation functions. That means that if a function that allocates
memory from the heap is called during our heap callback, we may end up
in an endless loop. This will surely be the case if the callback uses
sockets to send its data to the tool. Sockets aside, you will be
surprised how often allocations are performed. You may suggest that a
check for recursion in the heap callback easily prevents this. This is
true, but let’s take a look at an even more dangerous scenario.
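The recursion check mentioned above could be sketched as follows, assuming a per-thread flag is acceptable. OnAllocation is a hypothetical stand-in for the hooked heap callback, and the nested parameter only simulates bookkeeping code that allocates again while the callback is running.

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>

// Per-thread flag: true while this thread is inside the allocation callback.
static thread_local bool t_inCallback = false;

// Hypothetical statistics, used here only to make the behaviour observable.
static std::atomic<std::size_t> g_recorded{0};
static std::atomic<std::size_t> g_recordedBytes{0};
static std::atomic<std::size_t> g_skipped{0};

// Stand-in for the hooked heap callback. 'nested' simulates bookkeeping
// that triggers further allocations on the same thread.
void OnAllocation(std::size_t size, int nested = 0) {
    if (t_inCallback) {
        // Re-entered from our own bookkeeping: do not record, do not recurse.
        g_skipped.fetch_add(1, std::memory_order_relaxed);
        return;
    }
    t_inCallback = true;
    g_recorded.fetch_add(1, std::memory_order_relaxed);
    g_recordedBytes.fetch_add(size, std::memory_order_relaxed);
    for (int i = 0; i < nested; ++i) {
        OnAllocation(16);               // simulated allocation made by the callback itself
    }
    t_inCallback = false;
}
```

The flag only guards against re-entry on the same thread; it does nothing for the cross-thread socket scenario described next.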
Imagine that some thread is using sockets (probably as a result of an
asynchronous operation) and right in the middle of that socket
operation, an allocation is performed. Our allocation function will be
called and our own socket operation will be started. This generally
messes up the internal state of the socket library horribly. Checking
for recursion in the callback function won’t help you this time, so we
need another solution to this problem.
To illustrate that we are indeed treading on dangerous ground, I would
like to share what happened with my first attempt to solve this
problem. My idea was to check what kind of module called the
allocation function. If it were one of the socket DLLs, no socket
operations would be performed. I wasn’t all too thrilled about this
solution, and it didn’t work either. I tried to match the return
address in our callback to the address ranges of the socket DLLs. In
very specific cases, when I used the DbgHelp library to obtain
information about the socket DLLs, the application crashed. It turned
out that allocations were being performed during the loading of DLLs.
DbgHelp did not appreciate us retrieving module information while a
new module was being loaded, so I had to search for another solution.
The one thing that solves our problems is fortunately relatively
simple: we need to use a worker thread for our communication with the
tool. When an allocation is performed, our metadata is collected and
stored in a buffer, and our worker thread is kicked. The worker thread
then sends the data to the client. This always breaks any form of
recursion. It even makes it possible to monitor the allocations that
are performed by our own sockets, if desired.
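The worker-thread arrangement can be sketched with standard C++ threading primitives. Everything in this sketch is an assumption made for illustration: AllocEvent, AllocReporter and Send are hypothetical names, Send is an empty placeholder for the real socket code, and the std::deque allocates on push, so a real hook would need a preallocated buffer instead.

```cpp
#include <cassert>
#include <condition_variable>
#include <cstddef>
#include <deque>
#include <mutex>
#include <thread>

// Hypothetical record of one allocation; a real tool would add a call stack.
struct AllocEvent {
    const void* address;
    std::size_t size;
};

class AllocReporter {
public:
    AllocReporter() : worker_(&AllocReporter::Run, this) {}
    ~AllocReporter() { Stop(); }

    // Called from the allocation callback: store the event and kick the worker.
    void Push(const AllocEvent& event) {
        {
            std::lock_guard<std::mutex> lock(mutex_);
            pending_.push_back(event);  // note: allocates; see the lead-in
        }
        wake_.notify_one();
    }

    // Lets the worker drain what is left, then joins it.
    void Stop() {
        {
            std::lock_guard<std::mutex> lock(mutex_);
            stopping_ = true;
        }
        wake_.notify_one();
        if (worker_.joinable()) worker_.join();
    }

    std::size_t SentCount() const {
        std::lock_guard<std::mutex> lock(mutex_);
        return sent_;
    }

private:
    void Run() {
        std::unique_lock<std::mutex> lock(mutex_);
        for (;;) {
            wake_.wait(lock, [this] { return stopping_ || !pending_.empty(); });
            if (pending_.empty()) return;   // only reachable once stopping_ is set
            AllocEvent event = pending_.front();
            pending_.pop_front();
            lock.unlock();                  // never hold the lock while sending
            Send(event);
            lock.lock();
            ++sent_;
        }
    }

    // Placeholder for the socket send to the tool.
    void Send(const AllocEvent&) {}

    mutable std::mutex mutex_;
    std::condition_variable wake_;
    std::deque<AllocEvent> pending_;
    bool stopping_ = false;
    std::size_t sent_ = 0;
    std::thread worker_;                    // declared last so it starts after the other members
};
```

The allocation callback only ever calls Push, so all socket work happens on the worker thread and never in the middle of someone else's socket operation.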
We are close now, but there is one more thing to watch out for.
Deadlocks may occur due to the chain of events shown in figure 7. This
situation can be avoided by using double buffering, and by protecting
the double buffering logic with critical sections, but just the double
buffering logic. The allocation callback should never need to wait for
the send operation to complete. The downside to this approach is that
the data buffer needs to be large enough to contain the data of a
series of allocation events. On the positive side, this form of
batching may speed up the send process.
Figure 7: The chain of events that may cause a deadlock situation.
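One way to arrange the double buffering described above is sketched below, with std::mutex standing in for a Win32 critical section. The names DoubleBuffer, EventBuffer, Write and Swap are hypothetical, and the fixed capacity reflects the remark that the data buffer must be large enough: when it is full, this sketch simply drops the event and reports that to the caller.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <mutex>

// Hypothetical record of one allocation.
struct AllocEvent {
    const void* address;
    std::size_t size;
};

// One fixed-size buffer; no heap allocation is needed to fill it.
template <std::size_t Capacity>
struct EventBuffer {
    std::array<AllocEvent, Capacity> events{};
    std::size_t count = 0;
};

template <std::size_t Capacity>
class DoubleBuffer {
public:
    // Allocation callback side. Returns false if the write buffer is full,
    // in which case the event is dropped.
    bool Write(const AllocEvent& event) {
        std::lock_guard<std::mutex> lock(mutex_);
        EventBuffer<Capacity>& buffer = buffers_[writeIndex_];
        if (buffer.count == Capacity) return false;
        buffer.events[buffer.count++] = event;
        return true;
    }

    // Worker side. Swaps the buffers and returns the filled one, which stays
    // untouched by Write until the next call to Swap. The lock covers only
    // the swap; the caller sends the returned data without holding it.
    const AllocEvent* Swap(std::size_t& count) {
        std::lock_guard<std::mutex> lock(mutex_);
        EventBuffer<Capacity>& filled = buffers_[writeIndex_];
        writeIndex_ ^= 1;
        buffers_[writeIndex_].count = 0;    // the other buffer becomes the empty write buffer
        count = filled.count;
        return filled.events.data();
    }

private:
    std::mutex mutex_;
    EventBuffer<Capacity> buffers_[2];
    std::size_t writeIndex_ = 0;
};
```

Since the lock is held only for a copy or an index flip, the allocation callback never waits on a send in progress.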
As you may understand by now, this methodology is not for the faint of
heart. It may be good to start with overloading operator new and
delete. To understand how to overload new and delete properly, I would
advise reading chapter 8 of Scott Meyers’ Effective C++ [10].