Investigating Corruption
If you suspect that memory corruption is occurring, then your first step is to try to determine if this is actually some form of corruption, and what type it is.
Is it actually corruption?
Just because a value in memory looks rather unusual does not mean that it was not generated by the code that owns that memory. The unusual value might simply be the result of an error in your logic. It could have been quite legally copied from somewhere else. It could be the result of computations involving incorrect data, perhaps data that was already corrupt.
To determine this, you need to establish whether the code that you think is writing to that location actually is writing to it, and see what values it is writing. Ideally, you would add assertions at all locations that you think might legally be writing to that location, and check the range of values that are being written (make sure the “corrupt” value is outside this range).
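As a rough illustration of the assertion idea, here is a minimal C++ sketch. The `Player` structure, the `health` field and the 0..100 range are hypothetical names chosen for the example; the point is only that every legal write goes through one checked function, so a value outside the range cannot have come from a legal writer.

```cpp
#include <cassert>

// Hypothetical example: 'health' is the suspect variable and its
// legal range is assumed to be 0..100.
struct Player {
    int health = 100;
};

constexpr int kMinHealth = 0;
constexpr int kMaxHealth = 100;

// Route every legal write through one function so the range of
// written values is checked in a single place.
inline void set_health(Player& p, int value) {
    assert(value >= kMinHealth && value <= kMaxHealth &&
           "a legal writer produced an out-of-range health value");
    p.health = value;
}

// Returns true if the stored value could have come from a legal writer.
inline bool health_is_plausible(const Player& p) {
    return p.health >= kMinHealth && p.health <= kMaxHealth;
}
```

If the assertion in `set_health` never fires but `health_is_plausible` still fails later, the bad value was probably written by something other than the legal writers.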
Who owns that memory location?
Memory corruption usually occurs when some piece of code is using an area of memory that it should not. The corrupt memory then causes problems in some other code.
There are two primary ways in which this can happen: corrupting a legal area, and using an area illegally.
Consider a piece of code A, that uses an area of memory A(m). If another piece of code, B, also happens to have a pointer to A(m), and writes some data to that, then code B is corrupting memory A(m). This is the normal form of memory corruption.
Now consider if the code A is legally using memory location A(m). Code B is illegally also using some location within (or overlapping) A(m). Code B appears to work correctly, but then code A makes a legal update to A(m), causing code B to manifest a bug. It appears that code A is corrupting memory B(m). However, the fault here is with code B. It has the appearance of corruption, yet may mislead you into thinking that the problem is with code A.
It is important to determine who actually owns the memory location that is being corrupted. Is the “legal” use actually legal? Can you demonstrate that code B actually owns those memory locations? If you can quickly determine that code B does not actually own that memory, then the tracking down of code A is irrelevant, which can save you substantial time.
Repeatable, Fixed Location
If the corruption is consistent, meaning it happens in the same location and under the same conditions, then you are (relatively speaking) in luck. Debugging in this case is a matter of somehow watching that location, and tracking down the cause of the corruption. Since the corruption happens under the same conditions, you should be able either to trap it immediately, or quickly narrow down the possibilities.
Intermittent, Fixed Location
If the corruption happens in the same memory location, yet is intermittent, then this makes tracking down the corruption more difficult. Since you do not know when the corruption occurs, you cannot be as focused in your search, and must rely more on general observation as to the nature of the corruption when tracking down the cause.
Intermittent, Variable Location
If the corruption happens in varying locations, and at unpredictable times, then your debugging options are often limited to making observations about the corruption after it has occurred.
Determine the location of the corruption
If the memory corruption is the immediate cause of the bug, such as with an address error due to a corrupt pointer, then you may be able to immediately determine the effect of the corruption simply by seeing what address was being accessed at the time of the bug.
If the memory corruption is an intermediate cause, then you will track down the address of the corruption in the process of analyzing the immediate cause of the bug, and any intermediate causes that lie between the root cause and the symptoms.
Hardware Breakpoints
If your target platform has some kind of break-on-access breakpoint, then use this as your first line of investigation when debugging memory corruption with a known address. Simply set the debugger to trigger a breakpoint when a memory location changes, then when the breakpoint trips, see what code is executing.
This technique can work very well if the location that is being corrupted contains data that is relatively static. However, if the location contains some dynamic variable that changes hundreds of times per frame, then you may have some difficulty in finding the single write access that is causing the problem.
In that case, you may be able to augment your write-access breakpoint with a conditional check that verifies that the data being written to the corrupt location is in the valid range.
Sometimes memory is corrupted with values that are within the valid range, but nonetheless are wrong. Your options here are more limited:
- Repeatedly run the code, and each time the breakpoint trips, look at the call stack until you see something that you do not recognize as code that can legally write to this location.
- If the legal places that write to this location are known and relatively limited, then update them to first write to some separate location. First ensure the corruption does not also affect that separate location, and then update the breakpoint condition to check that the value written matches the stored value.
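The second option can be sketched in C++ roughly as follows. The names `Guarded`, `g_shadow`, `legal_write` and `is_intact` are hypothetical, and this is only one way of keeping a separate copy; the idea is that every legal writer records its value in the shadow location first, so a mismatch means someone else wrote to the real location.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical structure holding the location that is being corrupted.
struct Guarded {
    std::uint32_t value = 0;
};

// Shadow copy, kept well away from the suspect location. It records
// what the legal writers last stored.
static std::uint32_t g_shadow = 0;

// Every legal writer calls this instead of assigning directly.
inline void legal_write(Guarded& g, std::uint32_t v) {
    g_shadow = v;  // record the intended value first
    g.value = v;   // then perform the real write
}

// True while the real location still matches the last legal write.
inline bool is_intact(const Guarded& g) {
    return g.value == g_shadow;
}
```

With this in place, the write-access breakpoint condition can compare the watched location against the shadow, so it only stops on writes that did not come through `legal_write`.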
Great job!
I suggest that everyone having these problems use a memory checking library like Fortify, ElectricFence, or even Valgrind. Personally, I initialize certain memory zones with 0xBEBACAFE ("drink coffee" in Spanish) instead of the famous 0xDEADBEEF.
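For what it's worth, the fill-pattern idea might look something like this in C++. The function names are made up for the example and the memory zone is modelled as a plain `std::vector`, which is an assumption; a real allocator would fill its own blocks.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

constexpr std::uint32_t kFillPattern = 0xBEBACAFEu;

// Fill a memory zone with the recognizable pattern.
inline void fill_with_pattern(std::vector<std::uint32_t>& words) {
    for (auto& w : words) {
        w = kFillPattern;
    }
}

// Returns the index of the first word that no longer holds the
// pattern, or words.size() if the whole zone is untouched.
inline std::size_t first_modified_word(const std::vector<std::uint32_t>& words) {
    for (std::size_t i = 0; i < words.size(); ++i) {
        if (words[i] != kFillPattern) {
            return i;
        }
    }
    return words.size();
}
```

Seeing the pattern in a crash dump tells you the memory was never initialized by its owner, and scanning a zone that should be unused tells you whether something wrote into it.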
- avoid explicit new / delete / malloc / free
- avoid "raw" pointers
Instead:
- use smart pointers
- use containers
In any case where you are using raw pointers or explicit allocation, make sure you have clearly defined (and documented) the responsibilities: who "owns" the memory, who is responsible for freeing it, and, when there is a raw pointer to memory which it does not own, why you can be sure the pointer will be valid during its existence.
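A small sketch of what I mean, using a hypothetical `TextureCache` (the class and its members are invented for the example). The container of smart pointers owns the objects, and the one place that hands out a raw pointer documents how long it stays valid.

```cpp
#include <cstddef>
#include <memory>
#include <string>
#include <utility>
#include <vector>

// Hypothetical resource type.
struct Texture {
    std::string name;
};

class TextureCache {
public:
    // The cache owns every Texture. The returned pointer is non-owning
    // and stays valid until clear() is called or the cache is destroyed.
    // The Texture's address is stable even if the vector reallocates,
    // because the vector stores unique_ptrs rather than Textures.
    Texture* add(std::string name) {
        textures_.push_back(std::make_unique<Texture>(Texture{std::move(name)}));
        return textures_.back().get();
    }

    std::size_t size() const { return textures_.size(); }

    // Destroys all textures; every pointer returned by add() dangles afterwards.
    void clear() { textures_.clear(); }

private:
    std::vector<std::unique_ptr<Texture>> textures_;
};
```

There is no explicit new or delete anywhere, and the comment on `add` answers the "why is this raw pointer valid" question in one place.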
- Always initialize pointers with NULL
- Set pointers to NULL immediately after deleting or freeing them
- If a function returns allocated memory, there must be a reciprocal function that receives a pointer and frees it (create_node should have a destroy_node, for example)
- If a function does not allocate memory, it mustn't release it
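These rules could be sketched like this. The `Node` layout is hypothetical, and having `destroy_node` take a pointer to the caller's pointer is my own assumption, made so that the function can also apply the "set to NULL after freeing" rule on the caller's behalf.

```cpp
#include <cstdlib>

// Hypothetical node type for the example.
struct Node {
    int value;
    Node* next;
};

// Allocates a node. Returns nullptr if the allocation fails.
inline Node* create_node(int value) {
    Node* n = static_cast<Node*>(std::malloc(sizeof(Node)));
    if (n != nullptr) {
        n->value = value;
        n->next = nullptr;  // always initialize pointers
    }
    return n;
}

// Reciprocal of create_node. Frees the node and nulls the caller's
// pointer, so a second call or a later dereference is easy to detect.
inline void destroy_node(Node** n) {
    if (n == nullptr || *n == nullptr) {
        return;
    }
    std::free(*n);
    *n = nullptr;
}
```

Calling `destroy_node(&n)` twice is then harmless, and any use of `n` after destruction hits a null pointer instead of freed memory.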
Different strokes I guess :-)
A similar approach that can be used to track object state changes at code level can be found at:
http://blogs.msdn.com/gpalem/archive/2008/06/19/tracking-c-variable-state-changes.aspx
It's an off-line technique for tracking C++ objects. Similar to "break on access" for memory, C++ templates can be used to implement a "report on change" pattern for any member variable value changes. It may not solve all problems, but it can often be useful.
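To give an idea of the pattern, here is a minimal sketch of a "report on change" wrapper. This is not the code from the linked post; the `Tracked` template and its callback signature are assumptions made for illustration.

```cpp
#include <functional>
#include <utility>

// Wraps a value and invokes a callback whenever an assignment
// through the wrapper actually changes it.
template <typename T>
class Tracked {
public:
    using Callback = std::function<void(const T& old_value, const T& new_value)>;

    Tracked(T initial, Callback on_change)
        : value_(std::move(initial)), on_change_(std::move(on_change)) {}

    Tracked& operator=(const T& v) {
        if (!(v == value_)) {
            if (on_change_) {
                on_change_(value_, v);  // report before storing
            }
            value_ = v;
        }
        return *this;
    }

    // Read access behaves like the underlying value.
    operator const T&() const { return value_; }

private:
    T value_;
    Callback on_change_;
};
```

Note that this only sees writes made through the wrapper. A stray pointer write into the same bytes bypasses it, which is why it complements rather than replaces a hardware breakpoint.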
(1) Code Corruption, Jump Table Corruption, and Stack Overwriting Code are prevented by the memory protection hardware on most modern platforms. For PC and current consoles, it is typical for code space to be read-only, so the program will crash immediately if it tries to modify its own code. Most compilers also put v-tables and other jump tables (e.g. from switch statements) into a read-only code segment. You can also expect non-code to be non-executable on modern platforms, so wild jumps or overwritten return addresses have a good chance of being stopped dead by the memory protection hardware if they point into the data segment or stack. As the article mentions, stack-overflow on PC will usually hit a guard page, stopping the program before it destroys your debugging context. However, one processor which game developers have to deal with unfortunately has no memory protection at all: the SPUs of the PS3 Cell processor. SPU code can happily overwrite itself, jump to a data address and execute garbage, or overflow its stack.
(2) Faulty RAM on consoles - on my last project we discovered no less than four separate devkits with faulty RAM problems of some sort. They were each exhibiting signs of memory corruption, and after the first one was discovered (it happened to be mine) we wrote a little test program to stress-test the console's RAM with a variety of test patterns. All our other devkits passed the test program, but these four failed its tests. In the first such kit (mine), the symptom was a 1-bit error in a particular offset into a 64 KB page of virtual memory -- different virtual address each time we ran it, but the bottom 16 bits were always the same -- so probably it was the same physical address underneath. I wasted a full day debugging the corruption before we caught on that the bug only happened on my kit, not my neighbor's. The code worked flawlessly on other kits. So we wrote that test program and proved that the 1-bit error had nothing to do with our program, but was just a hardware problem of some kind. After that, the first thing we did when memory corruption was suspected was to run this test program on the affected kit. Over the course of several months, three other devkits failed its tests, so we knew immediately that it was a hardware problem and didn't waste our time trying to debug a non-existent bug. I know bad RAM sounds implausible, but I suppose we use our devkits heavily for years and years, and we don't always turn them off when we're not using them... anyways, for console developers, be aware of the possibility of bad RAM when investigating strange corruption bugs.
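Our actual test program isn't reproduced here, but the general shape of a pattern test is something like the sketch below. The function names and the choice of patterns are assumptions for the example. A real test would need a buffer much larger than the CPU caches and platform-specific control over physical memory, which this sketch does not attempt.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Writes 'pattern' to every word, then reads it all back.
// Returns the number of words that did not read back correctly.
// 'volatile' keeps the compiler from optimizing the accesses away.
inline std::size_t test_pattern(volatile std::uint32_t* mem, std::size_t words,
                                std::uint32_t pattern) {
    for (std::size_t i = 0; i < words; ++i) {
        mem[i] = pattern;
    }
    std::size_t errors = 0;
    for (std::size_t i = 0; i < words; ++i) {
        if (mem[i] != pattern) {
            ++errors;
        }
    }
    return errors;
}

// Runs a few fixed patterns plus a walking-ones pass over the buffer.
inline std::size_t stress_test(std::vector<std::uint32_t>& buffer) {
    const std::uint32_t patterns[] = {0x00000000u, 0xFFFFFFFFu,
                                      0xAAAAAAAAu, 0x55555555u};
    std::size_t errors = 0;
    for (std::uint32_t p : patterns) {
        errors += test_pattern(buffer.data(), buffer.size(), p);
    }
    for (int bit = 0; bit < 32; ++bit) {
        errors += test_pattern(buffer.data(), buffer.size(),
                               std::uint32_t{1} << bit);
    }
    return errors;
}
```

On healthy memory `stress_test` should return zero; a stuck bit would show up as a nonzero count on at least one of the patterns.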
(3) Put Magic Numbers in structures to aid debugging: For non-final builds, you can add a 32-bit int field to the start of each class/structure you use, and in the constructor or initialization function, set this to a recognizable ASCII code (which is different for each type of structure). E.g. you might set it initially to 'ANIa' or 'ANId' for animation structures, 'OBJx' or 'CHAR' for objects or characters, etc. This can make it easier to recognize structures in raw memory when you have to debug something like a memory corruption. Also, when the structure is freed you can change the magic value to something else, such as the same value with the case of each character toggled ('aniA', 'objX', etc.). Whenever it accesses these objects, your code can also assert that the magic number has the original value, helping you to catch any dangling pointer errors. This technique has some runtime overhead, but depending on your circumstances it might be worth it.
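A minimal sketch of the magic-number idea, assuming a hypothetical `Animation` structure. The helper names are invented, and the explicit `mark_freed` call stands in for whatever your pool or free routine does; it is only meaningful while the storage itself is still readable, for example in a pooled allocator.

```cpp
#include <cstdint>

// Packs four ASCII characters into a 32-bit tag, first character in
// the most significant byte.
constexpr std::uint32_t make_magic(char a, char b, char c, char d) {
    return (static_cast<std::uint32_t>(static_cast<unsigned char>(a)) << 24) |
           (static_cast<std::uint32_t>(static_cast<unsigned char>(b)) << 16) |
           (static_cast<std::uint32_t>(static_cast<unsigned char>(c)) << 8) |
           static_cast<std::uint32_t>(static_cast<unsigned char>(d));
}

// Flipping bit 0x20 of each byte toggles the case of ASCII letters,
// so 'ANIa' becomes 'aniA'.
constexpr std::uint32_t toggle_case(std::uint32_t magic) {
    return magic ^ 0x20202020u;
}

constexpr std::uint32_t kAnimMagic = make_magic('A', 'N', 'I', 'a');

struct Animation {
    std::uint32_t magic = kAnimMagic;  // first field, non-final builds only
    float time = 0.0f;

    // Assert on this before using the object to catch dangling pointers.
    bool is_live() const { return magic == kAnimMagic; }

    // Called by the free routine while the storage is still readable.
    void mark_freed() { magic = toggle_case(kAnimMagic); }
};
```

Code that touches an `Animation` can assert `is_live()` first, and a freed object is still recognizable in a memory dump by its lower-case tag.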
"we discovered no less than four* separate devkits with faulty RAM problems of some sort."