Block Corruption
Block corruption is where a group of words in memory is corrupted more or less together. The block can be any size, but we are generally talking about anything from four bytes to 1024 bytes.
The corrupt data in the block may contain any combination of the types of corrupt data found in a single word, as discussed previously. There are also a few situations specific to block corruption.
Partial corruption
When the data in the affected block of memory is not entirely corrupt, and only every few bytes or words have been changed, this is a good indication that we are dealing with a pointer to a data structure (a structure or class) that has gone astray.
The most likely explanation is a dangling pointer. The code is continuing to update some data structure that has already been freed.
Full corruption
If the block of corruption is contiguous and no byte within it remains unchanged (except for a few common bytes, like zero, that might exist frequently in both corrupt and correct data), then it seems likely that the data structure has been initialized, reset, or copied from somewhere else.
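If you happen to have a known-good copy of the block (for example, a snapshot taken earlier in the same run), the distinction between partial and full corruption can be roughed out in code. The sketch below is only an illustration: `BlockDiff`, `DiffBlock` and `ChangedDensity` are hypothetical names, and the snapshot itself is an assumption.

```cpp
#include <cstddef>
#include <cstdint>

// Hypothetical summary of how a corrupted block differs from a good copy.
struct BlockDiff
{
    std::size_t first;    // offset of the first changed byte
    std::size_t last;     // offset of the last changed byte
    std::size_t changed;  // total number of changed bytes
};

// Compare a known-good snapshot against the corrupted block, byte by byte.
inline BlockDiff DiffBlock(const std::uint8_t* good, const std::uint8_t* bad, std::size_t size)
{
    BlockDiff d = { 0, 0, 0 };
    for (std::size_t i = 0; i < size; ++i)
    {
        if (good[i] != bad[i])
        {
            if (d.changed == 0)
            {
                d.first = i;
            }
            d.last = i;
            ++d.changed;
        }
    }
    return d;
}

// Fraction of bytes changed inside the corrupted span (0.0 if nothing changed).
inline double ChangedDensity(const BlockDiff& d)
{
    if (d.changed == 0)
    {
        return 0.0;
    }
    return static_cast<double>(d.changed) / static_cast<double>(d.last - d.first + 1);
}
```

A density close to 1.0 points toward full corruption, while a low density over a wide span looks more like the partial case. Where to draw the line is a judgment call, since bytes such as zero can match by coincidence.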
Unit Vectors
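A common arrangement of three floats is in a vector, and a common sub-group of vectors is the unit vector. Unit vectors are quite recognizable in memory, since they consist of three small floating point numbers (in the range -1.0 to +1.0), and so they frequently start with the hex digit 3 or B. In IEEE 754 single precision, any value with a magnitude from about 2^-31 up to (but not including) 2.0 has a top hex digit of 3 when positive and B when negative.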
Here's an example of a unit vector sitting incongruously in the middle of a string.
Looking at the hexadecimal, it is not immediately obvious that anything is wrong. We can see however from the text display column that there seems to be some garbage bytes in the middle of the path name.
Looking more closely at the garbage bytes, viewed as words, we can see that two of them start with 3, and one starts with B - a very good indication that we are dealing with a vector of small numbers, possibly a unit vector.
We can then switch to a floating point view, which gives us:
This confirms the nature of the corruption. We have three floating point numbers in the range -1.0 to +1.0, and a quick calculation confirms that the sum of their squares comes out at about 1.0. The length of the vector is therefore 1.0, making it a unit vector.
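The same check can be automated when you are scanning a dump rather than eyeballing it. The following is a minimal sketch, assuming 32-bit IEEE 754 floats; `WordToFloat`, `LooksLikeUnitVector` and the default tolerance are hypothetical choices, not something taken from a particular debugger.

```cpp
#include <cmath>
#include <cstdint>
#include <cstring>

// Reinterpret a raw 32-bit word from the dump as a float.
inline float WordToFloat(std::uint32_t word)
{
    static_assert(sizeof(float) == sizeof(std::uint32_t), "assumes a 32-bit float");
    float f;
    std::memcpy(&f, &word, sizeof f);
    return f;
}

// True if the three words, read as floats, have a squared length close to 1.0.
inline bool LooksLikeUnitVector(std::uint32_t wx, std::uint32_t wy, std::uint32_t wz,
                                float tolerance = 0.01f)
{
    const float x = WordToFloat(wx);
    const float y = WordToFloat(wy);
    const float z = WordToFloat(wz);
    const float lengthSquared = x * x + y * y + z * z;
    // NaN and infinity fail this comparison, so text bytes rarely pass.
    return std::fabs(lengthSquared - 1.0f) <= tolerance;
}
```

For instance, the words 0x3F000000, 0xBF000000 and 0x3F3504F3 decode to 0.5, -0.5 and roughly 0.7071, whose squares sum to about 1.0, so the function accepts them. Words made of ASCII text generally decode to very large or meaningless values and are rejected.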
Causes and Effects of Corruption
Once you have determined the likely nature of the corruption, you need to identify the piece of code that caused the corruption. If you are not able to directly observe the corruption taking place, you may have to selectively instrument suspicious pieces of code.
To narrow down the code that needs to be considered, we should have a look at the most common direct causes of corruption, and examine how each cause manifests itself.
Buffer Overruns
A buffer overrun is perhaps the most common type of bug. You often hear about “buffer exploits” in the hacking world. Here a programmer has neglected to check that the size of the input data fits into the destination space. The data overruns the buffer, and possibly overwrites some space used for code. By adding some appropriate code to the end of the data, an industrious hacker can inject some of his own code into an application and take control of it.
Buffer exploits are less of a security problem for game developers, unless they are accepting data over the internet. However, buffer overflows are still a very significant cause of bugs.
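The missing check described above is usually a one-line comparison of sizes before the copy. Here is a minimal sketch of that idea; `CopyChecked` is a hypothetical helper, not part of any standard library.

```cpp
#include <cstddef>
#include <cstring>

// Copies srcSize bytes into dst only if they fit; returns false otherwise.
inline bool CopyChecked(void* dst, std::size_t dstSize, const void* src, std::size_t srcSize)
{
    if (dst == nullptr || src == nullptr || srcSize > dstSize)
    {
        return false;  // refusing the copy is better than overrunning the buffer
    }
    std::memcpy(dst, src, srcSize);
    return true;
}
```

The caller has to decide what a refused copy means (truncate, report an error, assert in a debug build), but in every case the memory after the buffer is left alone.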
Bad Pointers
If the value of a pointer is incorrect, then it can corrupt memory (as well as providing bad data to whoever uses that pointer). The value in a pointer can become “bad” in a number of ways.
Dangling Pointer - If a memory block is de-allocated or freed, yet some pointer still references that block (or an object within that block), then that pointer is said to be a “Dangling Pointer”. The value of the pointer has not changed, however the pointer has become bad since it no longer points to valid data.
Incorrect Pointer Calculation - The pointer could be generated using incorrect pointer arithmetic, or using other values that are themselves incorrect, causing the value of the pointer to be calculated incorrectly. Pointer arithmetic might also return a pointer out of range of the target buffer - a form of buffer overflow.
Corrupt Pointers - The memory in which the pointer is stored may itself have become corrupted due to some unrelated cause. Thus corruption can cause corruption, extending the chain of causes.
Bad Local Pointers
If a pointer is created to an object that has local scope, then that pointer will only be valid while that object is in scope. See Listing 1.
Listing 1
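void CheckThing(CThing *p_x)
{
    CThing p_thing;             // local object, lives on the stack
    p_thing = *p_x;
    if (ThingCheck(p_thing))
    {
        AddToList(p_thing);     // BUG: the list keeps the address of this local
    }
}                               // p_thing goes out of scope here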
Here a local variable p_thing is being used for some temporary purpose. However, during the course of the function the variable is added to some global list, then the function returns.
The result is that there is now a pointer in some list somewhere that points to memory that is used by the stack. This will not be an immediate problem, since when the function returns the stack pointer will recede higher in memory, leaving the instance of p_thing safely below the stack. After that, one of two things might happen.
Object gets corrupted - the p_thing object no longer legally exists; however, its binary image is still in memory, and code can continue to use it without problems until the stack once more descends below that location in memory. At that point the object may get corrupted. This, in a sense, is not a memory corruption bug, since the writes are legal, and in the correct place. But it behaves very much like a corruption bug.
Stack gets corrupted - the object is in a list, and presumably some operations are going to be carried out with it. Once the stack descends past this point in memory, updating the object via the list will corrupt memory that is legally being used by the stack. This could be a return address, it could be a saved register value, or it could be local variables in some routine higher up the call stack. Whichever it is, the effects will be deferred until the function call stack returns to that point, which could be quite distant from the cause of the problem.
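One way to avoid both outcomes is to make the list own a heap copy rather than remember the address of a stack object. The sketch below assumes C++14 and uses stand-in definitions: the article does not define `CThing` or `ThingCheck`, and `ThingList` is a hypothetical replacement for whatever `AddToList` stores into.

```cpp
#include <memory>
#include <vector>

// Stand-in type and check; the real ones are not shown in the article.
struct CThing
{
    int value;
};

inline bool ThingCheck(const CThing& thing)
{
    return thing.value > 0;
}

// The list owns its elements, so they outlive the function that added them.
inline std::vector<std::unique_ptr<CThing>>& ThingList()
{
    static std::vector<std::unique_ptr<CThing>> list;
    return list;
}

void CheckThing(const CThing* p_x)
{
    if (p_x != nullptr && ThingCheck(*p_x))
    {
        // Copy onto the heap instead of storing the address of a local.
        ThingList().push_back(std::make_unique<CThing>(*p_x));
    }
}
```

The cost is an allocation per added object, which may or may not be acceptable in a game's inner loop. The point is only that the stored pointer no longer refers to stack memory.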
Great job!
I suggest that everyone having these problems use a memory checking library like Fortify, ElectricFence, or even Valgrind. Personally, I initialize certain memory zones with 0xBEBACAFE ("drink coffee" in Spanish) instead of the famous 0xDEADBEEF.
- avoid explicit new / delete / malloc / free
- avoid "raw" pointers
Instead:
- use smart pointers
- use containers
In any case where you are using raw pointers or explicit allocation, make sure you have clearly defined (and documented) the responsibilities: who "owns" the memory, who is responsible for freeing it, and, when there is a raw pointer to memory which it does not own, why you can be sure the pointer will be valid during its existence.
- Always initialize pointers with NULL
- Set pointers to NULL immediately after deleting or freeing them
- If a function returns allocated memory, there must be a reciprocal function that receives a pointer and frees it (create_node should have a destroy_node, for example)
- If a function does not allocate memory, it mustn't release it
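As a rough illustration of the rules in this list, here is a sketch of a paired allocate and release function. The `Node` layout is hypothetical, and taking a pointer-to-pointer in `destroy_node` is just one way to make "set to NULL after freeing" automatic.

```cpp
#include <cstdlib>

// Hypothetical node type used only for this example.
struct Node
{
    int value;
    Node* next;
};

// Allocates a node; the caller must release it with destroy_node.
inline Node* create_node(int value)
{
    Node* node = static_cast<Node*>(std::malloc(sizeof(Node)));
    if (node != NULL)
    {
        node->value = value;
        node->next = NULL;
    }
    return node;
}

// Frees the node and nulls the caller's pointer so it cannot dangle.
inline void destroy_node(Node** node)
{
    if (node != NULL && *node != NULL)
    {
        std::free(*node);
        *node = NULL;
    }
}
```

Note that this only clears the one pointer passed in. Any other copies of the address elsewhere in the program still dangle, which is why the ownership rules above matter.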
Different strokes I guess :-)
A similar approach that can be used to track object state changes at code level can be found at:
http://blogs.msdn.com/gpalem/archive/2008/06/19/tracking-c-variable-state-changes.aspx
It's an off-line technique for tracking C++ objects. Similar to "break on access" of memory, C++ templates can be used to implement a "report on change" pattern for any member variable value changes. It may not solve all problems, but it can often be useful.
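To give a flavor of what a "report on change" wrapper might look like, here is a minimal sketch. It is not the implementation from the linked post; `Tracked` and its callback signature are hypothetical, and it assumes `T` supports `operator==`.

```cpp
#include <functional>
#include <utility>

// Wraps a value and calls a user-supplied function whenever it changes.
template <typename T>
class Tracked
{
public:
    using Callback = std::function<void(const T& oldValue, const T& newValue)>;

    Tracked(T initial, Callback onChange)
        : m_value(std::move(initial)), m_onChange(std::move(onChange))
    {
    }

    Tracked& operator=(const T& newValue)
    {
        if (!(m_value == newValue))
        {
            if (m_onChange)
            {
                m_onChange(m_value, newValue);  // report before the change is applied
            }
            m_value = newValue;
        }
        return *this;
    }

    // Read access behaves like the wrapped value.
    operator const T&() const { return m_value; }

private:
    T m_value;
    Callback m_onChange;
};
```

A wrapper like this only sees writes made through its assignment operator. A stray write from a bad pointer bypasses it completely, so it helps with tracing legitimate state changes rather than catching the corruption itself.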
(1) Code Corruption, Jump Table Corruption, and Stack Overwriting Code are prevented by the memory protection hardware on most modern platforms. For PC and current consoles, it is typical for code space to be read-only, so the program will crash immediately if it tries to modify its own code. Most compilers also put v-tables and other jump tables (e.g. from switch statements) into a read-only code segment. You can also expect non-code to be non-executable on modern platforms, so wild jumps or overwritten return addresses have a good chance of being stopped dead by the memory protection hardware if they point into the data segment or stack. As the article mentions, stack-overflow on PC will usually hit a guard page, stopping the program before it destroys your debugging context. However, one processor which game developers have to deal with unfortunately has no memory protection at all: the SPUs of the PS3 Cell processor. SPU code can happily overwrite itself, jump to a data address and execute garbage, or overflow its stack.
(2) Faulty RAM on consoles - on my last project we discovered no less than four separate devkits with faulty RAM problems of some sort. They were each exhibiting signs of memory corruption, and after the first one was discovered (it happened to be mine) we wrote a little test program to stress-test the console's RAM with a variety of test patterns. All our other devkits pass the test program, but these four failed its tests. In the first such kit (mine), the symptom was a 1-bit error in a particular offset into a 64 KB page of virtual memory -- different virtual address each time we ran it, but the bottom 16 bits were always the same -- so probably it was the same physical address underneath. I wasted a full day debugging the corruption before we caught on that the bug only happened on my kit, not my neighbor's. The code worked flawlessly on other kits. So we wrote that test program and proved that the 1-bit error had nothing to do with our program, but was just a hardware problem of some kind. After that, the first thing we did when memory corruption was suspected was to run this test program on the affected kit. Over the course of several months, three other devkits failed its tests, so we knew immediately that it was a hardware problem and didn't waste our time trying to debug a non-existent bug. I know bad RAM sounds implausible, but I suppose we use our devkits heavily for years and years, and we don't always turn them off when we're not using them... anyways, for console developers, be aware of the possibility of bad RAM when investigating strange corruption bugs.
(3) Put Magic Numbers in structures to aid debugging: For non-final builds, you can add a 32-bit int field to the start of each class/structure you use, and in the constructor or initialization function, set this to a recognizable ASCII code (which is different for each type of structure). E.g. you might set it initially to 'ANIa' or 'ANId' for animation structures, 'OBJx' or 'CHAR' for objects or characters, etc. This can make it easier to recognize structures in raw memory when you have to debug something like a memory corruption. Also, when the structure is freed you can change the magic value to something else, such as the same value with the case of each character toggled ('aniA', 'objX', etc.). Whenever it accesses these objects, your code can also assert that the magic number has the original value, helping you to catch any dangling pointer errors. This technique has some runtime overhead, but depending on your circumstances it might be worth it.
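The magic number idea can be sketched as follows. All names here (`MakeMagic`, `AnimData`, `InitAnim`, `RetireAnim`, `IsLiveAnim`) are hypothetical, and the sketch uses explicit init and retire functions rather than a constructor and destructor, on the assumption that the objects live in a pool whose memory stays mapped after an object is retired.

```cpp
#include <cassert>
#include <cstdint>

// Pack four ASCII characters into a 32-bit tag. This avoids multi-character
// literals such as 'ANIa', whose value is implementation-defined.
constexpr std::uint32_t MakeMagic(char a, char b, char c, char d)
{
    return (static_cast<std::uint32_t>(static_cast<unsigned char>(a)) << 24) |
           (static_cast<std::uint32_t>(static_cast<unsigned char>(b)) << 16) |
           (static_cast<std::uint32_t>(static_cast<unsigned char>(c)) << 8) |
            static_cast<std::uint32_t>(static_cast<unsigned char>(d));
}

constexpr std::uint32_t kAnimMagic = MakeMagic('A', 'N', 'I', 'a');
constexpr std::uint32_t kCaseBits = 0x20202020u;  // flips ASCII letter case in each byte

struct AnimData
{
    std::uint32_t magic;  // first field, so it leads the structure in a memory dump
    float time;
};

inline void InitAnim(AnimData* anim)
{
    anim->magic = kAnimMagic;
    anim->time = 0.0f;
}

// Call just before the memory goes back to its pool or allocator.
inline void RetireAnim(AnimData* anim)
{
    assert(anim->magic == kAnimMagic);  // catches double frees and stray pointers
    anim->magic ^= kCaseBits;           // 'ANIa' becomes 'aniA'
}

inline bool IsLiveAnim(const AnimData* anim)
{
    return anim->magic == kAnimMagic;
}
```

Code that touches an `AnimData` can then assert `IsLiveAnim` first in non-final builds. On a little-endian machine the four characters will appear reversed in a byte-wise dump, which is worth knowing when scanning memory by eye.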
"we discovered no less than four* separate devkits with faulty RAM problems of some sort."