|
Try tracing through the code, stepping over functions until you find one that causes the corruption. Then, next time around, step into that function.
If you don't have this debugger functionality, then you can still roll your own by writing a little function that checks that memory location. You can then sprinkle calls to this function through your code, narrowing down the region of code that causes the error. If the memory location's address is dependent on the code, you may need to compile the code, note the new address, and then re-compile with the new address wired into the code.
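A roll-your-own check along these lines might look like the following. This is only a sketch: the names `WatchLocation`, `CheckLocation` and `CHECK_LOCATION` are made up for illustration, and it assumes the suspect location is a 32-bit word whose good value can be sampled once. If you only know the address from a previous run, you would pass it in as a cast constant instead.

```cpp
#include <cassert>
#include <cstdint>
#include <cstdio>

// Hypothetical watch state: the suspect address and the value it should hold.
static const volatile std::uint32_t* g_watchAddress = nullptr;
static std::uint32_t g_expectedValue = 0;

// Call once, while the location is known to be good.
void WatchLocation(const volatile std::uint32_t* address)
{
    assert(address != nullptr);
    g_watchAddress = address;
    g_expectedValue = *address;
}

// Returns true while the watched word is intact. On failure it reports the
// call site, so the damage happened between the last passing check and this one.
bool CheckLocation(const char* file, int line)
{
    if (g_watchAddress == nullptr)
        return true;                        // nothing is being watched yet
    const std::uint32_t actual = *g_watchAddress;
    if (actual == g_expectedValue)
        return true;
    std::fprintf(stderr, "%s(%d): word at %p is %08X, expected %08X\n",
                 file, line, (const void*)g_watchAddress,
                 static_cast<unsigned>(actual),
                 static_cast<unsigned>(g_expectedValue));
    return false;
}

#define CHECK_LOCATION() CheckLocation(__FILE__, __LINE__)
```

Sprinkle `CHECK_LOCATION();` between suspect calls, then move the calls closer together on each run until only the offending code sits between a passing check and a failing one.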
Another manual method is to keep track of allocations that include that particular location. Memory corruption is often due to a dangling pointer that was once legal. So if you know the location of corruption in advance, then having a list of the callstacks of all the allocations that once owned that location can help quickly identify the culprit.
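As a rough sketch of that bookkeeping, the hook below records every allocation that covers one suspect address. Everything here is hypothetical: `NoteAllocation` is meant to be called from your own allocator after each successful allocation, and a file and line pair stands in for a real callstack, since capturing one is platform-specific.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

struct AllocationRecord
{
    const char*    file;   // call site, standing in for a full callstack
    int            line;
    std::uintptr_t begin;  // first byte of the block
    std::size_t    size;   // size of the block in bytes
};

static std::uintptr_t g_suspectAddress = 0;      // the location that gets corrupted
static std::vector<AllocationRecord> g_owners;   // every block that has covered it

// Call from the allocator after each successful allocation.
void NoteAllocation(const void* block, std::size_t size, const char* file, int line)
{
    assert(block != nullptr);
    const std::uintptr_t begin = reinterpret_cast<std::uintptr_t>(block);
    // Record the block only if the suspect address lies in [begin, begin + size).
    if (g_suspectAddress >= begin && g_suspectAddress - begin < size)
        g_owners.push_back({file, line, begin, size});
}
```

When the corruption shows up, `g_owners` lists the past owners of that address in order. A previous owner that still holds a pointer to its freed block is a prime suspect.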
Identifying Hex Droppings
A memory location has been corrupted. Assuming you cannot quickly find what bit of code is responsible for the corruption, you can learn a lot about what that piece of code might be by examining the nature of the corruption.
Once you have identified the location that has been corrupted, then look at a hex dump of it in the debugger (or print out your own if a debugger is not available). A hex dump looks something like this.
The memory address is on the left, then comes the contents of memory, here listed eight bytes to a row, and then those eight bytes are repeated as ASCII characters.
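If no debugger is available, a dump in that layout can be produced with a few lines of code. The function below is a minimal sketch. `HexDump` is an illustrative name, and the exact text printed for the address by `%p` varies between platforms.

```cpp
#include <cassert>
#include <cctype>
#include <cstddef>
#include <cstdio>
#include <string>

// Formats `size` bytes starting at `data` as rows of:
// address, up to eight hex bytes, then the same bytes as ASCII.
std::string HexDump(const void* data, std::size_t size)
{
    assert(data != nullptr || size == 0);
    const unsigned char* bytes = static_cast<const unsigned char*>(data);
    std::string out;
    char field[32];
    for (std::size_t row = 0; row < size; row += 8)
    {
        std::snprintf(field, sizeof field, "%p: ",
                      static_cast<const void*>(bytes + row));
        out += field;
        for (std::size_t i = 0; i < 8; ++i)
        {
            if (row + i < size)
                std::snprintf(field, sizeof field, "%02X ",
                              static_cast<unsigned>(bytes[row + i]));
            else
                std::snprintf(field, sizeof field, "   ");  // pad a short last row
            out += field;
        }
        out += ' ';
        // Non-printable bytes are shown as '.' in the ASCII column.
        for (std::size_t i = 0; i < 8 && row + i < size; ++i)
            out += std::isprint(bytes[row + i]) ? static_cast<char>(bytes[row + i]) : '.';
        out += '\n';
    }
    return out;
}
```

Returning a string rather than printing directly lets you send the dump to whatever output your platform has, such as a log file or a debug console.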
Single Bit Corruption
Few pieces of code will cause only a single bit to be flipped. The most likely candidate is a bit-field of flags.
Single Byte Corruption
If only one byte was modified, then that can narrow down the fields considerably. If the corrupt value is 0 or 1, then perhaps it is a byte flag.
Single 32-bit Word Corruption
A 32-bit word is often the fastest and most convenient way of storing data. It is the only way for certain data types such as floats or pointers (depending on your platform). Looking at the contents of the 32 bits will tell you something about the code that inserted that value there.
If you know that a 32-bit value is being corrupted, then you should view the memory location as a single 32-bit word, rather than as a sequence of four bytes. This removes any confusion with endianness, and makes the type of data much easier to recognize.
That said, it is also a useful skill to be able to recognize certain types of data as a byte stream, since the data may be intermingled (in a class) with other data of varying types. In the examples below we give the values both as a 32-bit integer and in a four-byte little-endian format, which is harder to recognize than big-endian, since big-endian is just the word with the bytes spread out.
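When all you have is a byte-wise dump, a small helper can reassemble the word for you. This is a sketch with an illustrative name. It builds the value with shifts, so the result does not depend on the endianness of the machine running it.

```cpp
#include <cassert>
#include <cstdint>

// Assembles four bytes, stored least-significant first (little-endian),
// into the 32-bit word they represent.
std::uint32_t WordFromLittleEndian(const unsigned char* bytes)
{
    assert(bytes != nullptr);
    return  static_cast<std::uint32_t>(bytes[0])
         | (static_cast<std::uint32_t>(bytes[1]) << 8)
         | (static_cast<std::uint32_t>(bytes[2]) << 16)
         | (static_cast<std::uint32_t>(bytes[3]) << 24);
}
```

For example, the byte sequence 00 00 80 3F reassembles to 3F800000.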
Zero
Example:
00000000 or 00 00 00 00
Zero is easy to recognize. At first, you might not think there is much information in a zero, but consider the limited number of reasons a piece of code could be writing a single zero to a location in memory, and it may give some clue as to what piece of code might be responsible.
Zero is:
NULL - Perhaps the errant code is clearing a pointer? Some programmers make a habit of clearing any pointer that is a member variable after they have deleted whatever it was pointing to (a reasonable practice to help prevent dangling pointers). However, they might be doing it at the wrong time.
Zero. Both as an integer (0) and as a floating point (0.0f). Where in the code are individual values set to zero?
FALSE. Perhaps the code is treating the location as a flag, and simply setting it to FALSE.
The first value in an enum - Perhaps a type field, or a status field. What kinds of enumerations do you have in your suspect code? What does the first entry mean? What causes the code to write out the first value?
Clear and empty - Often data structures are initialized to zero. Does this happen anywhere in the code? Does the size of the data being cleared match the zeros in the corruption?
One
Example:
00000001 or 01 00 00 00
One is also easy to recognize. Less common than zero, it can still tell you something about the code that wrote it there.
One is:
TRUE - Perhaps it is being used as a flag. What could be set to TRUE?
An integer - Hence it's not a floating point number. You can discount code that stores floats.
Not a pointer - Odds are that the code causing the corruption is not thinking that it is storing a pointer, unless it is a secondary bug.
The first value of an enum - like any small number, it's possible it is an enumerated value, possibly a type number.
Floating Point Numbers
Example:
3F800000 or 00 00 80 3F
Many floating point numbers have an easily recognizable format. A very common floating point value is the one shown above, 3F800000 is the hex representation of the 32-bit floating point value of 1.0. See Table 24.X for additional values.

Table 24.X
Notice how the small values start with a 3. A floating point number has the first bit being the sign, the next eight bits being the exponent, and the following 23 bits being the fractional part. Since numbers in the same range tend to have similar exponents, you can often recognize a group of floating point numbers of similar magnitude.
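The layout just described can be checked with a few lines of code. The sketch below assumes the platform uses the 32-bit IEEE 754 format for `float`, which is the norm but not guaranteed by the language. The function and struct names are illustrative.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>

static_assert(sizeof(float) == sizeof(std::uint32_t), "sketch assumes a 32-bit float");

// Copies the raw bits out of a float without breaking aliasing rules.
std::uint32_t BitsFromFloat(float value)
{
    std::uint32_t bits;
    std::memcpy(&bits, &value, sizeof bits);
    return bits;
}

struct FloatFields
{
    std::uint32_t sign;      // 1 bit
    std::uint32_t exponent;  // 8 bits, biased by 127
    std::uint32_t fraction;  // 23 bits
};

// Splits a 32-bit pattern into the three fields described above.
FloatFields SplitFloatBits(std::uint32_t bits)
{
    return { bits >> 31, (bits >> 23) & 0xFFu, bits & 0x007FFFFFu };
}

// Spot checks against values mentioned in the text.
void CheckKnownFloats()
{
    assert(BitsFromFloat(1.0f)   == 0x3F800000u);
    assert(BitsFromFloat(2.0f)   == 0x40000000u);
    assert(BitsFromFloat(0.5f)   == 0x3F000000u);
    assert(BitsFromFloat(100.0f) == 0x42C80000u);
    assert(BitsFromFloat(-1.0f)  == 0xBF800000u);
}
```

Positive values below 2.0 have a biased exponent of 127 or less, which puts a 3 (or smaller) in the top hex digit. Setting the sign bit adds 8 to that digit, so -1.0 becomes BF800000.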
In games, a very common range for floating point values is from -1.0 to +1.0. These numbers are used extensively in unit vectors, transformation matrices, UV coordinates and scaling factors. Numbers in this range usually start with a 3 (for positive numbers), or a B (for negative numbers).
If you suspect it is a floating point number, you can sometimes tell whether it is an original value (hard-wired in the code) or a value arrived at by calculation. Consider the numbers above. The values 1.0, 2.0, 0.5 and 100.0 all have trailing zeros in their hex representation. The value 0.33333334 also has a sequence of AAAAAA in it.
By contrast, the less rational number 3.14159274 has what seems to be a random string of hex digits. We can see the degree of entropy in the hex number matches that in the floating point number.
So, a floating point number that has been the subject of some computations is much more likely to have random-looking hex digits. Hence, you can tell whether you are looking for code from an update function, such as:
p->m_speed = sqrtf(p->m_speed * p->m_speed - 2.0f * g * h);
or from an initialization function, such as:
p->m_speed = 2.5f;
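One crude way to put a number on "random looking" is to count the trailing zero bits in the 23-bit fraction. This is a heuristic sketch of my own rather than an established test, and it assumes IEEE 754 32-bit floats. It takes the raw word, since that is what you read from a hex dump.

```cpp
#include <cassert>
#include <cstdint>

// Counts trailing zero bits in the 23-bit fraction of a 32-bit float pattern.
// Hard-wired constants tend to score high; computed values tend to score low.
int TrailingZeroFractionBits(std::uint32_t bits)
{
    std::uint32_t fraction = bits & 0x007FFFFFu;
    if (fraction == 0)
        return 23;                 // the whole fraction is zero
    int count = 0;
    while ((fraction & 1u) == 0)
    {
        fraction >>= 1;
        ++count;
    }
    return count;
}

// Spot checks: 1.0, 2.5 and the float nearest to pi.
void CheckTrailingZeros()
{
    assert(TrailingZeroFractionBits(0x3F800000u) == 23);  // 1.0
    assert(TrailingZeroFractionBits(0x40200000u) == 21);  // 2.5
    assert(TrailingZeroFractionBits(0x40490FDBu) == 0);   // 3.14159274
}
```

A low score does not prove the value was computed. A constant such as one third has a repeating fraction rather than trailing zeros, so treat the count as a hint only.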
|
Great job!
I suggest that everyone having these problems use a memory checking library like Fortify, ElectricFence, or even Valgrind. Personally, I initialize certain memory zones with 0xBEBACAFE ("drink coffee" in Spanish) instead of the famous 0xDEADBEEF.
- avoid explicit new / delete / malloc / free
- avoid "raw" pointers
Instead:
- use smart pointers
- use containers
In any case where you are using raw pointers or explicit allocation, make sure you have clearly defined (and documented) the responsibilities: who "owns" the memory, who is responsible for freeing it, and, when there is a raw pointer to memory which it does not own, why you can be sure the pointer will be valid during its existence.
- Always initialize pointers with NULL
- Set pointers to NULL immediately after deleting or freeing them
- If a function returns allocated memory, there must be a reciprocal function that receives a pointer and frees it (create_node should have a destroy_node, for example)
- If a function does not allocate memory, it mustn't release it
Different strokes I guess :-)
A similar approach that can be used to track object state changes at code level can be found at:
http://blogs.msdn.com/gpalem/archive/2008/06/19/tracking-c-variable-state-changes.aspx
It's an off-line technique for tracking C++ objects. Similar to "break on access" for memory, C++ templates can be used to implement a "report on change" pattern for any member variable value changes. It may not solve all problems, but it can often be useful.
(1) Code Corruption, Jump Table Corruption, and Stack Overwriting Code are prevented by the memory protection hardware on most modern platforms. For PC and current consoles, it is typical for code space to be read-only, so the program will crash immediately if it tries to modify its own code. Most compilers also put v-tables and other jump tables (e.g. from switch statements) into a read-only code segment. You can also expect non-code to be non-executable on modern platforms, so wild jumps or overwritten return addresses have a good chance of being stopped dead by the memory protection hardware if they point into the data segment or stack. As the article mentions, stack-overflow on PC will usually hit a guard page, stopping the program before it destroys your debugging context. However, one processor which game developers have to deal with unfortunately has no memory protection at all: the SPUs of the PS3 Cell processor. SPU code can happily overwrite itself, jump to a data address and execute garbage, or overflow its stack.
(2) Faulty RAM on consoles - on my last project we discovered no less than four separate devkits with faulty RAM problems of some sort. They were each exhibiting signs of memory corruption, and after the first one was discovered (it happened to be mine) we wrote a little test program to stress-test the console's RAM with a variety of test patterns. All our other devkits passed the test program, but these four failed its tests. In the first such kit (mine), the symptom was a 1-bit error at a particular offset into a 64 KB page of virtual memory -- different virtual address each time we ran it, but the bottom 16 bits were always the same -- so probably it was the same physical address underneath. I wasted a full day debugging the corruption before we caught on that the bug only happened on my kit, not my neighbor's. The code worked flawlessly on other kits. So we wrote that test program and proved that the 1-bit error had nothing to do with our program, but was just a hardware problem of some kind. After that, the first thing we did when memory corruption was suspected was to run this test program on the affected kit. Over the course of several months, three other devkits failed its tests, so we knew immediately that it was a hardware problem and didn't waste our time trying to debug a non-existent bug. I know bad RAM sounds implausible, but I suppose we use our devkits heavily for years and years, and we don't always turn them off when we're not using them... anyways, for console developers, be aware of the possibility of bad RAM when investigating strange corruption bugs.
(3) Put Magic Numbers in structures to aid debugging: For non-final builds, you can add a 32-bit int field to the start of each class/structure you use, and in the constructor or initialization function, set this to a recognizable ASCII code (which is different for each type of structure). E.g. you might set it initially to 'ANIa' or 'ANId' for animation structures, 'OBJx' or 'CHAR' for objects or characters, etc. This can make it easier to recognize structures in raw memory when you have to debug something like a memory corruption. Also, when the structure is freed you can change the magic value to something else, such as the same value with the case of each character toggled ('aniA', 'objX', etc.). Whenever it accesses these objects, your code can also assert that the magic number has the original value, helping you to catch any dangling pointer errors. This technique has some runtime overhead, but depending on your circumstances it might be worth it.
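A minimal sketch of the magic number idea follows. The `Animation` struct and every name in it are invented for illustration. The explicit `MarkFreed()` call is a choice made for this sketch, because a plain store placed in a destructor can be optimized away by the compiler.

```cpp
#include <cassert>
#include <cstdint>

// Packs four characters so they read in order in a byte-wise dump on a
// little-endian machine (first character in the lowest byte).
constexpr std::uint32_t MakeMagic(char a, char b, char c, char d)
{
    return  static_cast<std::uint32_t>(static_cast<unsigned char>(a))
         | (static_cast<std::uint32_t>(static_cast<unsigned char>(b)) << 8)
         | (static_cast<std::uint32_t>(static_cast<unsigned char>(c)) << 16)
         | (static_cast<std::uint32_t>(static_cast<unsigned char>(d)) << 24);
}

// Flipping bit 5 of each byte toggles the case of an ASCII letter.
constexpr std::uint32_t kCaseToggle = 0x20202020u;
constexpr std::uint32_t kAnimationMagic = MakeMagic('A', 'N', 'I', 'a');

struct Animation                                 // hypothetical structure
{
    std::uint32_t m_magic = kAnimationMagic;     // first field, so it leads the dump
    float         m_time  = 0.0f;

    // Call from whatever routine frees the object: 'ANIa' becomes 'aniA'.
    void MarkFreed() { m_magic = kAnimationMagic ^ kCaseToggle; }

    bool IsAlive() const { return m_magic == kAnimationMagic; }

    // Call at the top of any function that touches the object.
    void AssertAlive() const { assert(IsAlive()); }
};
```

In a final build the field and the checks would be compiled out, for example behind a debug-only preprocessor symbol.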
"we discovered no less than four* separate devkits with faulty RAM problems of some sort."