[In an in-depth technical article, Neversoft co-founder Mick West discusses memory corruption in games, pinpointing the symptoms, causes, and solutions for game glitches and crashes caused by the tricky problem.]
Definition: Memory corruption is an unexpected change in the contents of a memory location.
The symptoms of memory corruption can range from hard crashes, all the way through minor glitches, to no symptoms at all. The causes of memory corruption are many and varied, and include memory corruption itself. In this article I attempt to classify the various ways in which memory corruption can manifest itself, the various causes, and some ideas for identifying the root causes of various types of memory corruption.
Given that memory corruption can manifest in almost any way, it seems redundant to list all the symptoms. However, different symptoms of memory corruption are indicative of different causes of corruption. Sometimes we can also gather valuable clues from the type of symptom, which might lead us closer to the cause of the corruption.
Crash bugs come in all flavors, but memory corruption can cause just about all of them. The way in which the game crashes can give you valuable clues as to what type of memory corruption is occurring. These clues can indicate where you need to start looking for the cause of the crash.
An address error indicates that a pointer has been modified to point to an illegal address. This could be an address that is not word aligned, is NULL, or points to unmapped or protected memory.
Address errors are quite helpful, since program execution stops when an address error is encountered, and it is quite easy to enter the debugger, and determine the address and corrupted contents of the pointer variable that is being used.
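As a minimal sketch of the checks described above, the following function classifies a suspect pointer value. The names PointerProblem and ClassifyPointer are hypothetical, invented for this illustration. It only tests for NULL and misalignment; whether an address is unmapped or protected cannot be determined portably from the pointer value alone, so that case is deliberately left out. It also assumes a flat address space in which the integer value of a pointer reflects its alignment, which holds on mainstream platforms but is not guaranteed by the language.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical classification of a suspect pointer value.
enum class PointerProblem { None, Null, Misaligned };

// Checks only what can be tested portably: NULL and alignment.
// Assumption: the integer value of the pointer reflects its alignment.
inline PointerProblem ClassifyPointer(const void* p, std::size_t alignment) {
    assert(alignment != 0 && "alignment must be non-zero");
    if (p == nullptr) {
        return PointerProblem::Null;
    }
    const auto address = reinterpret_cast<std::uintptr_t>(p);
    if (address % alignment != 0) {
        return PointerProblem::Misaligned;
    }
    return PointerProblem::None;
}
```

A check like this is most useful in debug builds, placed just before a pointer is dereferenced, so that the problem is reported with context rather than as a bare hardware exception.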
Corruption of data can make a loop fail to terminate. Take for example code that traverses a linked list. Memory may be corrupted in such a way that the list contains a loop. Since the code expects the list to terminate with a NULL value at some point, it simply carries on around the loop forever.
This behavior is a lot more likely with a list that uses indexing instead of pointers, but it is still possible with pointers. Consider what it implies when a pointer in a list is corrupted in such a way that the list becomes circular. It is very unlikely that random corruption, or some unrelated code, would happen to put a semi-valid pointer in exactly the right place. Hence, it is more likely that the corruption was caused by something in the list code itself.
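One way to confirm that a list has become circular, rather than simply very long, is to walk it with two pointers moving at different speeds (Floyd's cycle detection). This is a sketch under assumptions: the Node layout and the name ListHasCycle are hypothetical and stand in for whatever list structure the game actually uses.

```cpp
// Hypothetical singly linked list node, terminated by a null next pointer.
struct Node {
    int value;
    Node* next;
};

// Floyd's cycle detection: 'fast' advances two nodes per step and 'slow' one.
// If the list contains a loop, 'fast' eventually catches up with 'slow'.
// If the list terminates, 'fast' reaches the null terminator first.
inline bool ListHasCycle(const Node* head) {
    const Node* slow = head;
    const Node* fast = head;
    while (fast != nullptr && fast->next != nullptr) {
        slow = slow->next;
        fast = fast->next->next;
        if (slow == fast) {
            return true;
        }
    }
    return false;
}
```

The check needs no extra memory, so it can be run as a debug-only validation after every list modification to catch the moment the list becomes circular. Note that it assumes every next pointer it follows is either null or valid; a wildly corrupt pointer will still fault.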
An illegal instruction could mean one of several kinds of memory corruption.
Stack Corruption - If the stack has been corrupted in some way, this can lead to an incorrect return address, which ends up pointing to illegal code. This is the most common way buffer overruns are exploited by hackers.
Jump Table Corruption - If a v-table (or any kind of table of jump addresses) is corrupted, then the PC can end up pointing at an illegal instruction.
Code Corruption - The code itself can be corrupted by a bad pointer corrupting sections of the code. This type of corruption can be very hard to detect if the code that has been corrupted is not executed very often.
Stack Overwriting Code - A special kind of code corruption. Runaway recursion can sometimes run unchecked until the stack overwrites the routine that it is executing. This shows up nicely in the debugger in a hex window.
Function Pointers - Since function pointers are sometimes stored in data structures and passed around like regular variables, they can be corrupted just like any other variable. This can eventually lead to the program executing incorrect code.
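For tables of jump addresses or function pointers that are not supposed to change after initialization, one option is to store a checksum alongside the table and verify it before dispatching. The sketch below is illustrative only: GuardedTable, Seal, IsIntact and the two handlers are hypothetical names, and the checksum is a plain 32-bit FNV-1a hash chosen for brevity, not a recommendation of any particular algorithm.

```cpp
#include <cstddef>
#include <cstdint>

// 32-bit FNV-1a hash over a block of bytes.
inline std::uint32_t Fnv1a(const void* data, std::size_t size) {
    const auto* bytes = static_cast<const unsigned char*>(data);
    std::uint32_t hash = 2166136261u;  // FNV offset basis
    for (std::size_t i = 0; i < size; ++i) {
        hash ^= bytes[i];
        hash *= 16777619u;  // FNV prime
    }
    return hash;
}

// Hypothetical handlers and table, for illustration only.
using Handler = int (*)(int);
inline int Double(int x) { return x * 2; }
inline int Negate(int x) { return -x; }

struct GuardedTable {
    Handler handlers[2];
    std::uint32_t checksum;
};

// Record the checksum once the table has been filled in.
inline void Seal(GuardedTable& table) {
    table.checksum = Fnv1a(table.handlers, sizeof table.handlers);
}

// Returns false if the table bytes no longer match the recorded checksum.
inline bool IsIntact(const GuardedTable& table) {
    return Fnv1a(table.handlers, sizeof table.handlers) == table.checksum;
}
```

Calling IsIntact before each dispatch, or once per frame, turns an eventual illegal instruction into an earlier and more specific report that the table has been overwritten. The same idea can be applied to any region that should be constant, including code, where the platform allows reading it.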
If you have a variable of some kind that normally has a value in a certain range, and you unexpectedly find that the variable contains some ridiculous value, then this may be due to memory corruption.
Wildly unusual values often have noticeable effects, such as the player teleporting to the end of the universe, or a model being scaled infinitely large.
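A cheap way to catch wildly unusual values close to where they appear is a range check on variables whose legal range is known. The following is a minimal sketch; WorldLimits and IsPlausibleCoordinate are hypothetical names, and the limits themselves are an assumption that depends entirely on the game.

```cpp
#include <cassert>
#include <cmath>

// Hypothetical limits for a coordinate in world space.
struct WorldLimits {
    float minCoord;
    float maxCoord;
};

// Returns true if the value is finite and lies inside the expected range.
// NaN and infinity are rejected explicitly, since corrupted floats are
// often not valid numbers at all.
inline bool IsPlausibleCoordinate(float value, const WorldLimits& limits) {
    assert(limits.minCoord <= limits.maxCoord);
    return std::isfinite(value) &&
           value >= limits.minCoord &&
           value <= limits.maxCoord;
}
```

Used in an assertion each time the player position is updated, for example, a check like this reports the bad value on the frame it first appears instead of several frames later when the player is already off the map.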
Less severe corruptions can occur, for example, a counter might simply be reset to zero or even just changed slightly. This type of corruption can be difficult to track down, as it may not produce especially noticeable effects.
Here a good testing department is invaluable. If the testers can notice little inconsistencies like this, then you will catch potentially harmful bugs at a much earlier stage.
Since the location of the corruption of memory is often somewhat random, the problem may go undetected for some time. This may give the false impression that the existing code is solid. Upon adding new code or data, the bug may reveal itself, causing you to think that the new code has caused the bug, when in fact the new code has only caused memory to be slightly re-ordered into a configuration that reveals a pre-existing bug.
Since memory often contains graphical data, memory corruption may show up as corruption in graphics. The way this manifests will depend on the nature of the graphics and the nature of the corruption.
Changes in color of a single pixel, or a very short row or column of pixels, indicate that a pointer to a variable has acquired a wrong value, perhaps as the result of earlier memory corruption.
Changes to large swathes of a texture indicate either an incorrect pointer, or some kind of buffer overrun. Corruption that looks like a regular pattern, often containing vertical or diagonal stripes, indicates some kind of array exceeding its bounds, or one that is now at an incorrect address due to a corrupt pointer.
Corruption in a texture that resembles a squashed or discolored version of another texture indicates that you might be overwriting the texture with another one of different dimensions or different bit depth.
If the corruption is static (unchanging), then it indicates a one-time event, where a pointer was misused just once. The corruption happened, and the game went along on its way. In this case, you need to try to track down what triggered that event. Testers need to try to find a way of duplicating the circumstances that led to the visual corruption. Video of the game is very useful in this case.
If the corruption appears to be animating, if the corrupt section is flickering, or the banded area is flashing on and off, then you have some ongoing corruption. If the game remains in this state, it should make it easier to debug.
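The distinction between one-time and ongoing corruption can also be measured rather than judged by eye. The sketch below assumes a texture (or any buffer) whose contents are supposed to stay constant, and compares it against the last observed contents once per frame. CorruptionWatch and Observe are hypothetical names, and copying the whole buffer is only practical for a small region under investigation.

```cpp
#include <vector>

// Tracks unexpected changes to a buffer that is supposed to be constant.
struct CorruptionWatch {
    std::vector<unsigned char> snapshot;  // last observed contents
    int changesSeen = 0;                  // how many frames showed a change
    int lastChangeFrame = -1;             // frame of the most recent change
};

// Call once per frame with the live buffer contents.
// One recorded change followed by silence suggests a one-time event;
// a steadily rising count suggests ongoing corruption.
inline void Observe(CorruptionWatch& watch,
                    const std::vector<unsigned char>& live,
                    int frame) {
    if (live != watch.snapshot) {
        watch.snapshot = live;
        ++watch.changesSeen;
        watch.lastChangeFrame = frame;
    }
}
```

To use it, set snapshot to a known-good copy of the data when it is loaded. The recorded frame number narrows down when a one-time corruption happened, which can then be matched against video of the play session.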
Corrupt meshes usually result in some vertices being displaced a considerable distance from the model. If the corrupted region is small (a word or so), then you may just see one vertex displaced. This will appear as a thin triangle or line that extends off screen and swings wildly about as the model animates.
Corruption of a large amount of the model's mesh can result in the model “exploding”, covering the entire screen with random looking triangles that flicker and swing around.
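Displaced vertices can be found programmatically by checking each vertex against the size the model is expected to have. This is a sketch only: Vec3 and FindSuspectVertices are hypothetical names, and the radius is an assumption that would normally come from the model's known bounds.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

struct Vec3 {
    float x, y, z;
};

// Returns the indices of vertices that are not within maxRadius of the
// model origin. This includes vertices whose coordinates are NaN or so
// large that the squared distance overflows to infinity.
inline std::vector<std::size_t> FindSuspectVertices(
        const std::vector<Vec3>& vertices, float maxRadius) {
    assert(maxRadius >= 0.0f);
    std::vector<std::size_t> suspects;
    const float maxDistSq = maxRadius * maxRadius;
    for (std::size_t i = 0; i < vertices.size(); ++i) {
        const Vec3& v = vertices[i];
        const float distSq = v.x * v.x + v.y * v.y + v.z * v.z;
        // Written as "not within" so that NaN, which fails every
        // comparison, is reported as suspect.
        if (!(distSq <= maxDistSq)) {
            suspects.push_back(i);
        }
    }
    return suspects;
}
```

The number and spacing of the reported indices is itself a clue. A single suspect index points to a small corrupted region, while a long contiguous run suggests that a large block of the mesh was overwritten.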
Corruptions of the underlying skeleton data, or associated animation, can result in the model still looking somewhat recognizable, but with the various body parts being displaced to unusual locations. Corrupt animation will result in body parts flickering and jumping around wildly. The exact manifestation of the symptoms of corruption depends upon the method used to store the animations.