Gamasutra: The Art & Business of Making Games
Debugging Memory Corruption in Game Development

October 16, 2008
 

[In an in-depth technical article, Neversoft co-founder Mick West discusses memory corruption in games, pinpointing the symptoms, causes, and solutions for game glitches and crashes caused by the tricky problem.]

Definition: Memory corruption is an unexpected change in the contents of a memory location.

The symptoms of memory corruption can range from hard crashes, all the way through minor glitches, to no symptoms at all. The causes of memory corruption are many and varied, and include memory corruption itself. In this article I attempt to classify the various ways in which memory corruption can manifest itself, the various causes, and some ideas for identifying the root causes of various types of memory corruption. I'll cover:

  • Symptoms of Memory Corruption
  • Investigating Corruption
  • Identifying Hex Droppings
  • Causes and Effects of Corruption

Symptoms of Memory Corruption

Given that memory corruption can manifest in almost any way, it seems redundant to list all the symptoms. However, different symptoms of memory corruption are indicative of different causes of corruption. Sometimes we can also gather valuable clues from the type of symptom, which might lead us closer to the cause of the corruption.

Crashes

Crash bugs come in all flavors, but memory corruption can cause just about all of them. The way in which the game crashes can give you valuable clues as to what type of memory corruption is occurring. These clues can indicate where you need to start looking for the cause of the crash.

Address Error

An address error indicates that a pointer has been modified to point to an illegal address. This could be an address that is not word aligned, a NULL address, or an address that points to unmapped or protected memory.

Address errors are quite helpful, since program execution stops when an address error is encountered, and it is quite easy to enter the debugger, and determine the address and corrupted contents of the pointer variable that is being used.

Infinite Loop

Corruption of data can make a loop fail to terminate. Take for example code that traverses a linked list. Memory may be corrupted in such a way that the list contains a loop. Since the code expects the list to terminate with a NULL value at some point, it simply carries on around the loop forever.

This behavior is a lot more likely with a list that uses indexing instead of pointers, but it is still possible with pointers. Consider the implication of memory being corrupted in such a way that a pointer in the list now makes the list circular. It is very unlikely that some random corruption, or some unrelated code, would happen to stick a semi-valid pointer in exactly the right place. Hence, it is more likely that the corruption was caused by something in the list code itself.
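One way to catch a circular list before it hangs the game is a debug-only termination check. The following is a minimal sketch using Floyd's "tortoise and hare" cycle detection; the `ListNode` type and the `list_has_cycle` name are hypothetical and not taken from any particular engine.

```cpp
#include <cstddef>

// Hypothetical singly linked list node, for illustration only.
struct ListNode {
    int       value;
    ListNode* next;
};

// Returns true if following `next` from `head` never reaches a null pointer.
// `fast` advances two nodes per step and `slow` advances one, so in a
// circular list `fast` eventually lands on the same node as `slow`.
bool list_has_cycle(const ListNode* head) {
    const ListNode* slow = head;
    const ListNode* fast = head;
    while (fast != nullptr && fast->next != nullptr) {
        slow = slow->next;
        fast = fast->next->next;
        if (slow == fast) {
            return true;  // the two cursors met, so the list loops
        }
    }
    return false;  // reached the terminating null pointer
}
```

Note that this check only helps when every `next` pointer is still a valid address. A pointer corrupted to a wild address would more likely produce an address error than an infinite loop.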

Illegal Instruction

An illegal instruction could mean one of several kinds of memory corruption.

Stack Corruption - If the stack has been corrupted in some way, this can lead to an incorrect return address, which ends up pointing to illegal code. This is the most common way buffer overruns are exploited by hackers.

Jump Table Corruption - If a v-table (or any kind of table of jump addresses) is corrupted, then the PC can end up pointing at an illegal instruction.

Code Corruption - The code itself can be corrupted by a bad pointer corrupting sections of the code. This type of corruption can be very hard to detect if the code that has been corrupted is not executed very often.

Stack Overwriting Code - A special kind of code corruption. Runaway recursion can sometimes run unchecked until the stack overwrites the routine that it is executing. This shows up nicely in the debugger in a hex window.

Function Pointers - Since function pointers are sometimes stored in data structures, and passed around like regular variables, they can be corrupted just like any other variable. This can eventually lead to the program executing incorrect code.
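As an illustration of the function pointer case, a debug build can surround a stored function pointer with guard words and verify them before the call. This is only a sketch: `Callback`, `kGuard`, `callback_intact`, and `safe_invoke` are hypothetical names, and the guard value is arbitrary.

```cpp
#include <cstdint>

// Arbitrary, easily recognized guard value (an assumption for this sketch).
constexpr std::uint32_t kGuard = 0xC0DEC0DE;

// A stored function pointer bracketed by guard words. An overrun that runs
// sequentially through neighboring memory will damage a guard before or
// after it damages `fn`.
struct Callback {
    std::uint32_t guard_before = kGuard;
    void (*fn)(int) = nullptr;
    std::uint32_t guard_after = kGuard;
};

// True if both guards are untouched and a function has been set.
bool callback_intact(const Callback& cb) {
    return cb.guard_before == kGuard &&
           cb.guard_after == kGuard &&
           cb.fn != nullptr;
}

// Calls the function only if the guards are intact.
// Returns false (without calling) if corruption is suspected.
bool safe_invoke(const Callback& cb, int arg) {
    if (!callback_intact(cb)) {
        return false;
    }
    cb.fn(arg);
    return true;
}
```

Guards like these detect overruns that sweep across the structure. They do not detect a stray write that hits only `fn` itself.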

Unexpected Values

If you have a variable of some kind that normally has a value in a certain range, and you unexpectedly find that the variable contains some ridiculous value, then this may be due to memory corruption.

Wildly unusual values often have noticeable effects, such as the player teleporting to the end of the universe, or a model being scaled infinitely large.

Less severe corruptions can occur, for example, a counter might simply be reset to zero or even just changed slightly. This type of corruption can be difficult to track down, as it may not produce especially noticeable effects.

Here a good testing department is invaluable. If the testers can notice little inconsistencies like this, then you will catch potentially harmful bugs at a much earlier stage.
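Automated plausibility checks can complement human testers for values with a known range. The sketch below assumes a hypothetical `PlayerState` structure and made-up limits (`kWorldExtent`, `kMaxLives`); real limits depend entirely on the game.

```cpp
// Hypothetical game state, for illustration only.
struct PlayerState {
    float x, y, z;
    int   lives;
};

// Assumed limits for this sketch; a real game would pick its own.
constexpr float kWorldExtent = 100000.0f;
constexpr int   kMaxLives    = 99;

// True if v lies in [lo, hi]. Comparisons against NaN are always false,
// so a NaN value also fails this check.
inline bool in_range(float v, float lo, float hi) {
    return v >= lo && v <= hi;
}

// Returns false if any field holds a value that normal play cannot produce.
bool player_state_plausible(const PlayerState& p) {
    return in_range(p.x, -kWorldExtent, kWorldExtent) &&
           in_range(p.y, -kWorldExtent, kWorldExtent) &&
           in_range(p.z, -kWorldExtent, kWorldExtent) &&
           p.lives >= 0 && p.lives <= kMaxLives;
}
```

Calling a check like this once per frame in debug builds narrows down the frame in which a value first went bad, although it cannot catch a value that was corrupted to something still inside the legal range.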

Since the location of memory corruption is often somewhat random, the problem may go undetected for some time. This may give the false impression that the existing code is solid. Upon adding new code or data, the bug may reveal itself, causing you to think that the new code has caused the bug, when in fact the new code has only caused memory to be slightly re-ordered into a configuration that reveals a pre-existing bug.

Glitches in the Graphics

Since memory often contains graphical data, memory corruption may show up as corruption in the graphics. How it manifests depends on the nature of the graphics and the nature of the corruption.

Textures

A change in the color of a single pixel, or of a very short row or column of pixels, indicates that a pointer to a variable has acquired a wrong value, perhaps as the result of earlier memory corruption.

Changes to large swathes of a texture indicate either an incorrect pointer, or some kind of buffer overrun. Corruption that looks like a regular pattern, often containing vertical or diagonal stripes, indicates an array exceeding its bounds, or one that is now at an incorrect address due to a corrupt pointer.

Corruption in a texture that resembles a squashed or discolored version of another texture indicates that you might be overwriting the texture with another one of different dimensions or different bit depth.
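The diagonal stripes and squashed look come from data laid out for one row width being read with another. The toy sketch below shows the effect with one character per pixel; `split_rows` is a hypothetical helper, not an engine function.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Splits a flat pixel buffer into rows of `width` pixels each, the way a
// renderer interprets texture memory. Trailing pixels that do not fill a
// whole row are dropped. One char stands for one pixel.
std::vector<std::string> split_rows(const std::string& pixels,
                                    std::size_t width) {
    std::vector<std::string> rows;
    if (width == 0) {
        return rows;  // avoid an endless loop on a nonsensical width
    }
    for (std::size_t i = 0; i + width <= pixels.size(); i += width) {
        rows.push_back(pixels.substr(i, width));
    }
    return rows;
}
```

For example, the 20-pixel buffer `"X...X...X...X...X..."` is a clean vertical stripe when read as rows of 4 pixels. Read as rows of 5 pixels, the same bytes become `"X...X"`, `"...X."`, `"..X.."`, `".X..."`, which is a diagonal stripe.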

If the corruption is static (unchanging), then it indicates a one-time event, where a pointer was misused just once. The corruption happened, and the game went along on its way. In this case, you need to try to track down what triggered that event. Testers need to try to find a way of duplicating the circumstances that led to the visual corruption. Video of the game is very useful in this case.

If the corruption appears to be animating, if the corrupt section is flickering, or the banded area is flashing on and off, then you have some ongoing corruption. If the game remains in this state, it should make it easier to debug.

Meshes

Corrupt meshes usually result in some vertices being displaced a considerable distance from the model. If the corrupted region is small (a word or so), then you may just see one vertex displaced. This will appear as a thin triangle or line that extends off screen and swings wildly about as the model animates.

Corruption of a large amount of the model's mesh can result in the model “exploding”, covering the entire screen with random looking triangles that flicker and swing around.
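Displaced vertices can also be detected in code rather than by eye. The following sketch counts vertices that fall outside an expected extent around the model origin; `Vec3`, `count_outlier_vertices`, and the `max_extent` parameter are assumptions made for illustration.

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical vertex position in model space.
struct Vec3 {
    float x, y, z;
};

// Counts vertices with any coordinate outside [-max_extent, max_extent].
// A NaN coordinate fails the comparison and is therefore counted too.
std::size_t count_outlier_vertices(const std::vector<Vec3>& verts,
                                   float max_extent) {
    std::size_t outliers = 0;
    for (const Vec3& v : verts) {
        const bool inside = std::fabs(v.x) <= max_extent &&
                            std::fabs(v.y) <= max_extent &&
                            std::fabs(v.z) <= max_extent;
        if (!inside) {
            ++outliers;
        }
    }
    return outliers;
}
```

A single outlier suggests a small, word-sized corruption, while a large count points toward the "exploding" case, where much of the mesh has been overwritten.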

Skeletons and Animation

Corruptions of the underlying skeleton data, or associated animation, can result in the model still looking somewhat recognizable, but with the various body parts being displaced to unusual locations. Corrupt animation will result in body parts flickering and jumping around wildly. The exact manifestation of the symptoms of corruption depends upon the method used to store the animations.



Comments


Cristian Cornea
Great article. I really like the fact that it goes in depth with memory debugging and how to identify patterns in the memory dump.



Great job!

Tom Newman
Great article! I am not even a programmer and I was able to make sense out of this. Very well written.

Roberto Alfonso
As a professional programmer (healthcare industry) I love technical articles, especially if related with video games. I still remember the first time I happened to cross the infamous Relm Sketch bug in Final Fantasy VI (http://en.wikipedia.org/wiki/Characters_of_Final_Fantasy_VI#Relm , third paragraph). Using it correctly, you could get hundreds of items to sell, including Ilumina, the sword that was supposed to be unique and only attainable by converting the Ragnarok esper into sword when asked.



I suggest everyone having these problems to use a memory checking library like Fortify, ElectricFence, or even Valgrind. Personally, I initialize certain memory zones with 0xBEBACAFE ("drink coffee" in Spanish) instead of the famous 0xDEADBEEF.

Gerard Green
We like to initialize memory with 0x7FBADFAD because it is sNaN if interpreted as a floating-point value. Trying to use this value causes an exception, which makes it much easier to track. (Even if on your system it doesn't cause an exception, NaN propagates in normal operations, so it's still easier to track down than 0xDEADBEEF.)
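A minimal sketch of the fill-pattern idea described above, assuming 32-bit IEEE 754 floats. The names `kFillPattern` and `fill_with_pattern` are hypothetical. Whether using the value actually raises an exception depends on how floating-point exceptions are configured on the platform, so the sketch only relies on the NaN propagating.

```cpp
#include <cmath>
#include <cstddef>
#include <cstdint>
#include <cstring>

static_assert(sizeof(float) == sizeof(std::uint32_t),
              "this sketch assumes 32-bit floats");

// Exponent bits all ones with a non-zero mantissa: a NaN when read as a float.
constexpr std::uint32_t kFillPattern = 0x7FBADFAD;

// Fills `word_count` 32-bit words starting at `dst` with the pattern.
// memcpy is used so the bytes are written without going through a float.
void fill_with_pattern(void* dst, std::size_t word_count) {
    unsigned char* bytes = static_cast<unsigned char*>(dst);
    for (std::size_t i = 0; i < word_count; ++i) {
        std::memcpy(bytes + i * sizeof(kFillPattern),
                    &kFillPattern, sizeof(kFillPattern));
    }
}
```

After filling a buffer this way, any float read from it before proper initialization satisfies `std::isnan`, and so does the result of arithmetic on it, which makes uninitialized reads stand out in a debugger or in a plausibility check.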

Ondrej Spanel
While debugging techniques to detect memory corruption are useful, I would like to stress that what is most important is to use coding techniques which reduce the chance of introducing such corruption in the first place. They can be summarized as: avoid low-level constructs wherever high-level constructs do the job fast enough. This includes:



- avoid explicit new / delete / malloc / free

- avoid "raw" pointers



Instead:



- use smart pointers

- use containers



In any case where you are using raw pointers or explicit allocation, make sure you have clearly defined (and documented) the responsibilities (who "owns" the memory, who is responsible for freeing it, and, when there is a raw pointer to memory which it does not own, why you can be sure the pointer will be valid during its existence).
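A small sketch of what this advice can look like in practice, using `std::unique_ptr` and `std::vector` from the C++ standard library. The `Particle` and `ParticleSystem` types are hypothetical examples.

```cpp
#include <cstddef>
#include <memory>
#include <vector>

// Hypothetical game object, for illustration only.
struct Particle {
    float x, y, z;
};

// Ownership rule: ParticleSystem owns every Particle it spawns.
// The container and the smart pointers free them; there is no explicit
// new or delete anywhere in this code.
class ParticleSystem {
public:
    // Returns a non-owning reference. It stays valid until the particle is
    // removed or the system is destroyed, because each Particle lives in its
    // own heap allocation and is not moved when the vector grows.
    Particle& spawn(float x, float y, float z) {
        particles_.push_back(std::make_unique<Particle>(Particle{x, y, z}));
        return *particles_.back();
    }

    std::size_t count() const { return particles_.size(); }

    // Destroys all particles; outstanding references become invalid.
    void clear() { particles_.clear(); }

private:
    std::vector<std::unique_ptr<Particle>> particles_;
};
```

The comments carry the documentation of responsibilities that the paragraph above asks for: who owns the memory, and why a non-owning reference can be trusted for as long as it is.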

Roberto Alfonso
I hate smart pointers, mostly because I am old schooled (I used to program in assembler for most optimization projects) and therefore think they create lazy programmers (just like garbage collectors). As long as you follow a few guidelines, you should have no problem working with them:

- Always initialize pointers with NULL

- Set pointers to NULL immediately after deleting or freeing them

- If a function returns allocated memory, there must be a reciprocal function that receives a pointer and frees it (create_node should have a destroy_node, for example)

- If a function does not allocate memory, it mustn't release it



Different strokes I guess :-)
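The raw-pointer guidelines listed above can be sketched as follows. The `create_node` and `destroy_node` names come from the comment itself; the `Node` layout and the choice to pass the pointer by reference so it can be cleared are assumptions made for this example.

```cpp
#include <cstdlib>

// Hypothetical node type, for illustration only.
struct Node {
    int   value;
    Node* next;
};

// Allocates and initializes a node. Returns a null pointer if the
// allocation fails. The caller must release it with destroy_node.
Node* create_node(int value) {
    Node* node = static_cast<Node*>(std::malloc(sizeof(Node)));
    if (node != nullptr) {
        node->value = value;
        node->next = nullptr;  // pointers always start out as NULL
    }
    return node;
}

// Reciprocal of create_node. Takes the pointer by reference so it can be
// set to NULL immediately after freeing. Calling it on a null pointer is
// harmless, because free(NULL) does nothing.
void destroy_node(Node*& node) {
    std::free(node);
    node = nullptr;
}
```

Clearing the pointer only protects the one variable passed in. Any other copies of the same address elsewhere in the program are still dangling after the call.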

Gopalakrishna Palem
A similar approach that can be used to track object state changes at code level can be found at:



http://blogs.msdn.com/gpalem/archive/2008/06/19/tracking-c-variable-state-changes.aspx



It's an off-line technique for tracking C++ objects. Similar to "break on access" of memory, C++ templates can be used to implement a "report on change" pattern for any member variable value changes. It may not solve all problems, but it can often be useful.
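One possible shape for a "report on change" wrapper is sketched below. It is not taken from the linked post; the `Tracked` class and its `Listener` callback are hypothetical and only illustrate the general pattern.

```cpp
#include <functional>
#include <utility>

// Wraps a value of type T and calls a listener whenever an assignment
// actually changes the stored value. T must support operator==.
template <typename T>
class Tracked {
public:
    using Listener = std::function<void(const T& old_value,
                                        const T& new_value)>;

    Tracked(T initial, Listener listener)
        : value_(std::move(initial)), listener_(std::move(listener)) {}

    // Reports the old and new value before storing the new one.
    // Assigning an equal value is not reported.
    Tracked& operator=(const T& new_value) {
        if (!(value_ == new_value)) {
            if (listener_) {
                listener_(value_, new_value);
            }
            value_ = new_value;
        }
        return *this;
    }

    const T& get() const { return value_; }

private:
    T        value_;
    Listener listener_;
};
```

A wrapper like this only sees changes made through its own assignment operator. A stray write through a bad pointer bypasses it, so for true corruption it is a complement to hardware data breakpoints rather than a replacement.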

Wylie Garvin
A good article, for sure. There are a couple of things I want to add.



(1) Code Corruption, Jump Table Corruption, and Stack Overwriting Code are prevented by the memory protection hardware on most modern platforms. For PC and current consoles, it is typical for code space to be read-only, so the program will crash immediately if it tries to modify its own code. Most compilers also put v-tables and other jump tables (e.g. from switch statements) into a read-only code segment. You can also expect non-code to be non-executable on modern platforms, so wild jumps or overwritten return addresses have a good chance of being stopped dead by the memory protection hardware if they point into the data segment or stack. As the article mentions, stack-overflow on PC will usually hit a guard page, stopping the program before it destroys your debugging context. However, one processor which game developers have to deal with unfortunately has no memory protection at all: the SPUs of the PS3 Cell processor. SPU code can happily overwrite itself, jump to a data address and execute garbage, or overflow its stack.



(2) Faulty RAM on consoles - on my last project we discovered no less than four separate devkits with faulty RAM problems of some sort. They were each exhibiting signs of memory corruption, and after the first one was discovered (it happened to be mine) we wrote a little test program to stress-test the console's RAM with a variety of test patterns. All our other devkits passed the test program, but these four failed its tests. In the first such kit (mine), the symptom was a 1-bit error in a particular offset into a 64 KB page of virtual memory -- different virtual address each time we ran it, but the bottom 16 bits were always the same -- so probably it was the same physical address underneath. I wasted a full day debugging the corruption before we caught on that the bug only happened on my kit, not my neighbor's. The code worked flawlessly on other kits. So we wrote that test program and proved that the 1-bit error had nothing to do with our program, but was just a hardware problem of some kind. After that, the first thing we did when memory corruption was suspected, was to run this test program on the affected kit. Over the course of several months, three other devkits failed its tests, so we knew immediately that it was a hardware problem and didn't waste our time trying to debug a non-existent bug. I know bad RAM sounds implausible, but I suppose we use our devkits heavily for years and years, and we don't always turn them off when we're not using them... anyways, for console developers, be aware of the possibility of bad RAM when investigating strange corruption bugs.



(3) Put Magic Numbers in structures to aid debugging: For non-final builds, you can add a 32-bit int field to the start of each class/structure you use, and in the constructor or initialization function, set this to a recognizable ASCII code (which is different for each type of structure). E.g. you might set it initially to 'ANIa' or 'ANId' for animation structures, 'OBJx' or 'CHAR' for objects or characters, etc. This can make it easier to recognize structures in raw memory when you have to debug something like a memory corruption. Also, when the structure is freed you can change the magic value to something else, such as the same value with the case of each character toggled ('aniA', 'objX', etc.). Whenever it accesses these objects, your code can also assert that the magic number has the original value, helping you to catch any dangling pointer errors. This technique has some runtime overhead, but depending on your circumstances it might be worth it.
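The magic-number technique in point (3) might look like the sketch below. The tag `'ANIa'` and its case-toggled form come from the comment; `make_tag`, `toggle_case`, `Animation`, `is_live`, and `mark_freed` are hypothetical names, and the explicit `mark_freed` call stands in for whatever the engine's own free routine does.

```cpp
#include <cstdint>

// Packs four ASCII characters into one 32-bit tag, first character in the
// most significant byte.
constexpr std::uint32_t make_tag(char a, char b, char c, char d) {
    return (static_cast<std::uint32_t>(static_cast<unsigned char>(a)) << 24) |
           (static_cast<std::uint32_t>(static_cast<unsigned char>(b)) << 16) |
           (static_cast<std::uint32_t>(static_cast<unsigned char>(c)) << 8) |
            static_cast<std::uint32_t>(static_cast<unsigned char>(d));
}

// Flips upper and lower case of every character in a tag. ASCII letters
// differ in case by bit 0x20, so this is only meaningful for letter tags.
constexpr std::uint32_t toggle_case(std::uint32_t tag) {
    return tag ^ 0x20202020u;
}

constexpr std::uint32_t kAnimLive = make_tag('A', 'N', 'I', 'a');
constexpr std::uint32_t kAnimDead = toggle_case(kAnimLive);  // 'aniA'

// Hypothetical structure with the magic number as its first field.
struct Animation {
    std::uint32_t magic = kAnimLive;
    float         duration = 0.0f;
};

// True if the structure carries the tag of a live Animation.
bool is_live(const Animation& anim) { return anim.magic == kAnimLive; }

// To be called by the allocator when the structure is released.
void mark_freed(Animation& anim) { anim.magic = kAnimDead; }
```

In non-final builds, code that receives an `Animation` can assert `is_live` before touching it. The byte order in which the characters appear in a hex dump depends on the endianness of the platform.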

Wylie Garvin
Oops.. no edit button?



"we discovered no less than four* separate devkits with faulty RAM problems of some sort."

