Gamasutra: The Art & Business of Making Gamesspacer
View All     RSS
September 22, 2014
arrowPress Releases
September 22, 2014
PR Newswire
View All





If you enjoy reading this site, you might also want to check out these UBM Tech sites:


 
The Chaos of Memory Corruption
by Karl Schmidt on 07/02/13 11:40:00 am   Expert Blogs   Featured Blogs

The following blog post, unless otherwise noted, was written by a member of Gamasutraís community.
The thoughts and opinions expressed are those of the writer and not Gamasutra or its parent company.

 

I had the pleasure of working on Warhammer 40,000: Dawn of War II Chaos Rising, and fixed a tricky bug involving memory corruption. It was a bit funny since the title had a ‘corruption system’ as a gameplay feature.

The problem was an intermittent crash that would occur after you defeat the final boss in the final level of the single player game. Luckily you didn’t have to complete the prior missions in the same session to trigger the crash, just playing the sole mission would suffice.

(For the non-programmers out there, we need to know what steps will recreate the crash, so we can reproduce it with additional tools enabled to figure out what happened.)

I would need a way of reproducing the crash quickly and repeatedly. Using mission script cheats,  I could skip over sections of the level to get to the problem area as fast as possible. It didn’t help that it was actually quite a fun level (designed by John Capozzi) so sometimes I would get lost in “actually playing” the level instead of skipping through to gather information on the crash :)

MemCorruptionImage

After recreating the crash a number of times (saving mini-dumps to share with others to get more eyes on it), it was clear we still didn’t fully understand what was going on. I finally narrowed down a set of steps that allowed me to semi-reliably recreate the crash in about 15 minutes (more or less, I don’t remember fully) – but that wasn’t enough. The call-stack varied with each crash, so the suspicion was memory corruption. I would need some tools to continue investigation.

I tried to use Microsoft App Verifier, but most options either didn’t help, caused too much of a slowdown (due to the game not being very efficient when it came to minimizing allocations) or would crash due to an out of memory situation. At the time my development PC was running Windows XP 32-bit, and the game on its highest settings had a high watermark of being close to the 32-bit application address space limit. It didn’t help that the crash was more likely to occur when running on very high settings. I tried disabling loading of textures, basically anything I could do to spare memory. I eventually stopped this approach since even with the easy memory-reduction steps, they still didn’t yield enough resources for the advanced App Verifier tests to run successfully.

I started thinking about when the crash was occurring. It was at the end of the level. The crash stacktrace usually ended up somewhere in the audio system. I then started disabling massive sections of the engine. It still happened with audio disabled, so that couldn’t be it. Same thing with physics. Eventually I narrowed down a combination of visual fidelity options that were key in reproducing the issue. One of them was related to ‘weather’…

The crash occurred at the end of the level, after you defeat the final boss. This was also right after (subtly, to me) the weather would transition from raining blood (it was a Warhammer game after all) to normal rain. It turns out the weather transition code was written after the original Dawn of War II shipped, under time pressure, but the feature was (very) rarely used. Could this be the culprit?

I blanket-commented-out the transition code and the crash stopped occurring. Then I went file-by-file (it was implemented in just a handful of source files), and eventually function by function until I read some code more closely. This is a very simplified version of what I found:

// container and container2 are std::vectors
for( int i = 0; i < container.size(); ++i )
{
    container[i] = something1;
}
for( int i = 0; i < container.size(); ++i )
{
    container2[i] = something2;
}

Do you see the issue? It’s a bit more apparent in this snippet than the actual code, since there was more going on in and outside of the loops, and variables had less generic names.

In the second loop, we are iterating over the length of container, but writing to container2. If container2.size() < container.size(), we have a problem called memory corruption due to overwriting past the bounds of an array.

The fix for this was to change the second loop to check if i < container2.size(). I tested extensively afterwards and was relieved that this resolved the crash. Hurray! Except I couldn’t shake the sense of dread that more bugs like this were lurking throughout the codebase. I dug into the implementation of std::vector we were using, and luckily there was a debug flag that does bound checking on [] accesses. Surprisingly I wasn’t able to trigger any more with the flag on, but by testing on the unaltered code it definitely triggered on the original case.

Unfortunately this didn’t get fixed before we shipped. I know it must have been frustrating to crash right after the final boss, in the final mission. I’m glad we were able to ship my fix in a patch less than a month after we shipped, even with the GFWL certification process slowing us down. Bugs like these remind me that the game industry could use some unit testing practices. If we were testing all features on a regular basis, I’m sure the issue would have cropped up sooner. At least the codebase only had one of this particular breed of bug!

 

“Thinking… please wait” image courtesy of http://www.flickr.com/photos/75608170@N00/3623768629/


This was originally posted at http://www.karlschmidt.net/the-chaos-of-memory-corruption/.


Related Jobs

Infinity Ward / Activision
Infinity Ward / Activision — Woodland Hills, California, United States
[09.20.14]

Senior AI Engineer
Infinity Ward / Activision
Infinity Ward / Activision — Woodland Hills, California, United States
[09.20.14]

Lead Tools Engineer - Infinity Ward
Insomniac Games
Insomniac Games — Burbank , California, United States
[09.19.14]

Senior Engine Programmer
Fun Bits Interactive
Fun Bits Interactive — SEATTLE, Washington, United States
[09.19.14]

Senior Engine Programmer






Comments


Kayne Ruse
profile image
Been there, done that, but admittedly on a smaller code base than this. Good to know I'm not the only one who goes through this ;)

Karl Schmidt
profile image
Unfortunately many have their own tale(s) of similar issues :)

David Amador
profile image
Oh, and the good old copy paste is the culprit :P
It's weird how something this "simple" can bring down a whole system. Thanks for sharing

Michael ELBAKI
profile image
Yeah, we had the exact same problem when porting Platypus to smartphones. Only the PalmOS version would crash, far into the game. Because it was PalmOS, the tools were utterly bad, no way to debug ARM code, very limited log, ...
And by disabling stuff here and there, finally read the two same blocks of for(;;)

Karl Schmidt
profile image
Sounds not fun, but good job for finding it!

Daniel Camus
profile image
Nice to read it, reminds me a project we had where a random crash occur at loading, always crashing at loading different assets, very hard to find, at the end we found that some guy thought that it was clever to declare global static pointers and initialize them with a function returning an address of a new inside of it...

This bug only occurs on Android, on Windows and iOS "worked" fine...

Karl Schmidt
profile image
Yes, bugs that occur intermittently, or only on certain platforms, are no fun at all. Thanks for sharing your story!

Lihim Sidhe
profile image
I had to Google array but other than that I was able to follow what was going on here. So why is it all Computer Science majors require advanced math like Calculus? Mr. Schmidt... would Calculus have been helpful in the debugging process?

Karl Schmidt
profile image
Calculus would not be helpful for this particular problem, but very helpful for other things like physics, curved surfaces, and other areas of computing. This is more of a logic/general problem-solving issue :)

Marc Audouy
profile image
using standard arrays/vectors with no out-of-bounds checks has always been a bad idea. curse C++ for sure, but any game engine I have seen has custom implementations of arrays, with these kind of features in.

Ryan Christensen
profile image
It is a bit refreshing to see STL being used over custom objects, arrays, string classes etc. Like you said, most game engines have custom objects/arrays/strings/threads/mutexs/circular queues/ etc for varying reasons (multiplatform, special needs, memory management or custom allocators, ego etc) but I have seen way more bugs due to custom objects when STL containers would have sufficed. This is a programmer error not really a library error. The cool + useful thing about using standard libs if no other need exists is it is much more portable and not locked to a single set of engine programmers (also reduces ramp up time for new devs). There is a strong desire to recreate everything in games when stl, boost, poco or other libs have more solid base classes in many cases.

Karl Schmidt
profile image
We were using STLPort with custom allocators where they made sense. As I mentioned in the article, it was after I found the problem that I discovered that there was a flag to enable bounds checking in the vector implementation we were using. It's a performance trade-off whether to leave such a flag on or off, though.

Matthew Fioravante
profile image
Another endorsement for the C++11 foreach loop.

Matthew Bentley
profile image
Why not iterators & from .begin to .end? Were you not doing any additions to the vector?

Karl Schmidt
profile image
I don't quite remember (and no longer have access to the source code) but I believe there may have been indexing into another vector of the same size going on in one of the loops. So it may have been more convenient to index by a number instead of creating and incrementing multiple iterators.

Jeff Lansing
profile image
IMO this isn't a problem for unit tests or code reviews --> this is a job for assertions. Loud ones.

Asserts can (and should) be implemented as a macro that compiles out of final builds completely. Sure, there is a performance cost in development builds but the amount of time asserts save you will far surpass this hit. Using things like hardware breakpoints and core dumps to find overruns is NOT efficient at all.

Furthermore, not all asserts actually need to be fatal, and a big fat error message will usually be sufficient for finding these problems as soon as they start. I've seen some which actually use __debugbreak so you get the full call stack in the debugger that lead to the crash/problem when it happens.

Exceptions are way too slow. Return codes are impractical for deep callstacks. Both let things get swept under the rug. Asserts don't. Use them liberally.

Karl Schmidt
profile image
The Essence engine (especially the DOW2 branch) used assertions heavily, customized based on build target, calls __debugbreak, displays a custom dialog, even sends an email with a stacktrace, etc - but unless it was asserting on out-of-bounds access for std::vector, they wouldn't have helped in this situation :)

I completely agree with loud assertions :)


none
 
Comment: