Gamasutra: The Art & Business of Making Games
My Hardest Bug Ever
by Dave Baggett on 10/31/13 02:36:00 pm   Expert Blogs   Featured Blogs

The following blog post, unless otherwise noted, was written by a member of Gamasutra’s community.
The thoughts and opinions expressed are those of the writer and not Gamasutra or its parent company.

 

(Originally posted on Quora)

As a programmer, you learn to blame your code first, second, and third... and somewhere around 10,000th you blame the compiler. Well down the list after that, you blame the hardware. 

This is my hardware bug story.

Among other things, I wrote the memory card (load/save) code for Crash Bandicoot. For a swaggering game coder, this is like a walk in the park; I expected it would take a few days. I ended up debugging that code for 6 weeks. I did other stuff during that time, but I kept coming back to this bug -- a few hours every few days. It was agonizing.

The symptom was that you'd go to save your progress and it would access the memory card, and almost all the time, it worked normally... But every once in a while the write or read would time out... for no obvious reason. A short write would often corrupt the memory card. The player would go to save, and not only would we not save, we'd wipe their memory card. D'Oh.

After a while, our producer at Sony, Connie Booth, began to panic. We obviously couldn't ship the game with that bug, and after six weeks I still had no clue what the problem was. Via Connie we put the word out to other PS1 devs -- had anybody seen anything like this? Nope. Absolutely nobody had any problems with the memory card system.

About the only thing you can do when you run out of ideas debugging is divide and conquer: keep removing more and more of the errant program's code until you're left with something relatively small that still exhibits the problem. You keep carving parts away until the only stuff left is where the bug is.

The challenge with this in the context of, say, a video game is that it's very hard to remove pieces. How do you still run the game if you remove the code that simulates gravity in the game? Or renders the characters? 

What you have to do is replace entire modules with stubs that pretend to do the real thing, but actually do something completely trivial that can't be buggy. You have to write new scaffolding code just to keep things working at all. It is a slow, painful process.
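
To make the stubbing idea concrete, here is a minimal sketch in Python (not the language Crash was written in; every name below is hypothetical). The real module is swapped for a stub with the same interface that does nothing interesting, so the rest of the program still runs while one more suspect is ruled out.

```python
class Renderer:
    """Stands in for a real, complicated module that might hide the bug."""

    def draw(self, scene):
        # Imagine thousands of lines of rendering code here.
        return f"drew {len(scene)} objects"


class StubRenderer:
    """Same interface as Renderer, but too trivial to be buggy."""

    def draw(self, scene):
        return "stub: nothing drawn"


def run_frame(renderer, scene):
    """The rest of the program calls the module only through its interface."""
    return renderer.draw(scene)


scene = ["crash", "crate", "wumpa fruit"]
full_result = run_frame(Renderer(), scene)      # full program
stub_result = run_frame(StubRenderer(), scene)  # rendering carved away
```

If the bug still shows up with the stub in place, the stubbed-out module is presumably not the culprit, and the next module can be carved away.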

Long story short: I did this. I kept removing more and more hunks of code until I ended up, pretty much, with nothing but the startup code -- just the code that set up the system to run the game, initialized the rendering hardware, etc. Of course, I couldn't put up the load/save menu at that point because I'd stubbed out all the graphics code. But I could pretend the user used the (invisible) load/save screen and asked to save, then write to the card.

I ultimately ended up with a pretty small amount of code that exhibited the problem -- but still randomly! Most of the time, it would work, but every once in a while, it would fail. Almost all of the actual Crash code had been removed, but it still happened. This was really baffling: the code that remained wasn't really doing anything.

At some moment -- it was probably 3am -- a thought entered my mind. Reading and writing (I/O) involves precise timing. Whether you're dealing with a hard drive, a compact flash card, a Bluetooth transmitter -- whatever -- the low-level code that reads and writes has to do so according to a clock.

The clock lets the hardware device -- which isn't directly connected to the CPU -- stay in sync with the code the CPU is running. The clock determines the Baud Rate -- the rate at which data is sent from one side to the other. If the timing gets messed up, the hardware or the software -- or both -- get confused. This is really, really bad, and usually results in data corruption.

What if something in our setup code was messing up the timing somehow? I looked again at the code in the test program for timing-related stuff, and noticed that we set the programmable timer on the PS1 to 1kHz (1000 ticks/second). This is relatively fast; it was running at something like 100Hz in its default state when the PS1 started up. Most games, therefore, would have this timer running at 100Hz.

Andy, the lead (and only other) developer on the game, set the timer to 1kHz so that the motion calculations in Crash would be more accurate. Andy likes overkill, and if we were going to simulate gravity, we ought to do it as high-precision as possible!

But what if increasing this timer somehow interfered with the overall timing of the program, and therefore with the clock used to set the baud rate for the memory card?

I commented the timer code out. I couldn't make the error happen again. But this didn't mean it was fixed; the problem only happened randomly. What if I was just getting lucky?

As more days went on, I kept playing with my test program. The bug never happened again. I went back to the full Crash code base, and modified the load/save code to reset the programmable timer to its default setting (100 Hz) before accessing the memory card, then put it back to 1kHz afterwards. We never saw the read/write problems again.
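
The shape of that workaround can be sketched as follows. This is Python standing in for the original code, and `FakeTimer`, `timer_rate`, and `save_to_card` are hypothetical names for illustration, not PS1 library calls. The point is only the pattern: drop to the default rate, do the I/O, and restore the fast rate even if the I/O fails.

```python
from contextlib import contextmanager

DEFAULT_HZ = 100   # roughly the power-on rate described above
GAME_HZ = 1000     # the rate the game ran its timer at


class FakeTimer:
    """Hypothetical stand-in for the programmable timer."""

    def __init__(self, hz):
        self.hz = hz


@contextmanager
def timer_rate(timer, hz):
    """Temporarily run the timer at `hz`, restoring the old rate afterwards."""
    previous = timer.hz
    timer.hz = hz
    try:
        yield timer
    finally:
        timer.hz = previous  # restored even if the card access raises


def save_to_card(timer, card, data):
    """Hypothetical save routine: all card I/O happens at the default rate."""
    with timer_rate(timer, DEFAULT_HZ):
        card.append((timer.hz, data))  # record the rate in effect during the write


timer = FakeTimer(GAME_HZ)
card = []
save_to_card(timer, card, b"progress")
```

After the call, the simulated card holds one record written at 100 Hz and the timer is back at 1 kHz.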

But why?

I returned repeatedly to the test program, trying to detect some pattern to the errors that occurred when the timer was set to 1kHz. Eventually, I noticed that the errors happened when someone was playing with the PS1 controller. Since I would rarely do this myself -- why would I play with the controller when testing the load/save code? -- I hadn't noticed it. But one day one of the artists was waiting for me to finish testing -- I'm sure I was cursing at the time -- and he was nervously fiddling with the controller. It failed. "Wait, what? Hey, do that again!"

Once I had the insight that the two things were correlated, it was easy to reproduce: start writing to memory card, wiggle controller, corrupt memory card. Sure looked like a hardware bug to me.

I went back to Connie and told her what I'd found. She relayed this to one of the hardware engineers who had designed the PS1. "Impossible," she was told. "This cannot be a hardware problem." I told her to ask if I could speak with him.

He called me and, in his broken English and my (extremely) broken Japanese, we argued. I finally said, "just let me send you a 30-line test program that makes it happen when you wiggle the controller." He relented. This would be a waste of time, he assured me, and he was extremely busy with a new project, but he would oblige because we were a very important developer for Sony. I cleaned up my little test program and sent it over.

The next evening (we were in LA and he was in Tokyo, so it was evening for me when he came in the next day) he called me and sheepishly apologized. It was a hardware problem.

I've never been totally clear on what the exact problem was, but my impression from what I heard back from Sony HQ was that setting the programmable timer to a sufficiently high clock rate would interfere with things on the motherboard near the timer crystal. One of these things was the baud rate controller for the memory card, which also set the baud rate for the controllers. I'm not a hardware guy, so I'm pretty fuzzy on the details.

But the gist of it was that there was crosstalk between individual parts on the motherboard, and the combination of sending data over both the controller port and the memory card port while running the timer at 1kHz would cause bits to get dropped... and the data lost... and the card corrupted.

This is the only time in my entire programming life that I've debugged a problem caused by quantum mechanics.

 


Dave Baggett was the first employee at Naughty Dog and one of two programmers on Crash Bandicoot. Dave now focuses on curing inbox overload at his new startup, Inky.



Comments


Ben Sunshine-Hill
Well, we're working with semiconductors here. In a sense, all bugs are caused by quantum mechanics.

Dave Baggett
True, but most of the time what you're debugging is deterministic, because you're operating at a level where you can ignore the quantum non-determinism. E.g., your Python code should run the same way every time, even though the whole system rests on a quantum mechanical base.

This bug felt different, because the symptoms were non-deterministic -- and the non-determinism arose out of electrical interference.

Karl Schmidt
Great article - and what a 'fun' bug :)

Kris Graft
I'd love to hear other interesting bug stories here in the comments!

Thomas Happ
There was one time we came across a bug in a rather widely used 3rd party UI library where certain DX11 fences weren't being set up properly - basically to verify that the GPU was finished with the current action before messing with it again - with the result that the vertex buffers were being messed up in hardware and making the game UI flicker and glitch like mad. The most alarming part was that using either PIX or Nsight, the vertex buffers looked perfectly sane - evidently they record your DirectX calls at some higher level, so when you play them back, the output was uncorrupted.

Fortunately it was a known bug to the 3rd party and the solution was for them to release a new patch, but even so, I was completely baffled for several days, over the course of which I stripped out all of the game's functionality save the UI.

Aaron Grossman
Great story, thanks for sharing!

David Farrell
Many moons ago I was developing Direct3D drivers for a video card company. This particular video card used a MIPS chip for command processing, and stored its microcode in the same video ram pool as textures, render targets, etc.

There was an occasional graphics card crash when running the D3D test programs. We discovered the crash was caused by a bad opcode in the microcode. Everything surrounding the bad opcode was fine, but a few bytes here and there were corrupted.

It appeared to be a hardware problem, and the hardware engineers started looking at the issue. In the meantime, I noticed a pattern to the corrupted bytes. It was always four bytes that were corrupted, and often the bytes would be 0xffffff00, or something similar, but with 0x00 in the lower byte. I realized that this looked a lot like a depth/stencil write, and checked the state of the depth buffer start address. It was 0, which happened to be the location in video memory of the microcode.

It turns out that on context switches to 2D mode, the driver would set the depth buffer start address to 0, but there was a bug in the switch back to 3D mode that caused the depth buffer start address to not be restored. The chip would then start rendering 3D with the microcode as the depth buffer, which actually did work for a while (most microcode bytes happened to cause the depth test to fail and so a new depth value wouldn't be written). Sooner or later, though, the depth test would pass and a new value would be written out, corrupting the microcode.
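
The failure mode described above, state overwritten on the way into one mode and never restored on the way out, can be sketched in a few lines. This is a toy Python model with hypothetical names and addresses, not the actual driver: the switch to 2D saves the depth buffer address before zeroing it, and the switch back restores it.

```python
class ToyDriver:
    """Toy model of a driver that switches between 2D and 3D modes."""

    MICROCODE_ADDR = 0x0000  # in this model, microcode lives at address 0

    def __init__(self, depth_addr):
        self.depth_addr = depth_addr
        self._saved_depth_addr = None

    def enter_2d(self):
        # Save the 3D state before clobbering it.
        self._saved_depth_addr = self.depth_addr
        self.depth_addr = 0

    def enter_3d(self):
        # A buggy version would skip this restore, leaving depth_addr at 0.
        if self._saved_depth_addr is not None:
            self.depth_addr = self._saved_depth_addr
            self._saved_depth_addr = None

    def depth_writes_hit_microcode(self):
        return self.depth_addr == self.MICROCODE_ADDR


driver = ToyDriver(depth_addr=0x4000)
driver.enter_2d()
in_2d = driver.depth_writes_hit_microcode()   # address is 0 while in 2D mode
driver.enter_3d()
in_3d = driver.depth_writes_hit_microcode()   # restored, so no overlap
```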

David Farrell
To connect that back to the blog post: "As a programmer, you learn to blame your code first, second, and third... and somewhere around 10,000th you blame the compiler. Well down the list after that, you blame the hardware." That was true in my case; it was absolutely a software issue, and not a compiler or hardware bug.

Jim Buck
"Absolutely nobody had any problems with the memory card system." - It sounds like you weren't asking the right people since EVERYONE had problems with the memory card system on the PS1, if judging by the newsgroups (what are those?? ;) ) at the time is a relevant data point. :)

Having said that, we've fairly occasionally come across compiler bugs that would do weird things every once in a great while on platforms as far back as PS1.

Dave Baggett
LOL. My memory is fuzzy, but "newsgroups" -- yes, yes, those sound familiar. ;)

Bill Todd
No way. Everything on PS1 was perfect Jim. I thought another major issue was LoadExec(). It seemed like a lot of teams would build their front end as another executable and switching between the two would fail randomly. Sony Japan would always say "impossible!" lol.

The memory card system was ugly on PS1. No other way to describe it.

Chris Clogg
I wish I had bug stories as interesting as these haha. One scary bug (in Stratosphere Multiplayer Defense) was related to the iOS 5 beta; Apple had done something that made 2 buttons not clickable at the same time, and we basically had to pray it would get fixed by the time actual iOS 5 came out (and luckily it did).

Another bug I remember was when a player hit their iPad's home button, it removed every currently-running animation (Apple just does that for some reason with Core Animation). So when the user would come back from multitasking or whatever, the game would be totally messed up... monsters would be snapped ahead and all sorts of weird behavior. I couldn't find any resources/posts/documentation on this, and was pretty hopeless for a while, until I ended up fixing it by storing every animation in memory before app-exit, and re-adding them to each object upon return (moment of magic when it worked lol). The funny thing is I wrote my solution on StackOverflow, and now that one post has 20 up-votes and is basically my only source of points on that site heh.

Christopher Drum
Core Animation is OpenGL-based. As calling OpenGL in the background is verboten, Apple stops Core Animation routines when the player hits the Home button. If I recall, however, Apple's Core Animation Programming Guide says to do exactly what you discovered, as far as storing/restoring said animations.

Maurício Gomes
I spent weeks chasing around a bug that made my tiled games look weird on iOS. Since I had just changed SDKs, I thought it was the new SDK's fault for a long while.

Until I noticed: The bug was ONLY on iOS, ONLY on PNG files, and ONLY on PNG files that I had re-compressed.

End result: It is a bug in iOS that Apple never fixed, here is the stackoverflow of it: http://stackoverflow.com/questions/9630870/ios-5-1-with-xcode-4-3-1-uicolor-colorwithpatternimage-strange-behavior-only

Note about that bug: it happened regardless of how you put images in memory; whether it was OpenGL or not did not matter, it DID happen.

Michael Joseph
Und how did zat make you feel? 6 weeks of hunting with a producer entering panic mode and cursing your ancestors, your dog and your cat behind your back (I jest of course :p) sounds like fun times...

but finally solving the issue / providing a workaround and proving it to be a hardware issue... priceless?

Karl Schmidt
It's a PNG decompression issue - that's why the behaviour would be the same if you display with UIKit or straight OpenGL (I believe UIKit is using OpenGL anyways, but I digress)

(This is in response to Maurício Gomes but I messed up and didn't comment properly)

Kale Menges
I'm curious as to what Sony's ultimate response to the hardware problem was. Did they address hardware configurations in later models or simply document new developer guidelines or something?

Dave Baggett
I believe they slipstreamed in a fix, but my memory is fuzzy on that point.

Maurício Gomes
I believe they did not, considering the amount of games that still wipe out memory cards (this also applies to PS2, most infamously with Soul Calibur 3 Chronicle of the Sword mode)

E Zachary Knight
This reminds me of something similar that happened as a gamer. I bought the Final Fantasy Anthology games for the PS1. This is the FF5 and FF6 combo pack. I had a PS2 at the time. Every once in a while, my FF5 save files would become corrupted and I would lose them. The save screen sprites would corrupt and I would have to exit out of the full menu for the sprites to revert to their proper selves.

I reported it back to Square and Sony, but neither company admitted to being able to recreate the bug. I still don't know what was going on. I have not yet tried the games on my PS3, but probably should at some point to see if I continue to get the problem.

Arnaud Clermonté
I was given a fun hardware/code18 bug once:

Repro steps:
In the game, go to such room and face the door.
Then put down the controller and have your lunch break.
When the lunch break is over, the door has vanished.
Happens on PS3 only.

It made no sense to me, yet when I followed the repro steps, it happened exactly as described! O_O
It didn't work if I just let the game soak for an hour. No. It had to be an actual lunch break for the bug to happen.
After searching through the game log, I found out the chain of events:

When you put down a PS3 controller on a desk and push it away so that you can browse the web or eat, there is an almost 100% chance of accidentally pressing the trigger button which is mapped to the "throw grenade" action... which destroys the door.

Shea Rutsatz
Ha! That's a good bug.

Chase Sechrist
This reminds me of the intro to Crash 2. "Crystals... Of course!"

Albert Meranda
Love stories like this. Thanks!

Diego Merayo
I had to deal with a very fun bug in our scheduling system that was causing a crash in a third-party library because it uses TLS (thread-local storage). In our code there was a small window of a few opcodes where, if a context switch happened, it could change the thread affinity for the job running that library, so when the library tried to access TLS that didn't exist in the new thread, it would blow up. Of course it was random and you had to leave the game soaking for hours to hit it, and we could not add any debugging code to the scheduler because that would make it run slower, and then the crash would not happen.

Also this is what I call "fun":
http://pythonsweetness.tumblr.com/post/64740079543/how-to-lose-172-222-a-second-for-45-minutes

Ron Dippold
I had an at least four month saga (not full time) where suddenly boards in the field would have the non-volatile memory (nvm) die, and of course that's a return. The symptom was you try to write to the nvm and that entire thread locks up forever, which eventually brings everything down.

The SDK we were using had a wear leveling method that looked sound enough. They were way overlogging things, so I cut that way back, but the numbers still didn't add up. Poring through the data sheets... everything looks fine. They were using the part as described.

I added a bunch more code to debug the problem. Things were locking up because there was no error or timeout checking, so I added that and some code to spew errors when it happened. So now at least the device would keep working. I added a total bytes written counter (this only got written occasionally, so that was not a catastrophic feedback loop). Stress out some more boards, and they were still dying way too fast. 20k writes per byte? More like 200!

Finally with all this logging I noticed that the failure was mostly that flipping the chip write protect bit from on to off (for a write) would fail. The original code was then waiting for the bit to flip in the readback, but this would never come. It was like the bit was stuck. How could a register get stuck? I added some code to keep pounding it, and if you keep pounding on it it might unstick, but once it happened it was more and more likely. This sounded a lot like a physical problem. We were able to reproduce this with a new chip just sitting in a breadboard.

Finally, after hours on the phone to the company that makes the part, we got an admission over the phone (never in email or writing) that the config register has a 10M write limit. The status register apparently has some slightly hardier flash under it. This is nowhere in the databook or anywhere else, and the people who wrote the SDK had no idea, so they just flipped that bit with abandon. After all, it's just like toggling a GPIO. Who worries about that? If you did a firmware burn it would flip at every page, so that's 80 thousand writes burned just for a firmware upgrade. Every time you logged or saved something, that's two more writes. They add up fast.

I rewrote the code to be a lot more conservative about touching that register - problem solved.
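
A rough sketch of why the writes add up, and of one way to be more conservative. The 10M limit and the 80 thousand writes per firmware burn are the figures from the comment above; the `WriteProtect` class and its caching strategy are assumptions for illustration, not the actual rewrite.

```python
WRITE_LIMIT = 10_000_000          # config register write limit quoted above
WRITES_PER_FIRMWARE_BURN = 80_000

# Firmware burns alone would exhaust the register after this many upgrades.
burns_until_worn_out = WRITE_LIMIT // WRITES_PER_FIRMWARE_BURN


class WriteProtect:
    """Hypothetical wrapper that touches the register only on a real change."""

    def __init__(self):
        self.enabled = True
        self.register_writes = 0

    def set(self, enabled):
        if enabled == self.enabled:
            return  # already in the requested state: skip the costly write
        self.enabled = enabled
        self.register_writes += 1


wp = WriteProtect()
wp.set(False)            # unlock once for a whole batch of page writes
for _ in range(1000):
    wp.set(False)        # redundant requests cost nothing
wp.set(True)             # lock again when the batch is done
```

In this model a batch of a thousand page writes costs two register writes instead of two per page.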

Andy Mentges
Now that is a hard bug to track down! It was an EMI/EMC problem though. Not quantum mechanics. EMI is a lot easier to deal with than quantum since it is just vanilla physics.

Rick Duggan
Mine involves what I first thought was a software configuration error that ended up needing to be debugged with a multimeter.

We had just gotten a new peripheral piece of equipment that connected to our system via an RS-232 cable. The guys in the lab installed it but left it to me to connect. I went down to our lab in the basement, grabbed a crossover cable, and plugged in the two systems. The peripheral clicked when I plugged in the cable, so I figured it had recognized the connection. I went back up to my desk on the 2nd floor and remoted in to our system. I then tried to connect to the new peripheral. No dice. So I make sure the RS-232 settings (baud rate, stop bits, etc.) are all correct. Yup, good.

Well, maybe I should RTFM, right? So I grabbed the manual and it described the exact set of steps I had just done. So I started to think maybe the cable was bad, so I went down to the lab and grabbed another one. Swapped them out, same good click.

Back upstairs, try to connect again. Still no good. This time I go back to the lab and grab a multimeter to make sure the crossover cable was wired correctly. It was. At this point I want to connect to the console, but that particular system didn't have one (which is why I had to keep going upstairs). So back upstairs just to check. No good again.

Back downstairs, I decided to start checking voltages. I can't remember if it was 5V or 12V, but either way, both sides were signaling correctly at that voltage. But when I checked the grounds between the two systems, there was a 20V difference! Turns out the lab guys had wired the peripheral to incoming power, whereas our stuff was on a separate filtered power grid. The 20V ground difference made it impossible for the systems to detect the signaling voltage. And the clicking I had heard was not a good click. It was a protection relay in the peripheral disconnecting the RS-232 board from the rest of the system to prevent it from being damaged.

Got the power rewired and everything worked like a champ!

Ron Dippold
I'm noticing a theme here... and looking back, all my best bugs do involve hardware.

Though from this month's Retro Gamer, there was a bug in Nigel Adderson's Spectrum conversion of Commando that involved hitting pause. The fastest way of changing things on screen was moving the stack to the screen memory and pushing things on the stack (!), but then you had to be careful the 50Hz refresh interrupt didn't clobber you. There was one tester, and only one tester who could blow up the game on the bridge level. She'd end up pausing it, and there was a two byte stack miscalculation in that section and that 50 Hz refresh would come by and clobber it while it was paused. That's kind of hardware, but not the hardware's fault.

