Post Mortem - Death of a test-suite
The thoughts and opinions expressed are those of the writer and not Gamasutra or its parent company.
In most QA departments there is a test-suite containing the test-cases the QA team runs to cover a myriad of things: game features, content, 100% playthroughs and so on. On the live MMO game I worked on, these test-cases were used regularly on new content, alongside the standard tests performed with each update, to promote quality and confidence in our product, and they had been honed and tweaked over the years. Your team will no doubt have a similar test-suite, but the question I'd like to pose is:
What would happen if all the test-cases in your test-suite disappeared?
This was a question I asked at a QA conference in response to a talk centered around handling large test-suites. The speaker's response centered on their company's IT policy for protecting and restoring the data, but it left me wondering how their department would function should it all hypothetically disappear. I asked this pointed question because this is exactly what happened on a project I worked on. What follows is a post-mortem of what happened and the team's response to the situation.
Our test-suite consisted of several hundred individual test-cases grouped together by a common theme. For instance, if a new weapon was being tested, we'd summon the "New Weapon" template, which would populate common test-cases such as its name, stats and tradeability, as well as ones unique to that asset type (equipping checks, damage checks). Furthermore, whatever made that weapon unique would be reviewed and specific test-cases would be added to the test-script. These templates were detailed, built up over many years of testing, having originally grown from an Excel "master" document.
Over time, we translated the Excel doc into a database and created a bespoke web front end that used that database to auto-generate test-plans for us. Test-plans created for new systems were integrated into the test-suite for ongoing testing. When a bug went out into the wild, a test-case was typically added to the relevant template to check for it in future. As such, the test-suite ballooned in size.
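The template mechanism described above can be sketched in a few lines. This is only a minimal illustration of the idea; the template name, the test-case wording and the function are all invented for the example, not taken from the actual tool, which was a database behind a bespoke web front end.

```python
# Illustrative sketch of a template-driven test-plan generator.
# Shared test-cases live in a template keyed by asset type; a plan is
# the template plus whatever is unique to the asset under test.

TEMPLATES = {
    "New Weapon": [
        "Check the weapon's name matches the design doc",
        "Check the weapon's stats",
        "Check tradeability rules",
        "Equipping checks",
        "Damage checks",
    ],
}

def generate_test_plan(template_name, extra_cases=()):
    """Build a test plan from a shared template plus asset-specific cases."""
    plan = list(TEMPLATES[template_name])
    plan.extend(extra_cases)  # whatever makes this particular asset unique
    return plan

# A new flaming sword would get the common weapon checks plus its own case.
plan = generate_test_plan("New Weapon", ["Check the flame effect on hit"])
```

Adding a regression check for an escaped bug, as the team did, is then just appending one more line to the relevant template, which is exactly how the suite ballooned.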
This test-case-focused culture resulted in the QA team spending a lot of time on the Pass/Fail nature of everything. This benefited new starts, as we could sit them down with a test-script detailed enough for them to hit the ground running. Longer term, however, it produced a tick-box culture, with performance driven by how many test-cases were completed and with exploratory and destructive testing taking a back seat.
Testers reported trouble accessing our bespoke centralized test-plan software. The tools developer responsible for its maintenance (who is embedded in our QA team) was informed, and it quickly became clear that the error was on the hosting/networking side, so it was escalated to the sysadmin team. Testers were advised to carry on testing from memory where possible, switching to destructive and unscripted testing in the meantime.
Day 0 - 6 hours
It became clear that something significant had happened. The machines that hosted both the software and the database had become corrupted, and both needed to be relocated to a new machine. Our developer was given a new server and requested a copy of the backup. Worst case, we assumed, we would have lost that day's worth of data.
It turned out that our test-suite and its database were not considered a "critical system", so there was no backup. Our developer had a local copy of the software on his machine, but not the database. The sysadmin team tried to recover the corrupted data. The QA team were advised to carry on testing, using their best judgement; they also got together to brainstorm and identified areas to focus testing around. Morale was dropping as rumours grew that the whole database was gone.
The attempt to recover the database failed and it was confirmed that all the data was lost. The QA leadership group got together to form a plan, and it was decided that we'd try to recreate the test-suite from memory. We dug out the old Excel "master" sheet to help us reference specific tests to include. The software was back online, but without the database the team was still working blind. Some of the newer members struggled, whereas the more experienced testers were able to pick up ad-hoc testing quite easily. Many of them enjoyed the freedom and the break from the admin-heavy approach.
After an all-day meeting to map out the high-level test-cases we had lost (we agreed to fill in the detail later), one of the game's updates went live. The update itself was stable, testers seemed happier with the admin-light approach and, in general, the world hadn't stopped spinning due to the loss of the test-suite.
This led us to a radical thought: "Let's use this as a clean slate and re-design the test-suite around our ideals and strengths, and modernize it for today's challenges." This paradigm shift was welcomed, and within a few hours we had a lightweight vision of the test-suite that was minimalist in nature, promoted coverage based on risk and inspired testers to explore their craft.
Before the data loss, an average test-script for an in-game item ran to ~80 test-cases. Now it was down to ~12. The key to this was a change in language. Rather than a separate test-case to check each stat of a weapon, we had a single instruction: "Evaluate the stats of the weapon, determine if they fit with the asset's purpose and ensure that they reflect the design principles".
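The contrast is easiest to see side by side. A sketch of the two styles, with stat names invented purely for illustration (the real templates covered far more than five stats):

```python
# Old style: one granular Pass/Fail test-case per stat, generated
# mechanically, so the script grows linearly with the number of stats.
STATS = ["attack", "defence", "speed", "weight", "durability"]

old_style = [
    f"Check that the weapon's {stat} matches the design doc"
    for stat in STATS
]

# New style: a single evaluative charter that asks the tester to judge
# the stats in context, however many there are.
new_style = [
    "Evaluate the stats of the weapon, determine if they fit with the "
    "asset's purpose and ensure that they reflect the design principles"
]
```

The old list gets longer with every stat a designer adds; the new charter does not, which is how scripts shrank from ~80 cases to ~12.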
This shift ensured that our testers looked at the stats collectively as well as individually, and forced them to think about the context of the numbers rather than just checking that each value matched something stated in a design doc. The instruction to "Evaluate" gave team members a real sense of ownership and a genuine stake in the quality of the area they were testing.
The majority of the revamped test-cases were then entered into the newly secured database, which was now being backed up. The new philosophy was documented, and the team was given some training in the approach and ethos we were promoting. We monitored updates as they were released and found that live bugs actually went down as a result of this initiative.
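The backup lesson can be made concrete with a sketch of a simple retention policy for nightly database dumps. Everything here (the file naming scheme, the seven-dump window, pruning in Python rather than via the sysadmins' tooling) is an assumption for illustration, not the team's actual setup.

```python
import os
import tempfile
from datetime import date, timedelta

def prune_backups(directory, keep=7):
    """Keep only the `keep` most recent dump files; delete the rest.

    Assumes dumps are named with an ISO date so lexicographic order
    matches chronological order, e.g. testsuite-2024-01-05.dump.
    """
    dumps = sorted(f for f in os.listdir(directory) if f.endswith(".dump"))
    for stale in dumps[:-keep]:
        os.remove(os.path.join(directory, stale))
    return dumps[-keep:]

# Demo with ten fake nightly dumps in a temporary directory.
with tempfile.TemporaryDirectory() as backup_dir:
    for i in range(10):
        day = date(2024, 1, 1) + timedelta(days=i)
        open(os.path.join(backup_dir, f"testsuite-{day}.dump"), "w").close()
    kept = prune_backups(backup_dir, keep=7)  # oldest three are removed
```

The point is less the script than the habit: a dump that is created, rotated and occasionally restored is what separates a "critical system" from the unbacked-up database this story started with.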
We learned a lot during this process and, looking back, it was a blessing. Effecting change on entrenched processes is very difficult to pull off, with transitional periods awkward to navigate on their own. The loss of the database made the idea seem less risky than the alternative: a long-winded restoration from memory that would never quite be the same as before. The loss also gave us some grace to try this new approach, as we could weather any reduction in quality with the perfect scapegoat to hand.
The incident also showed that QA artifacts such as test-plans are important assets deserving the same safeguarding as other business-critical services, and that we should have stronger plans in place for worst-case scenarios. The most important learning was how well the team performed and adapted to the shift in testing philosophy. Another outcome was that testers now understood the reasoning behind executing each test, so we are more likely to fix the root cause of an issue rather than keep checking for it forevermore, "just in case".