Automated Tests and Continuous Integration in Game Projects
March 29, 2005
Many game projects are either significantly delayed or shipped in a rather buggy state. Certainly, this situation isn't unique to the games industry - for instance, according to the infamous "Extreme Chaos" report released by The Standish Group in 2001, more than 70% of all software projects are either cancelled or significantly exceed their planned development time and budget. However, since games represent a very complex case of software development where people skilled in rather different disciplines have to cooperate, one might argue that the development risks inherent in game projects are particularly high.
The reasons for delayed, bug-infested or even failed software projects are manifold, but it seems that, besides feature creep and shifting priorities, testing and quality assurance are recurring themes. In our experience, a large number of development studios rely entirely on manual testing of the underlying game engine, their production tools, and the game code itself, while automated processes are only rarely adopted. Similarly, in the 2002 GDC roundtable "By the Books: Solid Software Engineering for Games", only 18 percent of the attendees said that the projects they were working on employed automated tests.
We were first confronted with the notion of automated testing when, in the year 2000, customers of our then still very young middleware company complained about stability issues and bugs in our 3D engine. Until that time, we had relied on manual tests performed by the developers after implementing a new feature, and on the reports of our internal demo writers who were using these features to create technology demos for marketing purposes. After thoroughly analyzing the situation, we came to the conclusion that our quality problems were mostly related to the way we were testing our software:
- Testing wasn't performed thoroughly enough, because it simply took too much time. Whenever some code was changed, or new code was added, it would have been necessary to execute a defined set of manual tests to make sure the modifications hadn't introduced problems anywhere else. Manual testing took more and more time, which led to frustration on the side of the developers and reduced their motivation to actually execute the tests. Additionally, the amount of work involved in testing made developers reluctant to improve or optimize existing code.
- When developers manually tested their own code, they often showed a certain (subconscious?) tendency to avoid the most critical test cases, so the scenarios a problem was most likely to occur in were also the situations least likely to be tested.
As a result, we decided to adopt automated testing, starting with a new component of our SDK which we had just started to develop. The results were encouraging, so we finally expanded our practice of automated testing to all SDK components.
Automated tests have become popular with eXtreme Programming, a collection of methodologies and best practices developed by Kent Beck and Martin Fowler. Generally, automated tests refer to code or data that is used to verify the functionality of subsets of a software product without any user interaction. This may range from tests for individual methods of a specific class (commonly called unit tests) to integrated tests for the functionality of a whole program (functional tests).
In order to facilitate the creation of automated tests, there are a number of open-source unit testing frameworks, such as CppUnit (for C++ code) or NUnit (for .NET code). These testing frameworks provide a GUI to select the tests to run and to display feedback about the test results. Depending on your project, it might be necessary to extend these frameworks with additional functionality required for your game, such as support for multiple target platforms.
In the context of such a testing framework, a single unit test corresponds to one function, and multiple unit tests are aggregated in test classes along with methods for initializing and de-initializing a test (e.g. loading and unloading a map). These test classes can in turn either be located in a separate executable - for instance, when the code to be tested resides in its own DLL - or in the main project itself. Regardless of this, test classes should always be stored in separate files from your production code, so they can conveniently be removed from builds intended for deployment.
Sample Code for simple tests of a physics system. If any of the VTEST conditions are false, the test fails.
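A minimal sketch of what such tests could look like is given below. Only the name VTEST comes from the caption above; the macro definition, the Body struct, and the stepBody function are hypothetical stand-ins for illustration and not the engine's actual API.

```cpp
#include <cassert>
#include <cstdio>

// Hypothetical stand-in for the VTEST macro: records a failure and reports
// the source location whenever the condition is false.
static int g_failedChecks = 0;
#define VTEST(cond)                                                    \
    do {                                                               \
        if (!(cond)) {                                                 \
            ++g_failedChecks;                                          \
            std::printf("VTEST failed: %s (%s:%d)\n", #cond, __FILE__, \
                        __LINE__);                                     \
        }                                                              \
    } while (0)

// Minimal illustrative rigid body; not a real physics API.
struct Body {
    float height;    // metres above the ground plane
    float velocity;  // metres per second, positive is up
};

// Advances the body by dt seconds under gravity (semi-implicit Euler)
// and clamps it to the ground plane at height 0.
void stepBody(Body& b, float dt) {
    assert(dt > 0.0f);
    const float gravity = -9.81f;
    b.velocity += gravity * dt;
    b.height += b.velocity * dt;
    if (b.height < 0.0f) {
        b.height = 0.0f;
        b.velocity = 0.0f;
    }
}

// Unit test 1: a body released at rest must start moving downwards.
void testBodyFalls() {
    Body b{10.0f, 0.0f};
    stepBody(b, 0.1f);
    VTEST(b.height < 10.0f);
    VTEST(b.velocity < 0.0f);
}

// Unit test 2: a body must never end up below the ground plane.
void testBodyStopsAtGround() {
    Body b{1.0f, 0.0f};
    for (int i = 0; i < 100; ++i) stepBody(b, 0.1f);
    VTEST(b.height == 0.0f);
    VTEST(b.velocity == 0.0f);
}

// Runs both tests and returns the number of failed VTEST checks.
int runPhysicsTests() {
    g_failedChecks = 0;
    testBodyFalls();
    testBodyStopsAtGround();
    return g_failedChecks;
}
```

In a real framework, each test function would be registered with a test class instead of being called by hand, and the set-up code (here just constructing a Body) would live in the initialization method of that class.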
What should be tested?
Pragmatism is a virtue when it comes to deciding what to test. Usually, it doesn't make sense to write unit tests for functionality with minimal complexity, such as the getter and setter methods for individual properties of a class. In order for automated tests to pay off, the code to be tested should have a certain probability of producing incorrect results, like, for instance, a method that casts a ray through a game level and returns whether this ray intersects any level geometry (line of sight test). Such a test would then compare the returned result with the expected result provided by the test author.
A recurring question is whether tests should be written only against the public interface of a class (so-called black-box tests), or whether they should take the inner workings of a class into account by also covering private members (white-box tests). While black-box tests are usually somewhat coarser than white-box tests (after all, they can only check the final results of an operation, not internal intermediate states), they are significantly less sensitive to modifications of the tested code. The line of sight function mentioned earlier may undergo significant internal changes up to a complete re-write (e.g. because the original version simply was not fast enough), but the results it returns remain the same. In such a case, a white-box test almost always has to be re-written or modified along with the tested code, whereas a black-box test could immediately be used to check whether the modified code still produces the same results. Thus, we have found it beneficial to only include the public members of a class in automated tests, since in most cases, the inner workings of a class change more frequently than its external interface.
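To illustrate the black-box approach, here is a small sketch of a line of sight test on a toy 2D grid. The Level type, the hasLineOfSight function, and the test level are all assumptions made for this example; a real engine would cast rays through 3D geometry. The level is assumed to be rectangular.

```cpp
#include <cassert>
#include <cstdlib>
#include <string>
#include <vector>

// Hypothetical 2D level: '#' is solid geometry, '.' is free space.
using Level = std::vector<std::string>;

// Public interface under test: returns true if no solid cell lies on the
// grid line from (x0, y0) to (x1, y1). This version walks the line with
// Bresenham's algorithm, but the tests below never rely on that detail.
bool hasLineOfSight(const Level& level, int x0, int y0, int x1, int y1) {
    const int rows = static_cast<int>(level.size());
    assert(rows > 0);
    const int cols = static_cast<int>(level[0].size());
    assert(x0 >= 0 && x0 < cols && y0 >= 0 && y0 < rows);
    assert(x1 >= 0 && x1 < cols && y1 >= 0 && y1 < rows);

    const int dx = std::abs(x1 - x0), dy = -std::abs(y1 - y0);
    const int sx = x0 < x1 ? 1 : -1, sy = y0 < y1 ? 1 : -1;
    int err = dx + dy;
    while (true) {
        if (level[y0][x0] == '#') return false;      // hit level geometry
        if (x0 == x1 && y0 == y1) return true;       // reached the target
        const int e2 = 2 * err;
        if (e2 >= dy) { err += dy; x0 += sx; }
        if (e2 <= dx) { err += dx; y0 += sy; }
    }
}

// Black-box test: compares returned results with hand-written expectations.
bool lineOfSightTestsPass() {
    const Level level = {
        ".....",
        "..#..",
        ".....",
    };
    bool ok = true;
    ok = ok && hasLineOfSight(level, 0, 0, 4, 0);   // open corridor
    ok = ok && !hasLineOfSight(level, 0, 1, 4, 1);  // wall in the way
    ok = ok && !hasLineOfSight(level, 4, 1, 0, 1);  // same ray, reversed
    ok = ok && hasLineOfSight(level, 0, 2, 4, 2);   // open corridor
    return ok;
}
```

Because the test only calls hasLineOfSight and checks its return value, the body of the function could be replaced by a faster implementation without touching the test.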
In many cases, especially in the field of game development, it is not feasible to compare results against data manually provided by the test author. For instance, if a collision detection routine computes intersection points with complex geometry, manually providing reference data for the tests is hardly an option. Instead, test results can be compared against previously generated data from earlier versions of your code, a practice that is also known as "regression testing". This reference data has to be reviewed by the test author, for instance using a simple visualization of the colliding objects, and once it has been approved, it can continuously be used for testing. This way, automated tests help you ensure that the new (e.g. optimized) version of your code still produces the same results as previous implementations.
For functional tests of code that generates highly complex output data, such as a game's rendering engine, regression testing is often the only feasible way of implementing automated tests. In the case of the Vision rendering engine, we ended up generating platform-specific reference images from all visual tests. Whenever the automated tests are run, the rendered images are compared against the reference images pixel by pixel, and if the images differ, the test fails. In order to keep the memory impact of the reference images reasonably low, you can bind comparison snapshots to certain events in the tests.
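The pixel-by-pixel comparison described above might be sketched as follows. The Image and ImageDiff types and all function names are assumptions for illustration; how the rendered frame is captured and how reference images are stored on disk is left out.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical in-memory image: tightly packed 8-bit RGB triples.
struct Image {
    int width = 0;
    int height = 0;
    std::vector<std::uint8_t> rgb;  // width * height * 3 bytes
};

struct ImageDiff {
    bool sameSize;                // false if the dimensions differ
    std::size_t differingPixels;  // only meaningful if sameSize is true
};

// Compares a rendered image against its reference, pixel by pixel.
ImageDiff compareImages(const Image& rendered, const Image& reference) {
    if (rendered.width != reference.width ||
        rendered.height != reference.height) {
        return {false, 0};
    }
    assert(rendered.rgb.size() == reference.rgb.size());
    std::size_t differing = 0;
    for (std::size_t i = 0; i + 2 < rendered.rgb.size(); i += 3) {
        if (rendered.rgb[i] != reference.rgb[i] ||
            rendered.rgb[i + 1] != reference.rgb[i + 1] ||
            rendered.rgb[i + 2] != reference.rgb[i + 2]) {
            ++differing;
        }
    }
    return {true, differing};
}

// The regression test passes only if every pixel matches exactly.
bool matchesReference(const Image& rendered, const Image& reference) {
    const ImageDiff diff = compareImages(rendered, reference);
    return diff.sameSize && diff.differingPixels == 0;
}

// Helper for building test images filled with a single color.
Image makeSolidImage(int width, int height, std::uint8_t r, std::uint8_t g,
                     std::uint8_t b) {
    assert(width > 0 && height > 0);
    Image img;
    img.width = width;
    img.height = height;
    img.rgb.reserve(static_cast<std::size_t>(width) * height * 3);
    for (int i = 0; i < width * height; ++i) {
        img.rgb.push_back(r);
        img.rgb.push_back(g);
        img.rgb.push_back(b);
    }
    return img;
}
```

Returning the number of differing pixels rather than a plain yes/no makes it easier to report how badly a test failed, although the pass criterion itself stays strict.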
Such visual regression tests have the advantage that even minor visual errors frequently overlooked in manual tests never remain unnoticed. Unless they know the scene extremely well, few people will realize that a shadow or a single object is missing in a complex scene, or that the red and blue values of a light source's color are swapped. Regression tests, however, will almost certainly detect bugs like these.
In any case, it is important that generating the reference data for a regression test is an automated process. The reference data may be platform-specific, especially when it comes to rendering output, so it might have to be generated multiple times, even more so when there may still be changes in the rendering pipeline that cause intended differences in the rendered images. In order not to discourage developers from writing regression tests, they should be able to create new reference data ideally by simply clicking a button in the test framework's user interface.
How everything fits together
For almost all applications - games included - a complete test suite consists of both unit and regression tests. Unit tests are suitable for low-level functionality and base libraries and ensure that you have a solid foundation to build higher-level code on. Regression tests can in turn be used to perform integrated functional checks of higher-level features. As a result, you can refactor or optimize complete functional groups of your game or engine code, and you will immediately notice if something breaks during the process, since the regression tests will fail. Furthermore, failing unit tests will often give you a rather precise indication of what actually goes wrong.
Since it is always beneficial to know how much of your code is actually covered by the automated tests you've written, you may want to use a code coverage tool such as BullseyeCoverage or AQtime. A code coverage analysis tells you which parts of the code have actually been called, and thus also provides hints about "holes" in the test suite. The question of how high test coverage should be cannot be answered easily, though, since it largely depends on the code to be tested. Trivial methods do not have to be covered by automated tests, and the same naturally goes for pure debug functionality. Also, almost all projects contain "dead" code that is never called, and such code naturally also doesn't have to be tested. In total, the real-world game and middleware projects with automated tests that we have seen had a test coverage of between 55 and 70 percent.
Writing test-friendly code
Admittedly, automated tests are not equally easy to implement for all types of code. As far as unit tests are concerned, a strictly object-oriented, modular design with separate functionality encapsulated in separate classes significantly facilitates testing. The more information a class needs from the outside world, the more work it is to write unit tests for that class. Also, excessive usage of the "friend" modifier in C++ can make it difficult or even impossible to write (black-box) unit tests for a class.
It is always best to keep testability in mind while writing the code. Making code testable later on in the development process is still possible, but it can be a rather tedious task, as it sometimes requires quite a bit of restructuring. When it comes to games, there are a number of important aspects that should be considered when developing testable code:
- Allow regression tests to rely on deterministic behavior. For instance, a pathfinding system that uses randomness to make the decisions of characters less predictable could provide a public method for initializing the seed. This method could then be used by the tests to ensure that the characters always take the same path.
- Similarly, avoid frame-rate dependencies in regression tests; otherwise, physics objects or rendering output may differ from previously generated data, especially if that data was generated on a different machine or with a different build configuration (e.g. debug vs. release). One way to achieve this is to run rendering and simulation loops with a constant virtual frame rate during the automated tests.
- Code that heavily relies on user input, such as an in-game editing system or production tools, is usually rather difficult to test. In such cases, a strict separation of UI and logic code can make testing easier: In our production tools, for instance, every user action in the UI executes one or more simple script commands. A set of script commands can then be used to produce an exact imitation of what the user originally did. A test can simply execute this code and compare the result (e.g. exported file, scene geometry) against existing reference data.
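The first two points can be sketched together: a seedable random number generator plus a constant virtual frame time. The Wanderer class and the simulate function are hypothetical examples, not part of any engine API.

```cpp
#include <cassert>
#include <random>
#include <vector>

// Hypothetical character that wanders randomly along one axis.
class Wanderer {
public:
    explicit Wanderer(unsigned seed) : rng_(seed) {}

    // Public seeding method so that tests can force a known sequence.
    void setSeed(unsigned seed) { rng_.seed(seed); }

    // Advances the character by dt seconds in a randomly chosen direction.
    void update(float dt) {
        assert(dt > 0.0f);
        // Map the raw generator output to a direction of -1, 0 or +1.
        const int direction = static_cast<int>(rng_() % 3u) - 1;
        position_ += static_cast<float>(direction) * speed_ * dt;
    }

    float position() const { return position_; }

private:
    std::mt19937 rng_;
    float position_ = 0.0f;
    float speed_ = 2.0f;  // units per second
};

// Runs the simulation with a constant virtual frame rate of 60 Hz, so the
// result does not depend on how fast the test machine actually is.
std::vector<float> simulate(unsigned seed, int frames) {
    const float fixedDt = 1.0f / 60.0f;
    Wanderer wanderer(seed);
    std::vector<float> path;
    path.reserve(static_cast<std::size_t>(frames));
    for (int i = 0; i < frames; ++i) {
        wanderer.update(fixedDt);
        path.push_back(wanderer.position());
    }
    return path;
}
```

The sketch uses the raw output of std::mt19937 on purpose: its sequence for a given seed is fixed by the C++ standard, whereas distribution classes such as std::uniform_int_distribution may produce different values on different standard library implementations, which would undermine reference data shared across platforms.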
GUI capturing tools are another option for user interface testing, but we generally don't recommend them. User interfaces tend to change frequently, and since moving a button by a few pixels may already invalidate previously captured user input, tests using GUI capturing may hinder your workflow rather than support it.
Common concerns regarding tests: Do we really save time?
In most cases where a development team is about to introduce automated tests into their development process, there are at least a few people in the team who are skeptical about it. After all, implementing automated tests takes time that could otherwise be spent working on game or engine code. In our experience with automated testing inside and outside of the games industry, the additional time that a team spends writing test code indeed amounts to around 30 percent of the total implementation effort. At first glance, this might seem like a huge expense in time and money; however, you have to count this against the time saved by not having to perform the same manual tests again and again.
While automated tests usually mean an investment in the beginning, they pay off later in the development process. Most of the changes made to existing code, including bug fixes, have a certain potential for side effects that cause other functionality in a game to break. Therefore, it would theoretically be necessary to thoroughly test all potentially affected parts of the code whenever a change is made. Automated tests can perform this validation as often as you want without any user interaction, so they save time throughout the whole development process. What is more, automated tests encourage developers to optimize and improve existing code, since they have a simple and quick way of finding out whether the modified code still works correctly.
In our experience, the introduction of automated tests helps developers write more stable and reliable code. It provides early feedback, which is usually highly appreciated even by team members who were skeptical about automated tests at first, and it leads to bugs being discovered earlier in the production of a game. Since the pressure and workload on developers tends to increase as the project approaches release, finding and removing bugs early avoids additional stress in the most critical phase of development.
During the development of the Vision engine, we collected some data to monitor the effectiveness of our automated tests in the improvement of code stability. When the first version of the engine was released in early 2001, we relied entirely on manual tests, and even though new versions were thoroughly checked, our customers reported more than 100 issues every month in our online support database. In September 2001, we started implementing automated tests for the existing engine functionality, and also added tests for all new features that were implemented. As a result, the number of issues reported each month dropped to a fraction of the initial value (now about five to ten), even though there are now six times as many companies working with the technology, and even though development has progressed at a rather constant pace.
Support issues and number of automated tests in the Vision engine from 2001 to 2004
Admittedly, these figures simply indicate a (negative) correlation between the number of unit tests and number of support issues per month, and don't have to be interpreted as causality. Certainly, our experience in developing robust code has grown from 2001 to 2004, and the size of the development team also varied within this time frame. However, the differences are big enough to support the notion that at least part of the gain in stability may be attributed to the introduction of automated tests.
Limitations of automated tests in Games
As beneficial as automated tests are, there still are aspects of game development that don't lend themselves well to automated tests. Naturally, it is difficult to test whether a game is well-balanced, and it's probably impossible to write an automated test which analyzes whether a game is fun to play or whether it looks good. In the course of the last few years, we have set up some internal guidelines for writing automated tests, the most important of which are:
- Concentrate on the most important (i.e., the most complex and most frequently used) modules when introducing automated tests. Start implementing automated tests where they are most likely to provide a benefit, for instance because they're likely to fail, or because they help you perform a refactoring task without breaking anything.
- Focus on testing the different subsystems of your application whenever a higher-level functional test doesn't seem to be possible. For example, you might not be able to automatically verify that the complete AI system works properly, but it is entirely possible to test whether a monster reaches the state "surrendered" when its damage exceeds a certain value.
- Use stress tests to verify the robustness of your code. If your game runs stably under extreme conditions - for example, when 2000 monsters are spawned and destroyed every second, when 500 physics objects are simultaneously thrown into a scene, or when a map is reloaded 200 times in a row - it is also less likely to break when players try something unusual.
- Write test cases for bugs before you start fixing them. Having such test cases will make sure that bugs won't re-emerge in future versions of the game.
- Regression tests - for instance, image or state comparisons - are easiest to maintain when they use special test scenes rather than production maps. If you believe that the production data relevant for a test may still be changed frequently, it is usually better to use a small test scene instead. Otherwise, there is a certain risk of reference data having to be generated and reviewed so frequently that the development team loses motivation to execute the automated tests.
- Keep your tests as simple as possible instead of trying to achieve extreme test coverage values. Setting up automated tests is a long-term project where maintainability and extensibility are crucial factors.
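Two of these guidelines, the subsystem test for the "surrendered" state and the spawn-and-destroy stress test, could look roughly like this. The Monster class, its damage threshold, and the test functions are hypothetical and only serve to illustrate the idea.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

enum class MonsterState { Attacking, Surrendered };

// Hypothetical AI entity with a single state transition.
class Monster {
public:
    explicit Monster(int surrenderDamage) : surrenderDamage_(surrenderDamage) {
        assert(surrenderDamage >= 0);
    }

    // Accumulates damage; the monster surrenders once the accumulated
    // damage exceeds the threshold.
    void applyDamage(int amount) {
        assert(amount >= 0);
        damage_ += amount;
        if (damage_ > surrenderDamage_) state_ = MonsterState::Surrendered;
    }

    MonsterState state() const { return state_; }

private:
    int surrenderDamage_;
    int damage_ = 0;
    MonsterState state_ = MonsterState::Attacking;
};

// Subsystem test: damage equal to the threshold must not trigger the
// transition, damage above the threshold must.
bool testMonsterSurrenders() {
    Monster monster(100);
    monster.applyDamage(100);
    const bool stillAttacking = monster.state() == MonsterState::Attacking;
    monster.applyDamage(1);
    return stillAttacking && monster.state() == MonsterState::Surrendered;
}

// Stress test: spawns and destroys a batch of monsters once per simulated
// second and checks that the bookkeeping stays consistent throughout.
bool stressTestSpawnAndDestroy(int seconds, int monstersPerSecond) {
    std::vector<Monster> alive;
    for (int s = 0; s < seconds; ++s) {
        for (int i = 0; i < monstersPerSecond; ++i) alive.emplace_back(100);
        if (alive.size() != static_cast<std::size_t>(monstersPerSecond)) {
            return false;
        }
        alive.clear();
    }
    return alive.empty();
}
```

In a real game the stress test would go through the engine's own spawning and resource management code, which is where leaks and crashes under load tend to show up.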
In general, "low-level" code such as math, collision detection, and even rendering is easier to test in an automated fashion than gameplay, and even a game with a comprehensive suite of automated tests will still have to go through the hands of QA people. However, the focus of the QA department will most likely shift from technology- to gameplay-related defects and shortcomings. Instead of "I have these distorted triangles on screen whenever my character turns", the issue reports may contain statements like "It is possible to get into room A, but you can't get out of it again since the crates in front of the ventilation shaft are too high."
When employing automated tests in the development of a complex software project, you will soon realize that the execution of automated tests takes time - up to a couple of hours in some real-world projects we've seen. If developers have to execute these tests on their development machines, they will quickly become reluctant to actually run the automated tests, since it might stall their work while the tests are running. Of course, tests that aren't executed are absolutely worthless.
The solution to this problem is to set up one or (preferably) multiple computers dedicated to executing the automated tests. Such a machine regularly polls the version control system (e.g., Subversion, CVS, Perforce) for changes in the respective repositories, and if newly committed changes are found, the code is checked out and compiled, and the tests are run. Finally, the system sends an email with a report containing the test results to the developer who has performed the last commit operation.
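The control flow of one such polling cycle can be sketched as below. All names are hypothetical; the actual interaction with the version control system, the compiler, and the mail server is hidden behind callbacks, since it depends entirely on the tools in use.

```cpp
#include <cassert>
#include <functional>
#include <string>

// Hypothetical hooks into the surrounding infrastructure.
struct CiHooks {
    std::function<std::string()> latestRevision;     // poll version control
    std::function<bool(const std::string&)> build;   // check out and compile
    std::function<bool(const std::string&)> runTests;
    // Receives the revision and a short result message for the report.
    std::function<void(const std::string&, const std::string&)> sendReport;
};

enum class CycleResult { NoChanges, BuildFailed, TestsFailed, Passed };

// Runs one polling cycle. lastSeenRevision is updated whenever a new
// revision has been processed, so the next cycle skips it.
CycleResult runCiCycle(const CiHooks& hooks, std::string& lastSeenRevision) {
    assert(hooks.latestRevision && hooks.build && hooks.runTests &&
           hooks.sendReport);

    const std::string revision = hooks.latestRevision();
    if (revision == lastSeenRevision) return CycleResult::NoChanges;
    lastSeenRevision = revision;

    if (!hooks.build(revision)) {
        hooks.sendReport(revision, "build failed");
        return CycleResult::BuildFailed;
    }
    if (!hooks.runTests(revision)) {
        hooks.sendReport(revision, "tests failed");
        return CycleResult::TestsFailed;
    }
    hooks.sendReport(revision, "all tests passed");
    return CycleResult::Passed;
}
```

A dedicated test machine would call runCiCycle in a loop with a pause between iterations. Keeping the infrastructure behind callbacks also means the cycle logic itself can be covered by an automated test.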
This concept of fully automated and reproducible build and testing processes which typically run multiple times per day is called "continuous integration" - once again a term that has its roots in eXtreme Programming. In order to facilitate continuous integration, there are open-source tools like CruiseControl or Anthill which take care of the interaction with the version control system and additionally provide an interface for build tools such as Ant. Using these tools, a custom continuous integration system can quite easily be set up.
We've found that setting up dedicated CI servers smoothed the development processes in our organization significantly, and indeed gave developers more time to work productively. Also, since developers no longer had to take care of executing the automated tests themselves, we could be sure that the tests were always run and that failures did not go unnoticed, since faulty code would result in the CI systems complaining to the responsible developer (and the project manager) by email.
Our positive experience with the introduction of automated tests and continuous integration fuelled our search for further processes in game and tools development that lend themselves to automation. For instance, a CI server nowadays automatically generates our CHM (Windows help file) documentation from a Wiki every time a modification in the Wiki is detected. Furthermore, creating distributable copies of an arbitrary software product can easily be automated using Ant and CruiseControl. This way, creating a full distributable copy of the most recent code (or the last stable tag of it) becomes a matter of minutes.
The most recent addition to our collection of automated processes is automated performance tests. They are based on the same test framework as the regular unit and regression tests, but instead of checking for correctness, they measure the engine's performance and compare it to the best previous runs of the same test on the same machine (the results of an arbitrary number of system configurations can be stored in a version-controlled XML file). If the current result is significantly slower than the reference result, the test fails; if the current result is better than the reference result, it becomes the new reference result.
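The pass/fail rule described above can be sketched as follows. The type and function names are hypothetical, and the 10 percent tolerance used as the threshold for "significantly slower" is an assumption chosen for illustration. Storing the reference values per machine in an XML file is omitted.

```cpp
#include <cassert>

enum class PerfOutcome { Passed, NewReference, Failed };

// Best previous result of one test on one machine, in seconds.
struct PerfReference {
    double bestSeconds;
};

// Compares a new measurement against the stored reference.
// - Slower than the reference by more than the tolerance: the test fails.
// - Faster than the reference: the measurement becomes the new reference.
// - Anything in between: the test passes and the reference is kept.
PerfOutcome checkPerformance(PerfReference& reference, double measuredSeconds,
                             double tolerance = 0.10) {
    assert(measuredSeconds > 0.0 && reference.bestSeconds > 0.0);
    assert(tolerance >= 0.0);

    if (measuredSeconds > reference.bestSeconds * (1.0 + tolerance)) {
        return PerfOutcome::Failed;
    }
    if (measuredSeconds < reference.bestSeconds) {
        reference.bestSeconds = measuredSeconds;
        return PerfOutcome::NewReference;
    }
    return PerfOutcome::Passed;
}
```

Because the reference only ever moves towards faster results, a run that was acceptable before an optimization can fail after it. That ratchet effect is the source of the pressure to keep code fast.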
Performance tests are actually a special type of regression test. They allow us to make sure that, when engine code is modified during refactoring processes, these changes never cause any part of the engine to become less efficient. This creates a certain pressure to keep code fast, and also makes sure that when optimizations are performed, you don't run into a scenario where other parts of the code suddenly become slower.
In our experience, the introduction of automated tests and continuous integration makes development teams more efficient and results in more reliable and, quite simply, often better software. Additionally, it reduces the pressure and workload on development teams by reducing the effort for manual testing, and allows bugs to be found earlier in the development process.
Certainly, automated tests alone won't make your game a hit. But almost as certainly, they will make life easier for developers, artists, project leaders, producers and even players.
- Unit Tests: http://c2.com/cgi/wiki?UnitTest
- Standish Group, Inc. (2001), Extreme Chaos
- By the Books: Solid Software Engineering for Games (GDC 2002 roundtable)
- Ant: http://ant.apache.org