After several iterations with my test group, my graphs started to align to the bell curve I was looking for. It was time to ship the game, and I decided to leave the metrics system in place. I wondered if the data I collected from live users would look different from the data produced by my test group. There was only one way to find out.
Of course, any time an app reports data back to a server, it's best to let the user know about it. The first time Replica Island is launched, a welcome message appears that details the latest game improvements. That message also informs the user that anonymous, non-personal play data will be uploaded to a remote server in order to improve the game, and that players who do not wish to participate may turn the reporting system off in the options menu.
This approach seemed like the best solution: though the code is open source and anybody can look at the content of the data packet itself (and I ensured that nothing about the metrics data can be tied to any specific user or device), allowing users to opt-out gives them an opportunity to say "no thanks."
By comparing my Android Market installs with the number of unique users reporting in, it looks like less than 20 percent of my users chose to opt out of metrics disclosure.
As a result, I have a huge amount of data now -- over 14 million data points, close to a gigabyte of event information generated by my user base (which, as of this writing, is about 1.2 million players).
In fact, the volume of data broke my data processing tools pretty quickly; I have a snapshot of statistics from the first 13,000 players (which I have published on the Replica Island website), but after that, a lot of my tools failed. The good news is the first 13,000 players produced aggregate data that was very similar to the smaller test group, which probably means that the test group results can be applied to much larger groups of players.
I have been extremely satisfied with the event reporting system in Replica Island. For very little work, almost no cost (the server back end that records events costs less than an Xbox Live account), and using only two types of events, I was able to quickly and effectively identify areas where players were having trouble. Furthermore, once I started collecting this data, I was able to compare the aggregate result of my metrics between versions, which made it easier to see if my design changes were effective.
Using PHP and MySQL as my back end server language was a good choice; the actual recording of events is so trivial that I'm sure any language would have worked, but with PHP, the whole server took less than 30 minutes to put together.
Using a separate thread to report events from the game was a good move as well. I didn't want any sort of UI to block HTTP requests, and moving the web communication to a separate thread made sense, but I initially had some concerns about overhead. I needn't have worried; the overhead is so small, I can't even get it to show up in my profiler.
Finally, keeping the system as simple as possible was a really positive decision. I considered a lot of potential event candidates, but for my game, tracking player death and level completion provided more than enough information. More statistics would have complicated the processing of the data, and possibly made it harder to reduce the feedback to a concise view. Now that I've had some experience with automatic metrics reporting, I'll probably increase the volume of data that I send back in the future, but starting simple was definitely a good move.
Not everything about the event reporting system worked out well, however. I made a few decisions that ultimately turned out poorly, or just wasted time.
Bitmap processing was particularly painful in PHP. I did all of the heat map generation in PHP, but I should have just written something that could run locally instead of on a web server. I ran into a number of bugs in the PHP GD interface (compositing bitmaps with alpha is pretty broken), and ended up having to reduce the size of my level art images in order to do the processing.
For this article, I rewrote this tool using Python and ImageMagick, and the results are far superior. I've provided the code for this implementation, which can be found at the official Game Developer magazine website.
Finally, though this data tells me all about where players die and how long it takes them to complete levels, it doesn't help me identify shelf moments that are not related to death. I ended up shipping with a few key level design failures that my metrics never caught; in the most egregious case, players get stuck at a puzzle where they do not understand how to progress, and end up giving up before they complete the level.
This never shows up in my metrics because an event condition is never reached; I only learned about it when users started complaining about being stuck in the same spot. Automatic metrics are super-useful, but they can't show you a complete view of the game. In my case, the metrics were good at finding problematic level layouts but were particularly ineffective at identifying design failures related to rule communication.