Gamasutra: The Art & Business of Making Games
A/B Tests for Analysing LiveOps. Part 2

by Eugene Matveev on 04/06/20 10:34:00 am   Featured Blogs

The following blog post, unless otherwise noted, was written by a member of Gamasutra’s community.
The thoughts and opinions expressed are those of the writer and not Gamasutra or its parent company.


This is the next article in my series on A/B testing for LiveOps. In the previous article, I covered ideas for A/B tests, deteriorating tests, and A/A and A/A/B tests. This time I will explain how to choose the right metrics, generate options, and prepare a sample.

Metric Choice

It is better to use metrics that support cohort analysis, that is, metrics that reflect the quality of the project and are tied to a specific registration date. For example:

  • Conversions of various types (for example, what percentage of users made a first payment by the end of their first week, or what percentage of users made a fifth or later payment by the end of their first month).

  • D1, D7, D30 retention.

  • Monetization metrics, e.g. ARPU, first payment, etc.

  • Any other custom metrics that you want to influence with your test.
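As an illustration of what "tied to the registration date" means in practice, here is a minimal sketch of a day-N retention calculation. The function name `day_n_retention`, the data layout, and the sample data are assumptions made for this example, not something taken from a particular analytics system. Retention is counted here as "active exactly N days after install", which is one common definition among several.

```python
from datetime import date


def day_n_retention(installs, sessions, n):
    """Share of the cohort that was active exactly n days after install.

    installs: dict mapping user_id -> install date (the cohort)
    sessions: iterable of (user_id, activity_date) pairs
    """
    if not installs:
        return 0.0
    returned = {
        user
        for user, day in sessions
        if user in installs and (day - installs[user]).days == n
    }
    return len(returned) / len(installs)


# Hypothetical cohort of four users.
installs = {
    "u1": date(2020, 4, 1),
    "u2": date(2020, 4, 1),
    "u3": date(2020, 4, 2),
    "u4": date(2020, 4, 2),
}
sessions = [
    ("u1", date(2020, 4, 2)),  # u1 comes back on day 1
    ("u1", date(2020, 4, 8)),  # u1 comes back on day 7
    ("u3", date(2020, 4, 3)),  # u3 comes back on day 1
    ("u4", date(2020, 4, 5)),  # u4 comes back on day 3 only
]

d1 = day_n_retention(installs, sessions, 1)
d7 = day_n_retention(installs, sessions, 7)
print(f"D1 retention: {d1:.0%}, D7 retention: {d7:.0%}")
```

Because every user is measured relative to their own install date, cohorts that registered on different days can be compared directly.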

Take, for example, the cumulative ARPU metric:

[Chart: Cumulative ARPU]

If we look at the day 7 cumulative ARPU on this chart, we can see how much money an average user pays during their first 7 days in the game. This is a very good metric because it takes retention into account indirectly and monetization directly: users who leave early stop paying, yet they still count in the average. As a result, this metric clearly reflects a qualitative change in the product. If the change we introduce is at least somewhat related to monetization, then this metric can be used for A/B testing.
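The calculation behind such a curve can be sketched as follows. This is a minimal example under assumed conventions: day 0 is the install day, `cumulative_arpu` is a hypothetical helper, and the payments are made-up numbers.

```python
def cumulative_arpu(cohort_size, payments, max_day):
    """Cumulative revenue per cohort user for each day 0..max_day.

    payments: iterable of (days_since_install, amount) for users in the cohort.
    The divisor is the whole cohort, including users who never paid or
    who already left the game.
    """
    revenue_by_day = [0.0] * (max_day + 1)
    for day, amount in payments:
        if 0 <= day <= max_day:
            revenue_by_day[day] += amount

    curve = []
    running_total = 0.0
    for revenue in revenue_by_day:
        running_total += revenue
        curve.append(running_total / cohort_size)
    return curve


# Hypothetical cohort of 100 users and five payments.
payments = [(0, 2.0), (0, 3.0), (2, 5.0), (6, 10.0), (9, 4.0)]
curve = cumulative_arpu(cohort_size=100, payments=payments, max_day=7)
print(f"Cumulative ARPU through day 7: {curve[7]:.2f}")
```

The payment made on day 9 falls outside the window and is ignored, and the curve never decreases, which is what makes two test groups easy to compare at a fixed day.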

Option Generation

It may seem that an A/B test can only have two options, A and B, but in fact there can be many more, because multivariate testing also exists. For example, we want to run two tests: a price test and a text test. The price is $2 or $3 and the text is “buy now” or “buy”. We can run these tests either simultaneously or sequentially (first one, then the other). Running them simultaneously is multivariate testing: we have two price options and two text options, so we multiply them, get 4 combinations, and check all of them at the same time. This is convenient only while the number of options stays small. If each factor has many values, for example 10 color options and 10 button size options, we get 100 combinations in total. In this case, there will not be enough people in each group to achieve statistical significance (read about it in the next article).
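The multiplication of options can be shown with a short sketch. The helper `build_variants` and the figure of 5,000 available users are assumptions chosen for illustration only.

```python
from itertools import product


def build_variants(**factors):
    """Return every combination of the given factor values as a dict."""
    names = list(factors)
    return [dict(zip(names, combo)) for combo in product(*factors.values())]


# The price and text example: 2 x 2 = 4 variants.
variants = build_variants(price=[2, 3], text=["buy now", "buy"])
for variant in variants:
    print(variant)

# 10 colors x 10 button sizes = 100 variants.
big_test = build_variants(color=range(10), size=range(10))

# Hypothetical traffic: 5,000 new users available for the test.
available_users = 5000
users_per_variant = available_users // len(big_test)
print(f"{len(big_test)} variants, {users_per_variant} users in each")
```

With the same traffic, the 2 x 2 test would get 1,250 users per variant, which shows how quickly group sizes shrink as factors are added.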

Sample Preparation

The most important thing when running an A/B test is to use a group of people who have never encountered the changed functionality. Therefore, tests are often run on a tutorial or on the download form, that is, on people who are installing the game/application for the first time. You can also run an A/B test on experienced users who, for example, have never visited the in-game store. But most A/B tests are done on new users.

When planning an A/B test, we need to understand:

  • How many groups do you plan to have?

  • What level of significance do you want to achieve (the lower the risk of a wrong conclusion you are willing to accept, the more people the test needs)?

  • What kind of change are you making?

This is a rough formula that I do not recommend using:

But it has an interesting property that is worth paying attention to. The numerator contains the squared standard deviation (the less stable the metric is, the more users we need to run the test), and the denominator contains the change that we want to detect. Thus, the smaller the change we want to detect, the more users we need, and the requirement grows with the square of that ratio. For example, if we have 30% retention and expect a Live Ops event to raise it to 40%, then a relatively small number of users is enough to run the test. If retention was 30% and we want to check whether it became 30.1%, with a statistically significant result, then we will need far more users.
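The relationship described above can be illustrated with a standard textbook approximation for comparing two groups, n ≈ 2 · (z_alpha + z_beta)² · σ² / Δ² users per group. This is not necessarily the formula the author had in mind, only one with the same shape: squared standard deviation on top, squared change on the bottom. The function name, the 5% significance level, the 80% power, and the use of p · (1 − p) as the variance of a retention rate are all assumptions of this sketch.

```python
from math import ceil, sqrt
from statistics import NormalDist


def users_per_group(sigma, delta, alpha=0.05, power=0.8):
    """Approximate users needed in each group to detect a change of `delta`.

    sigma: standard deviation of the metric (assumed equal in both groups)
    delta: absolute change we want to be able to detect
    alpha: two-sided significance level (assumed default)
    power: probability of detecting a real change (assumed default)
    """
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_beta = NormalDist().inv_cdf(power)
    return ceil(2 * (z_alpha + z_beta) ** 2 * sigma ** 2 / delta ** 2)


# Retention is a proportion, so its variance is roughly p * (1 - p).
baseline = 0.30
sigma = sqrt(baseline * (1 - baseline))

big_lift = users_per_group(sigma, delta=0.10)     # 30% -> 40%
small_lift = users_per_group(sigma, delta=0.001)  # 30% -> 30.1%
print(f"30% -> 40%:   about {big_lift} users per group")
print(f"30% -> 30.1%: about {small_lift} users per group")
```

Making the target change 100 times smaller multiplies the required sample by roughly 10,000, which is the squared dependence the paragraph points out. For real planning, a dedicated power-analysis tool is a safer choice than this approximation.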

Most importantly, the assignment of users to groups must be random. It is fundamentally wrong to send users who, for example, came on Wednesday into one group and users who came on Thursday into another. These are different users: they joined the project on different days and are motivated differently. The right way to do it is to combine both of them in one segment and randomly distribute them into two groups.
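A minimal sketch of this kind of random split is shown below. The helper `assign_groups`, the user ID format, and the fixed seed are assumptions for the example; a production system would typically persist the assignment so that each user always sees the same variant.

```python
import random


def assign_groups(user_ids, group_names, seed=None):
    """Shuffle all users together, then deal them out evenly across groups."""
    rng = random.Random(seed)  # a seed makes the split reproducible
    shuffled = list(user_ids)
    rng.shuffle(shuffled)
    return {
        user: group_names[index % len(group_names)]
        for index, user in enumerate(shuffled)
    }


# Hypothetical users who joined on two different days.
wednesday_users = [f"wed_{i}" for i in range(100)]
thursday_users = [f"thu_{i}" for i in range(100)]

# Combine both days into one segment first, then split at random.
assignment = assign_groups(wednesday_users + thursday_users, ["A", "B"], seed=42)

group_a = [user for user, group in assignment.items() if group == "A"]
wednesday_in_a = sum(user.startswith("wed_") for user in group_a)
print(f"Group A: {len(group_a)} users, {wednesday_in_a} of them from Wednesday")
```

Each group ends up the same size and contains a mix of both install days, so a difference between A and B cannot be explained by the day on which users arrived.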

Next Monday I will post an article about interpreting the results.
