Rising from the Ranks: Rating for Multiplayer Games


February 9, 2000

Scores and ranks have been used in arcade and single-player games since the earliest days. Players itch to compare their performance, and they want to know where their peers and equals are. With games like Quake3: Arena and Unreal Tournament geared towards one-on-one and free-for-all deathmatch, the question of proper rating and ranking is bound to return with a vengeance. Let's settle this once and for all.

The Fun Factor

If I were pressed to define what makes an interactive game "fun", I would put my tongue firmly in my cheek and come up with something you could call a "control theory of games": A game can be experienced as "fun" if you can make a difference.

Take this as the "weak" causality of fun. It merely requires the game to be "interactive". In its minimal form it implies that your actions will have some effect on the outcome. In other words: if you might as well be absent, chances are you should leave.

Affecting the chain of game events doesn't necessarily mean that you are winning the match, which leads us to the "strong" causality of fun: A game can be experienced as "fun" only if you can win. The elusive quality we call fun originates from the confidence that you can win the game, and might well have won it - that your actions can indeed affect the chain of game events, maybe enough to let you come out on top. You don't have to win to enjoy the game, but you expect a fighting chance.

In his article "Glory and Shame," Jonathan Baron describes the reality of competitive multiplayer gaming and postulates that public humiliation drives away the casual gamer. Online gaming, especially when implemented as free-for-all deathmatch or one-on-one tournaments, offers "strong fun" only to a few, and little of the "weak fun" to many, if not most. Of course, not everybody can win, and certainly not at the same time. Yet there are means to make the game more fun for more people. Inevitably, any such attempt requires a game to estimate player skill, and either match equally skilled players or compensate for the skill difference.

Estimating player skill is where it all begins, and that's what rating is all about. You are in for a bit of theory here, so I will give you the bottom line right away: if you are doing a competitive multiplayer game without proper rating, then your game design has a potentially fatal flaw.

DOOM introduced "frag" counting, and with it the notion that all players are equal. This, however, is far from true. Competitive multiplayer gaming has yet to attract the casual gamer, to find success beyond the hardcore gaming crowd. One reason is that frag counting encourages the strong player to prey on the weak. Scores create feedback loops: the winning strategy that "solves" frag counting is to find and kill the weakest player, thus giving that player the kind of initiation that will drive them out quickly, no matter how easy your GUI makes it to get on a server and join a game.

Skill estimates based on proper rating are the first step toward changing these inherently imbalanced game mechanics. Better yet, it takes only about a dozen lines of code to add a mathematically sound rating calculation to your game. You can even apply it on a per-session basis without introducing persistent player IDs (see Dynamix' Tribes2 for an example of player ID management), and still improve the situation.

Intermission Screen

Definitions: a score is a simple counter triggered by events, e.g. "kills accomplished" or "total number of items picked up so far". Ranking is a measure by direct comparison: A beats B, thus A ranks above B. The ladder sorts all players based on the outcome of their last direct encounter: your place in the ladder is your ranking.

Rating extends ranking by introducing a scale: A is twice as good as B. Rating also differs from a score in that we no longer simply count events - we weigh them. You will sort players by their rating and derive ranks, but there is a bit more to it: rating attempts to measure your actual ability. Sports and competitive games have used rating mechanisms for decades now. The state of the art pretty much dates back to 1960, when Arpad Elo revised the rating system for chess, and we are following in his footsteps.
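To make the "dozen lines of code" promised above concrete, here is a minimal sketch of an Elo-style rating update in C. The logistic expected-score formula and the 400-point scale follow Elo's chess system; the K-factor of 32 and the starting point of treating a match as win/draw/loss are assumptions for illustration, not values prescribed by this article.

#include <math.h>

/* Expected score of player A against player B, on a 0..1 scale.
   A 400-point rating gap corresponds to roughly 10:1 odds. */
double expected_score(double rating_a, double rating_b)
{
    return 1.0 / (1.0 + pow(10.0, (rating_b - rating_a) / 400.0));
}

/* Update both ratings after a match. result_a is 1.0 for a win,
   0.5 for a draw, 0.0 for a loss, all from A's point of view.
   K = 32 is an assumed update weight: larger K adapts faster but is noisier. */
void update_ratings(double *rating_a, double *rating_b, double result_a)
{
    const double K = 32.0;
    double exp_a = expected_score(*rating_a, *rating_b);

    *rating_a += K * (result_a - exp_a);
    *rating_b -= K * (result_a - exp_a);   /* zero-sum: B loses what A gains */
}

Sorting players by this number gives you ranks; the number itself is the rating.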

From single player game scores, we can learn the properties any successful rating algorithm should have: it must be

  • Reproducible
  • Comprehensive
  • Sufficiently detailed
  • Well balanced

We need reproducible results. Given identical performance, the algorithm should end up with nearly identical scores. Players with comparable ability should end up with comparable rank. Reproducible also implies that the meaning of a given score should not change over time. This might be irrelevant for games with an estimated lifetime of a fiscal quarter, but the shelf life of multiplayer games extends farther and farther.

Our scores also have to be comprehensive. The average player has to instantly and intuitively understand scores. Scores have to match the player's ideas of significance. Common examples like "confirmed kills", "deaths suffered", "shots fired", "hits per shot", and "frags per minute" make sense even without comparing them to other players' results.

Such performance indicators are also detailed enough. Detail and balance issues are raised by multiple scoring, a common problem for single-player games, but also for team multiplayer. Single-player DOOM listed time, kills, items, and secret areas revealed - a combination that allowed players to subscribe to different styles like "fast" or "thorough". Multiplayer DOOM had only one score, the now classic "frag", and we will take this as our starting point. For the pure and simple point-and-click multiplayer FPS shooting game - no teams, no flags - the kill or "frag" is the base unit of measurement.

The whole purpose of scoring is compressing minutes if not hours of gameplay into a single number, for ease of comparison and ease of communication (i.e. boasting). For frag counting, a kill is a kill is a kill. Yet human players are not at all identical. If all frags count equally, we encourage "newbie hunting", a gameflow in which players dominate by preying upon the easiest victims while ignoring their peers - the net result being an utterly imbalanced game. Any capable player who chooses to spare the struggling beginners will be beaten by those who target them specifically.

To change this, we have to assign a value to each single frag, rewarding players who target their peers. In multiplayer deathmatch, the quality of a kill does matter - the skill of your opponents matters. But how do we measure this elusive quality, skill?
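As a first sketch of what "valuing" a frag could mean, one option is to reuse the Elo-style expected score from the earlier sketch: a kill is worth more the less likely it was. This is only an illustration of the direction, not the exact weighting the article arrives at.

/* Sketch: value of a single frag, weighted by how unlikely it was.
   Reuses expected_score() from the Elo sketch above; the 0..1 scale
   is an arbitrary choice for illustration. */
double frag_value(double killer_rating, double victim_rating)
{
    /* Beating someone you were expected to beat is worth little;
       beating someone expected to beat you is worth a lot. */
    return 1.0 - expected_score(killer_rating, victim_rating);
}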

Figure 1. Example skill distribution for 50 players

Jacob's Ladder

Nobody really cares about numbers. The only thing the social animal does care about is rank. Your first attempt at a solution could be a ladder. Players join by challenges, from the bottom, or by choosing an opponent. Each time two players fight, the winner gets the rank of the loser or keeps her own rank, whichever is higher. Draws are ignored. A ladder lacks any hint of how large the difference in ability between two players really is, or any other linear scale of reference to compare players, thus making it hard to determine a worthwhile challenge. There is also no straightforward way two players could have equal rank.

The ladder has a few things going for it, though. The history of the Elo system teaches us that simplicity is an important property. Simple solutions are good solutions. The less detailed a ranking scheme, the fewer free parameters, the less argument between players. The process of match-and-swap offers us a first insight into the true nature of ranking: we are actually trying to solve a minimization problem here, namely an error function like

Error = sum over all players |true rank - current rank|^2

We have no means to measure the true rank except by comparison with another player - in other words, by observing a game. Each actual match is a sample, a data point in our quest for the true player hierarchy. Ranks might be time-dependent (if player abilities change), and our results are blurred by noise (because details we know nothing about may have affected the outcome of a match), but this is as close as we get. We sample (let 'em play) and update our estimates according to the result, hoping that, more often than not, we are getting closer to the truth, and that things will sort themselves out as long as people keep playing all over the board.
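For illustration, the error function above can be written out in a few lines of C. The catch, of course, is that the true ranks are exactly what we cannot measure; the function only exists to make the objective explicit.

/* Sum of squared differences between the (unknown) true ranks and the
   current ladder positions for n players. Illustrative only: in a real
   game the true ranks are never available, only sampled match outcomes. */
double rank_error(const int *true_rank, const int *current_rank, int n)
{
    double error = 0.0;
    for (int i = 0; i < n; i++) {
        double d = (double)(true_rank[i] - current_rank[i]);
        error += d * d;
    }
    return error;
}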

You can see that this minimization task is combinatorial by nature (an assignment problem): all permutations of players on ranks (without repetitions) are possible solutions. This is a well-known, hard problem, and our approach is equally well known, usually called iterative improvement by pairwise exchange:

if ( loser_rank < winner_rank )    /* rank 1 = top of the ladder */
    swap( winner_rank, loser_rank );

We gain nothing if the outcome of the fight matches our expectations (we expect a higher-ranked player to win); otherwise, all we know is that whatever the real ranks are, changing places will at least get us closer. All we care about is who won - it does not matter whether the score was 2:1 frags, 100:99, or 10:1.

It might be necessary to introduce a penalty for ignoring challenges, as well as rematch regulations and random selection of opponents, or other incentives to keep things going. Pairwise exchange means high risk for the high-ranked player, and no risk for a low-ranked player. Restricting challenges to neighboring ranks slows down the dynamics. This means less instability if a game is not too reproducible (random noise affects results), but it will take much longer for the system to reach equilibrium and minimize the error. We will encounter this trade-off again.

Finally, pairwise exchange based on final frag count means that you have to enforce one-on-one gaming. If you try this in a free-for-all situation, your results might turn out to be useless. Ladders are tools for tournaments, not so much for multiplayer crowds. However, ladders might work for more than two players on a per-game basis if you determine winner and loser per event, per frag. Iterative improvement works best if you swap often, to use all the information available and to avoid accumulated errors. The more players participate, the more adequate the ranking - another important lesson. The sorting obtained on a single server can afterwards be used to re-assign the slots in a global ladder as well.
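Here is a minimal sketch of such per-frag pairwise exchange for a single server session, assuming rank 1 is the top of the ladder; the function name, the fixed-size array, and the player indices are illustrative only.

#define MAX_PLAYERS 32

/* rank[p] holds the current ladder position of player p (1 = top). */
static int rank[MAX_PLAYERS];

/* Called once per frag: the killer "won" the exchange, the victim "lost".
   If the victim was ranked above the killer, swap their ladder slots;
   otherwise the result matched expectations and nothing changes. */
void on_frag(int killer, int victim)
{
    if (rank[victim] < rank[killer]) {
        int tmp = rank[killer];
        rank[killer] = rank[victim];
        rank[victim] = tmp;
    }
}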

Figure 2. Example skill distribution for 500 players

