Scores
and ranks have been used in arcade and single-player games from the early
days on. Players itch to compare their performance, and they want to know
where their peers and equals are. With games like Quake3: Arena
and Unreal Tournament geared towards one-on-one and free-for-all
deathmatch, the question of proper rating and ranking is bound to return
with a vengeance. Let's settle this once and for all.
The Fun Factor
If I were
pressed to define what makes an interactive game "fun", I would put my
tongue firmly in my cheek and come up with something you could call a
"control theory of games": A game can be experienced as "fun" if you
can make a difference.
Take this as the "weak" causality of fun. It merely requires the game
to be "interactive". In its minimal form it implies that your actions
will have some effect on the outcome. In other words: if you might as
well be absent, chances are you should leave.
Affecting
the chain of game events doesn't necessarily mean that you are winning
the match, which leads us to the "strong" causality of fun: A game
can be experienced as "fun" only if you can win. The elusive quality
we call fun originates from the confidence that you can, and might have,
won the game - that your actions can indeed affect the chain of game events,
maybe enough to let you come out on top. You don't have to win to enjoy
the game, but you expect a fighting chance.
In his Glory
and Shame, Jonathan Baron describes competitive multiplayer reality,
and postulates that public humiliation drives away the casual gamer. Online
gaming, especially when implemented as free-for-all deathmatch or one-on-one
tournaments, offers "strong fun" only to a few, and little of the
"weak fun" to many, if not most. Of course, not everybody can win, and
certainly not at the same time. Yet, there are means to make the game
more fun for more people. Inevitably, any such attempt requires a game
to estimate player skill, and then either match equally skilled players
or compensate for the difference in skill.
Estimating
player skill is where it all begins, and that's what rating is all about.
You are in for a bit of theory here, so I will give you the bottom line
right away: if you are doing a competitive multiplayer game without proper
rating, then your game design has a potentially fatal flaw.
DOOM
introduced "frag" counting, and with it the notion that all players are
equal. This, however, is far from true. Competitive multiplayer gaming
has yet to attract the casual gamer, to find success beyond the hardcore
gaming crowd. One reason is that frag counting encourages the strong player
to prey on the weak. Scores create feedback loops: the winning strategy
that "solves" frag counting is to find and kill the weakest player, thus
giving them the kind of initiation that will drive them out quickly, no
matter how easy your GUI makes it to get on a server and join a game.
Skill estimates
based on proper rating are the first step toward changing these inherently
imbalanced game mechanics. Better yet, it takes only about a dozen lines of code
to add a mathematically sound rating calculation to your game. You can
even apply it on a per-session basis without introducing persistent player
IDs (see Dynamix' Tribes2
for an example of player ID management), and still improve the situation.
Intermission Screen
Definitions:
A score is a simple counter triggered by events, e.g. "kills
accomplished" or "total number of items picked up so far".
Ranking is a measure by direct comparison: A beats B, thus A ranks
above B. The ladder sorts all players based on the outcome of their
last direct encounter: your place in the ladder is your ranking.
Rating
extends ranking by introducing a scale: A is twice as good as B.
Rating also differs from a score in that we no longer simply count
events - we weigh them. You will sort players by their rating, and derive
ranks, but there is a bit more to it: rating attempts to measure your
actual ability. Sports and competitive games have used rating mechanisms
for decades now. The state of the art pretty much dates back to 1960,
when Arpad Elo revised the rating system for chess, and we are following
in his footsteps.
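For orientation, here is a minimal sketch of the textbook Elo update as it is commonly stated for chess. It is not necessarily the variant a game would end up shipping: the function names are hypothetical, and the 400-point scale and K = 32 are conventional defaults assumed for illustration.

```python
def expected_score(rating_a, rating_b):
    """Estimated probability that A beats B (logistic curve, 400-point scale)."""
    return 1.0 / (1.0 + 10.0 ** ((rating_b - rating_a) / 400.0))


def update_ratings(winner, loser, k=32.0):
    """Return the new (winner, loser) ratings after one decisive result.

    k is an assumed step size: larger values react faster but are noisier.
    """
    surprise = 1.0 - expected_score(winner, loser)  # large when an underdog wins
    return winner + k * surprise, loser - k * surprise


# Two equally rated players: the winner gains exactly what the loser gives up.
print(update_ratings(1500.0, 1500.0))  # (1516.0, 1484.0)
```

Beating a much stronger opponent moves both ratings by almost the full K, while beating a much weaker one moves them hardly at all. That is the weighting of events a plain score lacks.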
From single
player game scores, we can learn the properties any successful rating
algorithm should have: it must be
- Reproducible
- Comprehensible
- Sufficiently detailed
- Well balanced
We need
reproducible results. Given identical performance, the algorithm should
end up with nearly identical scores. Players with comparable ability should
end up with comparable rank. Reproducible also implies that the meaning
of a given score should not change over time. This might be irrelevant
for games with an estimated lifetime of a fiscal quarter, but the shelf
life of multiplayer games extends further and further.
Our scores
also have to be comprehensible. The average player has to instantly and
intuitively understand scores. Scores have to match the player's ideas
of significance. Common examples like "confirmed kills", "deaths
suffered", "shots fired", "hits per shot", "frags
per minute" make sense even without comparing them to other players'
results.
Such performance
indicators are also detailed enough. Detail and balance issues are raised
by multiple scoring, a common problem for single player, but also for
team multiplayer. Single player DOOM listed time, kills, items,
and secret areas revealed - a combination that allowed players to subscribe
to different styles like "fast" or "thorough". Multiplayer
DOOM had only one score, the now classic "frag", and
we will take this as our starting point. For the pure and simple point-and-click
multiplayer FPS - no teams, no flags - the kill or "frag"
is the base unit of measurement.
The whole
purpose of scoring is compressing minutes if not hours of gameplay into
a single number, for ease of comparison and ease of communication (i.e.
boasting). For frag counting, a kill is a kill is a kill. Yet, human players
are not at all identical. If all frags count equally, we encourage "newbie
hunting", a gameflow in which players dominate by preying upon the
easiest victims and ignoring their peers - the net result being an utterly
imbalanced game. Any capable player who chooses to spare the struggling beginners
will be beaten by those who target them specifically.
To change
this, we have to assign a value to each single frag, rewarding those that
target their peers. In multiplayer deathmatch, the quality of a kill does
matter - the skill of your opponents matters. But how do we measure this
elusive quality, skill?
Figure 1. Example skill distribution for 50 players
Jacob's Ladder
Nobody really
cares about numbers. The only thing the social animal does care about
is rank. Your first attempt at a solution could be a ladder. Players join
by challenges, from the bottom, or by choosing an opponent. Each time
two players fight, the winner gets the rank of the loser, or keeps her
own rank, whichever is higher. Draws are ignored. A ladder
gives no hint of how large the difference in abilities between two players
really is, or any other linear scale of reference to compare players,
thus making it hard to determine a worthwhile challenge. There is also
no straightforward way two players could have equal rank.
The ladder
has a few things going for it, though. The history of the Elo system teaches
us that simplicity is an important property. Simple solutions are good
solutions. The less detailed a ranking scheme, the fewer free parameters,
the less argument between players. The process of match-and-swap offers
us a first insight into the true nature of ranking: we are actually trying
to solve a minimization problem here, namely an error function like
Error = sum over all players of |true rank - current rank|^2
We have
no means to measure the true rank except by comparison with another player
- in other words, by observing a game. Each actual match is a sample,
a data point during our quest for the true player hierarchy. Ranks might
be time-dependent (if player abilities change), and our results are blurred
by noise (because details we know nothing about may have affected the
outcome of a match), but this is as close as we get. We sample (let 'em
play), and update our estimates according to the result, hoping that,
more often than not, we are getting closer to the truth, and that things
will sort themselves out as long as people keep playing all over the board.
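The error function can be written down directly, although only for a simulation or test harness where the true ranks are known by construction. In a live game they are exactly what we cannot observe. This is a minimal sketch: the name `ranking_error` and the dictionary layout are assumptions made for illustration.

```python
def ranking_error(true_rank, current_rank):
    """Sum of squared rank displacements over all players.

    Both arguments map a player name to a rank (1 = top). The result is 0
    only when every player sits on her true rank.
    """
    return sum((true_rank[p] - current_rank[p]) ** 2 for p in true_rank)


true_rank = {"ann": 1, "bob": 2, "cy": 3}
current_rank = {"ann": 3, "bob": 2, "cy": 1}   # ann and cy are misplaced
print(ranking_error(true_rank, current_rank))  # 8
print(ranking_error(true_rank, true_rank))     # 0
```

In a simulation you can track this number across many sampled matches to see how quickly a given update rule converges.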
You can
see that this minimization task is combinatorial by nature (an assignment
problem): all permutations of players on ranks (without repetitions) are
possible solutions. This is a well known, hard problem, and our approach
is equally well-known, and usually called iterative improvement by pairwise
exchange:
if ( loser-rank is above winner-rank )
    swap( winner-rank, loser-rank )
We gained
nothing if the outcome of the fight matches our expectations (we expect
a higher-ranked player to win), otherwise all we know is that whatever
the real ranks, changing places will at least get us closer. All we care
about is who won - it does not matter whether the score was 2:1 frags,
100:99 or 10:1.
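As a runnable illustration of the exchange rule, the following sketch keeps the ladder as a plain list with the top rank at index 0. The function name `report_result` and the list representation are assumptions for illustration, not part of any particular game.

```python
def report_result(ladder, winner, loser):
    """Apply one match result to the ladder in place (ladder[0] is the top).

    The two players trade places only when the loser stood above the
    winner. An expected result leaves the ladder untouched.
    """
    w, l = ladder.index(winner), ladder.index(loser)
    if l < w:  # upset: the loser held the higher rank
        ladder[w], ladder[l] = ladder[l], ladder[w]


ladder = ["ann", "bob", "cy", "dee"]
report_result(ladder, "dee", "ann")   # bottom player beats the top player
print(ladder)                          # ['dee', 'bob', 'cy', 'ann']
report_result(ladder, "dee", "cy")    # expected result, nothing changes
print(ladder)                          # ['dee', 'bob', 'cy', 'ann']
```

Note that the margin of victory never enters the function. Only the identity of the winner and the loser does.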
It might
be necessary to introduce a penalty for ignoring challenges, as well as
rematch regulations and random selection of opponents, or other incentives
to keep things going. Pairwise exchange means high risk for the high-ranked
player, and no risk for a low-ranked player. Restricting challenges to
neighboring ranks slows down the dynamics. This means less instability
if a game is not too reproducible (random noise affects results), but
it will take much longer for the system to reach equilibrium and minimize
the error. We will encounter this trade-off again.
Finally,
pairwise exchange based on final frag count means that you have to enforce
one-on-one gaming. If you try this in a Free-For-All situation, your results
might turn out to be useless. Ladders are tools for tournaments, not so
much for multiplayer crowds. However, ladders might work for more than
two players on a per-game basis if you determine winner and loser per
event, per frag. Iterative improvement works best if you swap often, to
use all information available, and to avoid accumulated errors. The more
players participate, the more adequate the ranking - another important
lesson. The sorting obtained on a single server can afterwards be used
to re-assign the slots in a global ladder as well.
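A per-frag ladder for a free-for-all session could look like the sketch below. It assumes the server can deliver a stream of (killer, victim) events, that players start in the order they joined, and that suicides are simply skipped. All of these, along with the name `ladder_from_frags`, are illustrative assumptions.

```python
def ladder_from_frags(players, frags):
    """Build a session ladder from (killer, victim) events; index 0 is the top."""
    ladder = list(players)             # initial order, e.g. order of joining
    for killer, victim in frags:
        if killer == victim:
            continue                   # assumption: suicides carry no information
        k, v = ladder.index(killer), ladder.index(victim)
        if v < k:                      # victim stood above the killer: swap
            ladder[k], ladder[v] = ladder[v], ladder[k]
    return ladder


frags = [("cy", "ann"), ("bob", "cy"), ("ann", "bob")]
print(ladder_from_frags(["ann", "bob", "cy"], frags))  # ['ann', 'cy', 'bob']
```

Every frag is treated as a miniature one-on-one match, so the ladder uses all the information a session produces instead of only the final frag count.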
Figure 2. Example skill distribution for 500 players