
Cyberspace in the 21st Century: Stability Before Security
By Crosbie Fitch
Gamasutra
December 26, 2001
URL: http://www.gamasutra.com/features/20011226/fitch_02.htm
Challenging the Assumptions
What is your unique
talent? Is it shifting boxes? Providing a service? Or is it perhaps creating
entertaining works of art? "You think that's air you're breathing?"
Wake up and challenge what you've accepted for so long you think it's critical
to survival. Don't do what you don't need to do.
Distribution, marketing, customer support, maintenance, etc. may be exciting to some, but games developers only need to develop games. There are many other areas that risk being thought of as part of the games development process, including technological ones, so even coders are not entirely innocent of an occasional preoccupation with incidental concerns. Consider the following two points:
To make all our
lives much easier and far more focused on producing entertainment rather than
meeting marketing schedules, maintainability, bug-fixes, etc. all we have to
do is let the community look after everything except the game - we work on the
fun, and the punter plays with our work and pays us for it.
Many of you will
either think that that's the current situation, although expressed in a rather
simplified form, or that it's a patently obvious statement (and miss my point
completely). If you know where I'm going, bear with me.
If you're not convinced... Let's say
we're creating a massive multiplayer game (50,000 players). It's going to have
a back-end server software component, and a front-end client component. It'll
also have some content that defines the game together with its glorious scenery.
The money minded would probably require some kind of subscription revenue model
which would require another load of system software for collecting payment,
authentication, digital rights management, and any other administration that
players (carbon-based money dispensers) might need.
So, which bit
is the game?
Let's rule out everything apart from the essence of a game:
the back-end and front-end
can be developed by the global Open Source community of people fanatical about
such things (and where would we be without fanatics eh?). If that kind of thing
grabs your attention, you can roll your sleeves up and join in. The fact of
the matter is, the world of gamers benefits by enjoying the continuous development
of more enhanced game infrastructures. This is something Microsoft hasn't been
slow to notice, given its production of DirectX. OpenGL is a similar kind of
thing, and there are Open Source equivalents bubbling under. We're beginning
to realize that there's not much point to these common infrastructures being
closed and owned tooth and nail by a few corporations. Sure, the corporations
are hoping to get some money out of their efforts, but the rest of the world
doesn't care if they go bankrupt tomorrow, it only cares for the quality of
the game (and thus indirectly the infrastructure) not that it should
necessarily be free of charge. It's just that Open Source is beginning to look
like it can produce a better product (even if in some areas it's not quite there
yet).
Unfortunately,
quite a few companies want to turn into Microsoft and so it's difficult to wean
them off the pursuit of developing yet another proprietary technology in order
to obtain the supposedly vast monopolized rewards. If you're in a company like
this (especially if it's one of the few that might actually be successful),
you have an uphill struggle turning them away from this path. You're faced with
arguing against many sacred cows, saying no to patents, no to closed source
technology, and no to closed communication. That's all the baggage that encumbers
the dinosaurs of the industry. However, if you look at it closely, you'll realize
that you don't need it to make and sell games.
Therefore, we
end up concluding that the game is just the content, that nebulous informational
entity that determines the graphics and gameplay. Everything else, while a familiar
sight in the game industry, and sometimes perceived as part of the game by its
consumers, may not actually be directly critical to the game development process.
Produce a game
and sell it: what could be purer?
If the game's
infrastructure is Open Source, then it enjoys free distribution, free support,
free maintenance and free enhancement. The icing on the cake is that you're
not responsible for it either. If Napster was produced by the Open Source community,
and you (like some musicians) just happened to beneficently drop one of your
own recordings into it, who would be responsible for the fact that less scrupulous
users might drop copyrighted albums into it? I'll give you a few minutes to think
about that one...
Note that I'm
not advocating the creation of technology in order to deprive artists of their
just rewards; I'm just using this as an example where owning technology can
be a double-edged sword. If you provide yourself as a target for litigation
then you'll surely become a target. If a technology provides a revolutionary
facility then there's no reason why you should become a scapegoat for that part
of industry that would rather see the revolution delayed for a few years.
If everything
except for the game's content is public domain, then the only thing you can
sell is the game. Let the players support themselves. Let the players petition
the Open Source community to improve the stability and security of the infrastructure.
Let the players pay their ISPs to deliver the software and host the servers.
Do you really want to do all this yourself? Do you really want to be obliged
to thousands of players to provide them with happy times?
Sure, money can
buy a game, but it can't buy guaranteed fun playing it. I'd suggest that you
seek to create a game that can be fun (for most people), but try to get out
of that deal as soon as possible. As long as your game has a reputation for
facilitating fun then you're ok, but you can't hope to hold each player's hand
each month and ensure they have fun. Running a massive crèche is a much
bigger job than making a game, and a tear-free experience is not something I'd
recommend anyone contract to (have you heard of a tear-free crèche?).
So, is anyone
mad enough to try this business model? Well, I don't know much about them, but
Nevrax (http://nevrax.com) appears
to be trying it, i.e. develop the game infrastructure Open Source, but obtain
revenue via the content, the game itself. And yes, naturally, there are
quite a few conventional and not so conventional revenue models you can choose
from (some depending upon copyright or encryption, some not).
Moving towards
Open Source middleware based platforms
The point
I've been leading up to is that the quality of the platform, whether hardware
games console or MMOG software infrastructure, doesn't have to be your problem, it can be
someone else's, and perhaps even the whole world's problem rather than one or
two manufacturers, middleware developers, or software houses. So let's just
get out of the mindset where it's up to you, the games developer, to worry about
offering the player a continuous glitch-free infrastructure and a guaranteed
tear-free experience. Just focus on the content, the essence of the game.
Of course, the
infrastructure's very interesting stuff and I'm sure many of you will want to
get involved in one or more Open Source projects (middleware or otherwise) alongside
producing games as an employee. Here's a link to some form letters that might
help you transition your occupational status from employee to employee and Open
Source worker: http://www.sage-au.org.au/osda.
Some enlightened employers might even see key benefits to you working on Open
Source projects as part of your employment.
I'm not simply
saying to the games developer, "just develop the content". What I'm saying specifically
is wear two hats: one as a developer of proprietary games content, and the other
as a developer of Open Source infrastructure. This way you get to do the same
cool things as always, but you don't get confused into selling the technology
in with the game. In any case, as long as you are paid the same, and get the
same kudos, what's the difference? Oh, yeah:
bye, bye technology licensing
deals (crocodile tears to that).
And running
away from big brother
The benefit
of leaving the cyberspace platform to Open Source development is that you're
one step closer to seeing who's really concerned about stability and security
in cyberspace (pssst - it's the players). Well, ok, maybe it's obvious to some
of you, but in my experience, it's usually been subsumed into the corporation's
concern, i.e. the guys making the money.
Then it seems, people start forgetting what security is for. At best, you can
hope for a patronizing, paternalistic care for the player, but topmost is probably
a concern for protecting intellectual property, legal exposure, liability and
litigation, etc. There's enough blame on games already for causing society's
ills, so the last thing the games industry needs to add to its list of responsibilities
is community policing (in the virtual world).
Speaking of which,
I've always been surprised at how willing some chat services have been to take
on the responsibility for ensuring that users behave themselves. Sure, there's
'value added' benefit in selling a chat service that's safe for the family (if
that's a key market advantage), but if this gets extended via UO/AC/EQ into
cyberspace then players are going to end up spending the bulk of their money
paying to be policed - and much less on the entertainment (the fun bit). I'm
not saying cyberpolicing is a bad thing, or even that it's not going to
become a lucrative business, just that, again, it's not exactly the raison d'être
of games development.
Cyberspace will
be as big as the Web. And just as policing the Web is a separate issue from
developing content for it, the same will apply to cyberspace. For example, you
might produce "Spaghetti Western World". One universe based on this
theme could be un-policed. Another one could be policed. Why? To make sure players
don't get up to any naughty business, like exchanging addresses for secret rendezvous,
or details of cheap digital TV projectors... legit, honest guy. If you're
looking for revenue models, perhaps in this case, it is that the un-policed
version is free.
Not that Kind
of Security
I haven't
really been on an irrelevant diversion so far. When you realize that cyberspace
will inevitably be developed by the Open Source community (or a compatible organization),
then you'll have a much better perspective in terms of where the issues regarding
stability and security are really coming from. They'll be entirely related to
assuring the quality of the player's experience. The games developer will only
need to worry about making sure their content tends to be entertaining, and
figuring out a revenue model (joining the similarly worried ranks of musicians
and movie moguls).
So there we are.
I've removed all the games developer's stability and security worries in a few
paragraphs. What a doddle eh?! Too fast for you?
Ok, so really,
I've just passed the burden for the infrastructure of cyberspace onto the Open
Source community, and cyber-policing onto society. You, however, now wear two
hats: the first as a game developer for a respectable game development company,
the second as an Open Source worker contributing to the development of cyberspace
(when you get a break from helping your landlady carry out her garbage).
I'll be dealing
with issues about producing content in a subsequent article - so put your first
hat away for next time. The rest of this article requires the second hat (yeah,
it might be red, at that).
Stability and
Security
So what's
next? Well, in previous articles we've looked at how it works and how it scales,
but now it's time to look at how it keeps on working.
Continuing with
the development of cyberspace as a distributed system of hierarchically related
nodes, let's explore how we go about keeping such a system working, i.e. stable
in the presence of failure and secure against attack.
As far as stability
is concerned we'll first look at how we ensure the system remains stable and
balanced in the absence of failure, but under arbitrary loading and stress,
and second how we can ensure the system is not critically perturbed in the presence
of expected failure.
Security is where we'll look at how the system can address unexpected failure
and sabotage, at the application, system, and network levels, and still maintain
stability and, in the case of sabotage, sustain minimal damage. However, although
I thought I might be able to squeeze security into this article, I've simply
run out of space, so you'll have to consider the next article as the second
half of this one...sorry.
As I've indicated
earlier, it's not critical to the system (or its ability to provide entertainment)
that it addresses issues of player behavior (lawful, socially acceptable), data
protection (privacy, intellectual property), and commercial viability (business
and revenue models), and so this area will not be much explored in this or the
next article. These issues may of course be addressed by anyone who wants to,
perhaps if a commercial opportunity seems apparent?
Stability
A system is stable
if it has the ability to maintain satisfactory operation (equilibrium) even
in the face of destabilizing forces or events.
Some of the more
obvious destabilizing things our distributed system has to put up with will
be of the failure kind. However, there is also the workload aspect, i.e. stability
in terms of load balancing across the various resources it uses: processing,
storage, and bandwidth. Let's look at load balancing first. I've touched upon
it in previous articles, but let's see if I can shed a little more light on
load balancing here.
Load Balancing
We don't
want the system to have undamped lurching or cycling of resource consumption
between nodes or connections. Like two holiday resorts becoming alternately
popular because while one has empty beaches and bargain prices, the other hasn't
and everyone decides to go to the other place next year.
Now, you can get
yourself into all sorts of difficulties if you start making the load-balancing
bit of a system too complicated because by being part of the system it has to
address the complex behavior generated by its own complex balancing procedures.
A bit of a Heisenberg idea here, i.e. an unbalanced system (without a load balancing
component) may seem a doddle to balance, but once you introduce an extra balancing
component, you no longer have the simple system that you started with. So, if
you do address load balancing it's probably best to do it in a simple way and
let PhD students burn their young brains out coming up with refinements (whether
they add complexity or not).
So I'd suggest
a heuristic approach, i.e. make some intelligent guesses,
tweak it until it works and don't worry about why it works (unless someone pays
you to find out). This is just like hacking, but with a veneer of scientific
respectability.
We'll need heuristics for the organization of the relationships between nodes,
responsibility for objects, and heuristics for utilizing local resources. There
are probably a few others too, but we'll concentrate on the key ones.
Metrics
Many judgments
between peers will need to be based on mutual experience. For that matter, any
kind of decision often needs to be based on facts and figures. Our heuristics
will likewise need to refer to a large body of measurements, statistics, and
analyses thereof in making their decisions.
What are the sorts
of things we need to measure, and what kinds of measurements are likely to be
useful? We can quickly suggest the big three resources: storage, processing
and communication. We'll be interested in qualitative aspects as well as quantitative
ones: not just how much, but how good. The other things we need to measure
will relate to the concepts that the system introduces, i.e. relationships,
responsibility, interest, and objects.
I've quickly created a few tables (one to eight) to give an idea of the sort of measurements we might make in a distributed modeling system.
Tables 1-4: Measurements Upon Storage, Measurements Upon Processing, Measurements Upon Communication, and Measurements Upon Relationships (NB Some of these measurements are made on each connection, with an aggregate measurement then also available).
Tables 5-8: Measurements Upon Responsibility, Measurements Upon Interest, Measurements Upon Objects, Measurements Upon Reputation.
You may notice
in tables one through three that I've introduced the idea of integrity. Sounds
like it could be useful as a security measure of some sort, doesn't it? Well,
it can help towards that end, and when you see that an entirely new concept,
reputation, appears in table 8, you might begin to realize its purpose. Well,
I'll get on to that a bit later, but for now we just need to appreciate that
a variety of measurements need to be made simply in order for the self-organizing
nature of the system to operate. These measurements become the foundation upon
which the heuristics are based. However, while we're talking about foundations,
don't let that suggest you should take these measurements as cast in stone.
There may be others that I haven't listed that may be more useful.
Of course, for
entirely different purposes such as system monitoring, diagnostics, interesting
statistics, etc. there's a myriad of other measurements that can be made.
Heuristics
There are
many decision-making processes likely to be going on in the system, but as I've
already identified, the key ones are parent selection, peer selection, object
owner selection, and resource utilization. Each of these decisions will be governed
by a heuristic in the form of a weighted sum of various measurements that we
consider should affect the decision. Perhaps a genetic algorithm could let us
arrive at an optimal heuristic, but we'll make do with an intelligent guess
for the time being; there's always empiricism to look forward to (a wet
finger in the wind).
Table 9 - Decision to change parent.
Table 10 - Decision to select a new peer or deselect a current peer.
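To make the weighted-sum idea a little more concrete, here is a minimal Python sketch of a decision to change parent. The measurement names (`latency_ms`, `reliability`, `spare_bandwidth`), the weights and the switching cost are all assumptions invented for illustration; they are not the factors listed in Tables 9 and 10. The switching cost acts as a damping margin, so a node doesn't lurch back and forth between two near-identical candidates.

```python
from dataclasses import dataclass

# Hypothetical weights: positive means "more is better", negative "less is better".
WEIGHTS = {"latency_ms": -0.01, "reliability": 5.0, "spare_bandwidth": 0.002}


@dataclass
class Candidate:
    name: str
    measurements: dict  # metric name -> most recent measured value


def score(candidate):
    """Weighted sum of whichever known measurements we hold for this candidate."""
    return sum(WEIGHTS[m] * v for m, v in candidate.measurements.items() if m in WEIGHTS)


def choose_parent(current, candidates, switching_cost=1.0):
    """Return the node to use as parent.

    A challenger must beat the current parent's score by more than the
    switching cost. With no current parent, the best candidate wins outright.
    """
    best = max(candidates, key=score)
    if current is None:
        return best
    if score(best) - score(current) > switching_cost:
        return best
    return current


a = Candidate("A", {"latency_ms": 80, "reliability": 0.95, "spare_bandwidth": 500})
b = Candidate("B", {"latency_ms": 60, "reliability": 0.97, "spare_bandwidth": 600})
print(choose_parent(a, [a, b]).name)     # B is only slightly better, so we stay with A
print(choose_parent(None, [a, b]).name)  # no parent at all: take the best candidate
```

Tweaking the weights until the system behaves is exactly the "wet finger in the wind" part; nothing here says these particular numbers are good ones.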
Children are
implicitly competing for ownership of objects that they're interested in.
The parent node that owns an object is constantly evaluating the cost/benefit
of retaining responsibility vs. relinquishing it: whether responsibility
would be better served if ownership were delegated to a child, and if so, which
child, and the transition cost vs. benefit of the change.
It is important to appreciate that ownership need not change rapidly. Ownership is not vital to any node, but it is advantageous for interesting objects.
Ownership decisions are being constantly evaluated all the way up the lineage from the current owner to the root. Thus, if ownership is changed fairly high up the lineage, notifications of change of ownership will filter down the lineage to the previous owner. It's likely that sometimes a decision to change ownership may occur at the same time at different nodes in the lineage. This doesn't really matter because a parent is always able to pull the rug from under the feet of its children so to speak.
Table 11 - Decision to change object owner.
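The ownership decision can be sketched in the same spirit. In the Python below, the benefit figures and transition cost are hypothetical numbers that some heuristic would have to supply, and the rule that the decision made nearest the root prevails is my reading of "a parent is always able to pull the rug from under the feet of its children", not a specified protocol.

```python
def choose_owner(own_benefit, child_benefits, transition_cost):
    """Decide, from the current owner's point of view, who should own an object.

    own_benefit     -- estimated benefit of this node retaining ownership
    child_benefits  -- dict of child name -> estimated benefit if it owned the object
    transition_cost -- one-off cost of handing ownership over
    Returns None to retain ownership, or the name of the child to delegate to.
    """
    if not child_benefits:
        return None
    best_child = max(child_benefits, key=child_benefits.get)
    gain = child_benefits[best_child] - own_benefit
    return best_child if gain > transition_cost else None


def resolve_concurrent(decisions):
    """Settle ownership decisions made at the same moment along one lineage.

    decisions -- list of (depth, new_owner) pairs, where depth 0 is the root.
    The decision taken nearest the root prevails; returns None if there are none.
    """
    if not decisions:
        return None
    return min(decisions, key=lambda d: d[0])[1]


print(choose_owner(1.0, {"child-1": 1.2, "child-2": 3.0}, transition_cost=0.5))
print(resolve_concurrent([(3, "leaf-7"), (1, "region-2")]))
```

Because the gain has to exceed the transition cost, ownership changes slowly, which fits the point that it need not change rapidly.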
As for resource utilization, the answer is to prioritize everything. This can be based on how important something is to the operation of the system, the operation of the game and the node's current interests.
Therefore, it involves the system designer, along with the game developer, player and any interested child nodes.
Table 12 - Communication Utilization Policy
Tables 12 to 14 list a variety of factors that are likely to determine how resource utilization is prioritized.
Tables 13 and 14: Storage Utilization Policy, Processing Utilization
Policy (NB A non-mutating thread is one that is unable to cause a
persistent effect, e.g. adjust object properties, register interests
or events).
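One simple way to "prioritize everything" for communication is a priority queue drained against a bandwidth budget. The following Python sketch assumes three made-up importance factors (system, game, node interest) with illustrative weights; the real factors would come from the utilization policy tables.

```python
import heapq
import itertools

_counter = itertools.count()  # tie-breaker so equal priorities stay first-in, first-out


def priority(system_importance, game_importance, node_interest):
    """Combine the three factors into one number (the weights are illustrative guesses)."""
    return 4.0 * system_importance + 2.0 * game_importance + 1.0 * node_interest


def enqueue(queue, message, size_bytes, prio):
    # heapq is a min-heap, so negate the priority to pop the most important first.
    heapq.heappush(queue, (-prio, next(_counter), size_bytes, message))


def drain(queue, budget_bytes):
    """Send messages in priority order until the next one no longer fits the budget.

    Stopping at the first message that doesn't fit means a lower-priority
    message never overtakes a higher-priority one.
    """
    sent = []
    while queue and queue[0][2] <= budget_bytes:
        _, _, size, message = heapq.heappop(queue)
        budget_bytes -= size
        sent.append(message)
    return sent


q = []
enqueue(q, "scenery detail", 400, priority(0, 1, 1))
enqueue(q, "ownership change", 100, priority(1, 1, 0))
enqueue(q, "avatar position", 200, priority(0, 1, 3))
print(drain(q, 350))  # ['ownership change', 'avatar position']
```

The same shape would serve for storage or processing, with bytes swapped for cache space or CPU time.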
Equitability
In order
for the load to be balanced throughout the system, there is an implicit acceptance
by each participating node of an equitable contract, i.e. 'to each according
to their need (interest and capacity), and from each according to their ability
(knowledge and resources).' This doesn't necessarily mean that each node has
to be equal; it just means that each node is only permitted to make unlimited
requests to any other node on the understanding that it places no limit on the
requests it serves, which it will do to the best of its ability. NB this is
just a node thing, it doesn't mean that a player can't throttle bandwidth, or
limit the CPU and RAM available to the system.
This reciprocity
is something that I have been assuming is self-evident; that the system only
works because it grows in power with each new node. If nodes join that are unwilling
to be anything but consumers (even if no other node would tend to select them
as peer or parent) then this defeats the distributed nature of the system, especially
its ability to balance the load.
Of course, I have
been assuming that each node runs the same software and cannot be selfishly
configured, but I expect it is still necessary to state this assumption and
its basis of equitability just in case it's not as obvious a requirement as
I'd thought.
The system works by harnessing each node's self-interest. In turn, each node
is obligated to service peer interests, and adopts interests of child nodes
(considering it a worthy parent) as though they were its own.
Guarantees
The system
should only support the basest of motivations, and just one at that, i.e. the
best modeled experience possible. That's another reason why this is a best effort
modeling system and not a best effort storage, processing or communications
system. Therefore, you may see a tendency for good storage, processing and communications;
however, none of these are actually the system's primary concern. Moreover,
being best effort by nature means that there are no guarantees. Indeed, it's
the lightest burden for both the system and each node, if neither requires guarantees
(doesn't crash in their absence) nor is obliged to provide them.
In this way, we arrive at a system that no one will pay for (it's only the money
minded that faint at this point). Its software is developed by the Open Source
community (probably), with the players providing the resources that it uses.
Neither developer nor player undertakes to guarantee any level of service to the
other, but there is a gentleman's agreement that participation is on an equitable
basis.
Anything beyond
this basic, unguaranteed system is obviously where money comes into
play. Or, put more succinctly: guarantees cost money.
To some extent,
this also applies with respect to security, privacy, authenticity, etc.
These things can be tendencies of the system, but guarantees need only be
provided by commercially oriented third parties or implementers of variants
of the system that do have these properties.
Maybe this
helps clarify my stance of separating system design from add-on features
necessary for commercial viability of a particular business model.
Fault Tolerance
Alongside
equitable balancing of a minimal load, we have the need to tolerate faults.
Stability is not only maintained by being balanced internally, it is also maintained
by avoiding becoming unbalanced by external forces.
If it's possible for the system to work around a failure of some sort, then we should design the system to do that. We don't want an unstable system that comes crashing to a halt at the first hint of a divide-by-zero error somewhere. We want a stable system that has a good operational life expectancy given the environment it has to work in. Our environment includes the following expected failures: local (abrupt shutdown, storage failure, corruption), messaging (message loss/re-ordering/duplication/delay/corruption/spuriosity) and connectivity (abrupt disconnection/isolation, network partition).
This means we
must tolerate or route around expected failure. Withstanding unexpected failure,
or concerted attack is not in our remit (that's a job for security).
Therefore, in general we make no assumptions concerning the usability of any protocol we use. It's simply a case of sending a message to a recipient. However, we at least need measures to let us know whether a message we've received is intact, e.g. a CRC. It may also be useful for quality monitoring purposes to add information (sequence numbers, timestamps, etc.) to messages. This lets us select the best quality communication channel available at any one time: another case where we'll have a heuristics-based decision to make.
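As a sketch of what such a message envelope might look like, the Python below frames a payload with a sequence number, a send timestamp and a CRC-32 trailer, using only the standard `struct` and `zlib` modules. The field layout is an assumption made up for this example. Note that a CRC only catches accidental corruption; it does nothing against deliberate tampering.

```python
import struct
import time
import zlib

_HEADER = struct.Struct("!Id")  # sequence number (uint32), send time (float64)
_CRC = struct.Struct("!I")      # CRC-32 trailer computed over header + payload


def pack_message(seq, payload, now=None):
    """Frame a payload with a sequence number, a timestamp and a CRC-32 trailer."""
    body = _HEADER.pack(seq, time.time() if now is None else now) + payload
    return body + _CRC.pack(zlib.crc32(body))


def unpack_message(frame):
    """Return (seq, sent_at, payload), or None if the frame is truncated or damaged."""
    if len(frame) < _HEADER.size + _CRC.size:
        return None
    body, trailer = frame[:-_CRC.size], frame[-_CRC.size:]
    if zlib.crc32(body) != _CRC.unpack(trailer)[0]:
        return None
    seq, sent_at = _HEADER.unpack(body[:_HEADER.size])
    return seq, sent_at, body[_HEADER.size:]


frame = pack_message(7, b"door=open", now=1.5)
print(unpack_message(frame))            # (7, 1.5, b'door=open')
damaged = bytearray(frame)
damaged[_HEADER.size] ^= 0x01           # flip one bit of the payload
print(unpack_message(bytes(damaged)))   # None
```

The receiver can compare the timestamp and sequence number against its own clock and counters to build up the per-channel quality statistics mentioned above.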
The communications module should support the creation of recipient groups. This allows multicast and broadcast communications channels to be used where appropriate.
Most messages will be state updates, and so it doesn't really matter where they come from or whether we received them unnecessarily (perhaps via broadcast). Even so, it's still useful to know the sender of a message in order to monitor the performance of an ongoing communications relationship.
NB You will probably notice that nothing so far impedes the ability to tamper with, remove or manufacture messages. That's discussed in the security section later on.
Because most messages are state updates, it's not particularly significant if a message is lost. Such loss can almost be considered a bandwidth reduction. This is why we need to monitor the rate of loss, and take remedial measures, such as selecting a better quality channel, only if it exceeds a certain limit.
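Here's a minimal Python sketch of that kind of loss monitoring, assuming each message carries a sequence number. The class name and the 20% limit are hypothetical, and a real monitor would use a sliding window and cope with sequence-number wraparound rather than remembering every number it has ever seen.

```python
class LossMonitor:
    """Estimate message loss on one channel from the sequence numbers received."""

    def __init__(self, limit=0.2):
        self.limit = limit  # hypothetical tolerable fraction of lost messages
        self.seen = set()   # a set ignores duplicates and tolerates re-ordering

    def record(self, seq):
        self.seen.add(seq)

    def loss_rate(self):
        """Fraction of sequence numbers missing between the lowest and highest seen."""
        if not self.seen:
            return 0.0
        expected = max(self.seen) - min(self.seen) + 1
        return 1.0 - len(self.seen) / expected

    def needs_remedy(self):
        """True only when loss exceeds the limit, e.g. time to pick a better channel."""
        return self.loss_rate() > self.limit


monitor = LossMonitor(limit=0.2)
for seq in [1, 2, 2, 4, 3, 5, 6, 7, 8, 10]:  # 9 is lost, 2 is duplicated
    monitor.record(seq)
print(monitor.needs_remedy())  # False: one loss in ten is tolerated
```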
There is always a negotiation between two nodes as to the respective ages of objects they have for a particular interest, e.g. "For Interest X, my oldest object is 1,900ms old; send me any younger updates". Each node also keeps track of what it has previously sent, so it won't send an update twice if it hasn't changed in the interim. Bear in mind that updates concern only arbitrated values rather than locally computed values. A node also keeps track of where each update came from. This stops node A regurgitating back to node B any updates it had acquired from node B. However, it's possible that nodes A and B can redundantly update each other with what they're both receiving from node C. Where this duplication is a significant consumption of bandwidth, it can be prevented by having nodes A and B tell each other "I'm in touch with node C, so only tell me about stuff you learnt from someone other than C". This assumes that the trip C->B->A is longer than C->A. In general, however, it should be expected that if you express the same general interest to N peers, a proportion of each update will be duplicated up to N times. Another option is to express a specific interest, but with the qualification that only details concerning objects owned by the peer should be supplied (or supplied at a higher priority).
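The bookkeeping described here can be sketched as a small Python class. Everything about it is an assumption for illustration: the names, the use of an integer version number as a stand-in for object age, and the `exclude_sources` argument that models "only tell me about stuff you learnt from someone other than C".

```python
class UpdateStore:
    """Tracks object state, where each update came from, and what each peer has been sent."""

    def __init__(self):
        self.objects = {}  # object id -> (version, value, source node)
        self.sent = {}     # peer -> {object id -> version last sent to that peer}

    def receive(self, obj_id, version, value, source):
        """Keep an incoming update only if it is newer than what we already hold."""
        current = self.objects.get(obj_id)
        if current is None or version > current[0]:
            self.objects[obj_id] = (version, value, source)

    def updates_for(self, peer, exclude_sources=()):
        """Return the updates worth sending to `peer`, and remember that they were sent.

        Skips anything learnt from the peer itself, anything learnt from a node
        the peer says it already hears from, and anything unchanged since it was
        last sent to this peer.
        """
        already = self.sent.setdefault(peer, {})
        out = []
        for obj_id, (version, value, source) in self.objects.items():
            if source == peer or source in exclude_sources:
                continue
            if already.get(obj_id, -1) >= version:
                continue
            out.append((obj_id, version, value))
            already[obj_id] = version
        return out


node_a = UpdateStore()
node_a.receive("door", 1, "open", source="B")
node_a.receive("lamp", 1, "on", source="C")
node_a.receive("cart", 1, "north", source="D")
print(node_a.updates_for("B", exclude_sources={"C"}))  # [('cart', 1, 'north')]
print(node_a.updates_for("B", exclude_sources={"C"}))  # []: nothing has changed since
```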
Balancing peer connections depends upon the heuristics discussed earlier: the greater coverage versus duplication of update versus the greater chance of timeliness versus bandwidth consumption. All we need to recognize here is that lost updates are highly liable to be re-sent, superseded or duplicated.
We are not distributing events, so the loss or reordering of that kind of critical message will not affect us. Remote method calls are expected, or perhaps encouraged, to be non-critical (to tolerate no call or duplicated calls), but features can be provided in the virtual machine to support reliable, transactional behavior if this is required. However, this has to be used sparingly, or one might be better off using JavaSpaces.
Therefore, in general, the system is happy to cope with an unreliable messaging layer. However, techniques to improve the quality of messaging that don't significantly impinge on bandwidth or latency are fine and should be exploited, but they should still not be performed transparently, as we need to monitor all aspects of quality for other purposes.
Disconnection has to be defined qualitatively (and probably heuristically as well) as the point at which communications performance is insufficient for a timely and graceful closure of the connection between two nodes, i.e. there has been no intervening period where quality has reduced to a tolerable but unsatisfactory level. Both nodes are expected to recognize this condition at the same time (though one or both of them may simply have crashed). Once a disconnection has occurred, the relationship between nodes is irrevocably changed, and even if communications performance resumes immediately, any future relationship should be determined according to the normal selection procedures.
1. Peer failure, disconnection from peer. The least severe connection problem is one that affects the quality of communication between two nodes that have a peer relationship.
Using a heuristic, it's possible that a node will have decided that it is worth making a peer subscription to a particular node in order to obtain fresher updates to objects it's interested in. This will only be maintained if the quality of the connection is actually sufficient for this purpose. As the quality deteriorates it becomes more likely that the costs of the peer subscription outweigh the benefit and the peer connection will then be abandoned.
There's little handshaking required, so if it's a complete disconnection then each peer will simply abandon any subscriptions to the other and perhaps consider alternative nodes.
Generally, it should be unusual for a peer connection to break completely if the peer has not become totally isolated, i.e. if it still has good communication with its parent, then that channel should also be available for a peer connection. Thus, peer disconnections are likely to be peer failures.
In any case, from one node's perspective, it's not possible to tell whether a peer crashed or simply became incommunicado. Therefore, this distinction doesn't really affect matters. Of course, things might be different from the other node's perspective.
2. Node failure, complete isolation. When a node crashes, or completely loses its connection with the network, we can consider that node isolated (see figure two).
Figure 2 - An isolated node
In an isolated state, a node can continue operating, modeling the virtual environment and presenting its attendant players with a reasonable experience. However, from this point on, the node has automatically lost ownership of all owned objects, and therefore no changes that occur from this point on will have a lasting effect. This is because, in practice, an isolated node inevitably diverges from consensual reality, and upon reconnection, the divergent state will be overwritten by authoritative updates. One can, though, conceive of hypothetical scenarios where the node and its descendants decide to excommunicate themselves and continue as though the node were a new root, in which case persistence could resume (requiring isolation forever after).
I'd suggest that the game designer should indicate to players when there is a connection problem, either overtly, or perhaps by bringing a fog into the scene, i.e. something that has the same effect as isolating the player from their ability to affect or observe a virtual world that is rapidly becoming an alternate reality.
If a node simply crashes (or has lost confidence in its local integrity), it should be brought to the player's attention, and remedial measures be taken to resume normal operation, e.g. a restart; there shouldn't be much client-side information to lose (player identification, rendering preferences, etc.). Therefore, if the computer's a write-off, the player can still relocate their avatar. Remember, as it's a distributed system, any player can use any connected computer they like; the only advantage to using the same computer is that the cached object store is more likely to be useful.
Now these are just the node's perspectives. The perspective from the rest of the nodes can be much more involved. The other consequences of a node becoming isolated can be seen in the following, more specific failures.
3. Leaf-node disconnection from parent. Still referring to figure two, let's consider what the repercussions would be if the node were a leaf-node, i.e. a player's computer that had no children. In this case the computer continues to operate, and network access is maintained, but contact with the parent ceases.
Ownership of any objects this node owned (a player avatar perhaps) will automatically (without any negotiation required) revert to its parent (if the parent has crashed, ownership will further revert to its parent). Why? Well, because that's the way of responsibility, it reverts up the chain of command, up the hierarchy. The nodes toward the root are more reliable and thus are the best to which ownership should revert. Another way of looking at it is to consider that if the disconnection is accidental, it doesn't really matter, but if it was deliberate then we err on the side of the 'more responsible' node. The repercussions of a disconnection should certainly not benefit the player. If they did, there'd be a hell of a lot of it going on.
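The reversion rule can be sketched in a few lines. This is only an illustration under stated assumptions, not the system's actual code: the `Node` class, its `reachable` flag and the `revert_ownership` function are names invented here for the example.

```python
from dataclasses import dataclass, field
from typing import Optional, Set


@dataclass
class Node:
    """Hypothetical node record; the field names are assumptions."""
    name: str
    parent: Optional["Node"] = None
    reachable: bool = True            # False once crashed or disconnected
    owned: Set[str] = field(default_factory=set)


def revert_ownership(lost: Node) -> Optional[Node]:
    """Move everything `lost` owned to its nearest reachable ancestor.

    No negotiation takes place: the rule is applied simply by walking up
    the hierarchy. Returns the new owner, or None if no ancestor is
    reachable (in which case nothing is moved).
    """
    heir = lost.parent
    while heir is not None and not heir.reachable:
        heir = heir.parent            # that parent has crashed too: keep climbing
    if heir is not None:
        heir.owned |= lost.owned
        lost.owned = set()
    return heir
```

If a leaf owning an avatar drops out while its parent is also down, the avatar ends up with the grandparent, which is the "further revert" case described above.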
Any recent state of owned objects that hadn't quite managed to be passed up to the parent will be overwritten by incoming updates concerning those objects (if they have changed in the interim). If ownership of an object can be regained prior to anything else changing it, then the state can be retained. However, it's quite likely that an object will be modeled, and thus changed, by the parent (or its parent) prior to possession being regained.
Any peers that are connected to the node may soon decide that it is no longer advantageous to subscribe to this node, given it no longer owns anything. However, this node is still likely to be just as interested (if not more so) in its current peer subscriptions.
The node's primary task now is to locate a parent. Given that heuristic parent determination is always going on, suitable alternative candidates are already known. Now that the cost of changing parent has reduced to zero, the best alternative candidate can be selected and made the new parent. It may be that the original parent becomes available once more (its reliability statistics suitably adjusted) and can be reconsidered for parenthood.
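As a rough sketch of this selection step: the scoring formula, the `reliability` and `latency_ms` fields, and the `switch_cost` parameter below are all assumptions made for illustration. The only points taken from the text are that candidates are evaluated continuously and that the cost of changing parent falls to zero once the parent is lost.

```python
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass(frozen=True)
class Candidate:
    name: str
    reliability: float    # 0.0 .. 1.0, adjusted as failures are observed
    latency_ms: float     # measured round-trip time


def score(c: Candidate) -> float:
    """Illustrative heuristic only: favor reliable, nearby nodes."""
    return c.reliability / (1.0 + c.latency_ms / 100.0)


def choose_parent(current: Optional[Candidate],
                  candidates: Iterable[Candidate],
                  switch_cost: float = 0.2) -> Optional[Candidate]:
    """Return the parent to use next.

    While a parent is still in contact, a rival must beat it by
    `switch_cost`. Once the parent is lost (current is None) that cost
    is zero, so the best-scoring candidate wins outright.
    """
    best = max(candidates, key=score, default=None)
    if current is None:
        return best
    if best is not None and score(best) > score(current) + switch_cost:
        return best
    return current
```

A returning original parent would simply reappear in `candidates` with a lowered `reliability`, and compete on the same terms as everyone else.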
Once a parent is re-established, the node will express its interests, and ownership of particularly interesting objects may be regained (possibly, possibly not). It's worth mentioning again that ownership is not essential for a player to affect the virtual world, it is simply a responsibility that is given to nodes most suitable to model those objects. It can also be considered a burden as much as a benefit.
Note that even without a parent, a node may still viably present a reasonably accurate view of the virtual world and allow the player to affect it. Remote method invocation is used to affect unowned objects, so it doesn't really matter to the player which node arbitrates over the objects they affect, just so long as the node that does own the object is reachable. This tends to be assured given that unreachable nodes lose ownership to their parents.
If for some reason a node becomes reachable only via intermediaries, then there's a yucky network problem going on. However, I'm relying on the fact that either there'll be a highly connective P2P communications protocol, or the communications module of this system will have to implement its own, which it may have to anyway, simply to provide a backup in case the P2P protocol has an Achilles heel like Napster and gets clobbered.
Even so, given that games need to be designed to tolerate lost messages, and especially not to rely on transactional behavior, it is not too grave a situation for nodes to be unable to contact a few other nodes simply because the routes don't work (it's equivalent to the messages always being lost). It may also be that occurrences of "unreachability" can be reported to a senior of the "unreachable" node and will affect the weighting of factors influencing the ability of that node to own objects. After all, it may be that some computers have firewalls that block certain connections. Note that I'm just referring to "reachability" here, as opposed to any decision by nodes not to trust each other.
So finally, we can see that if, for whatever reason, a parent becomes unreachable, there is a short-lived moment while the node transitions to a new parent, but otherwise, apart from lost recent state updates (which may manifest as a discontinuity in some objects), it is not a particularly traumatic event. It's certainly not something that upsets the node involved, let alone the entire system.
What should be done about the discontinuities? I'd recommend accepting the updates immediately. However, this is one of those endlessly debatable issues. The alternative is to present a convergent reality. But this depends upon whether players prefer a believable interim fiction to an unpleasant jolt back to truth (a consensual one though it is).
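The two options can be contrasted for a single value, say one coordinate of an object's position. The function names and the blend `rate` are assumptions chosen for this sketch.

```python
def snap(local: float, authoritative: float) -> float:
    """Accept the update immediately: the displayed value jumps to the truth."""
    return authoritative


def converge(local: float, authoritative: float, rate: float = 0.25) -> float:
    """Move the displayed value a fraction of the way toward the truth.

    Called once per frame, this shrinks the error geometrically, trading
    a visible jolt for a short-lived fiction.
    """
    return local + (authoritative - local) * rate
```

With `rate=0.25`, each call to `converge` removes a quarter of the remaining error, so the fiction lasts a handful of frames rather than vanishing in one.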
Figure 3 - A branch node disconnection
4. Branch-node disconnection from parent. What happens if the node that loses its parent has children? This would be a branch node. It may have an attendant player, but not necessarily. It just happens to be of sufficient goodness as a parent for some other nodes to select it. It may be a gateway node on a home LAN that serves a few kids' wireless handheld computers (leaf nodes). It might even be a large box hosted by an ISP.
The child nodes of a node that's lost a parent also lose any ownership that they may have had. They'll be informed about this by their parent at the highest priority so that any peer nodes subscribing to them won't continue to be misinformed regarding the arbitrated state of objects. Updates from the new owner will simply overwrite any misinformation that does sneak out.
A child node that's just seen its parent lose its parent, will in the course of its continuous evaluation of its parent's suitability versus alternative candidates, decide that another node may be a better parent. However, here, there's likely to be slightly less of a clear-cut decision to be made than being parentless. It may take a bit longer for the weighting to build up in terms of lack of timely updates before the child decides to change parent. This may be enough of a hiatus in which the parent could re-establish another parent and thus retain its worthiness in the eyes of its children.
Figure 4 - Disconnection of root node
5. Root node failure, disconnection from children. If the root node becomes disconnected, it is as if each of its children lost their parent. They will most likely re-parent to each other (in an orderly but non-deterministic way). Ultimately we will end up with one of these children remaining without a parent (because the only candidates are now its descendants). Given it knows that its previous parent was the root (and parentless) this is the one clear case in which this node is automatically justified in considering itself effectively excommunicated from its parent, i.e. to become root regent, assuming ownership of everything from now on, never parenting to another node. Therefore, this node is now the new root and is responsible for everything. The previous root has effectively performed an ungraceful abdication.
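A toy simulation of this re-parenting follows. It is a sketch under assumptions: nodes are plain strings, `merit` is a made-up score table, and the order of the `orphans` list stands in for the unpredictable timing of a real network.

```python
from typing import Dict, List, Optional


def is_descendant(node: str, ancestor: str,
                  parent: Dict[str, Optional[str]]) -> bool:
    """True if `ancestor` lies on the path from `node` up the hierarchy."""
    p = parent.get(node)
    while p is not None:
        if p == ancestor:
            return True
        p = parent.get(p)
    return False


def reparent_after_root_loss(orphans: List[str],
                             parent: Dict[str, Optional[str]],
                             merit: Dict[str, float]) -> str:
    """Re-parent the failed root's children among themselves.

    Each orphan in turn adopts the highest-merit orphan that is not one
    of its own descendants. The orphan left with no such candidate stays
    parentless and is returned as the root regent.
    """
    if not orphans:
        raise ValueError("the failed root had no children")
    regent: Optional[str] = None
    for node in orphans:
        options = [c for c in orphans
                   if c != node and not is_descendant(c, node, parent)]
        if options:
            parent[node] = max(options, key=lambda c: merit[c])
        else:
            parent[node] = None       # only descendants remain: become regent
            regent = node
    assert regent is not None
    return regent
```

In this simplified model the regent is whichever orphan happens to decide last, which is one way of reading "orderly but non-deterministic": merit shapes the hierarchy beneath the regent, while timing picks the regent itself.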
Given that collectively its children are likely to know just as much, if not more than the root, the loss of the root is not going to greatly impact the system. Of course, if the root had a 100-terabyte store, and its children collectively amounted to only a terabyte, then we might expect a significant risk of losing persistence. However, if the children had a collection of about 1,000 terabytes then the loss may be negligible.
The system's heuristics will try to select the best root node (and "near root" nodes), but if you ask, "What happens in the event of a disaster?" well, there's no magic wand to prevent it from being a disaster. Well, you might then ask, "Why not add a hot fail-over node in case the root goes?" and I'd reply "By all means! Please add all the fully-loaded nodes you can afford". Every one of them can be considered a fail-over node. This is a distributed system. The load is shared, but state is also duplicated. Even if the best root node disappears, an equal or second-best node will take over. Therefore, all children of a root can be considered hot fail-overs. There should be enough state duplication that nothing is lost through the root's disappearance. We only have to ensure that there's nothing special about the root going down compared with any other node.
It may be worth mentioning the process of a graceful abdication at this point. In order for the root to be changed gracefully the existing root has to abdicate. It first informs its children as to its intention to abdicate in order that they then re-parent where possible (until only one child remains). The remaining child is the root apparent. The root abdicant then delegates all its responsibility and records to this root apparent. The root abdicant then completes its abdication (disconnects), and the system now has a new root. The previous root may rejoin as an ordinary node, but would have to parent itself to a node, e.g. the new root. It would then attract children and increase its likelihood of obtaining responsibility. In due course, it may become root presumptive.
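The abdication steps can be walked through in code. This is a sketch only: the `Node` fields, the `merit` table, and the assumption that the highest-merit child is the one left as root apparent are all illustrative choices, not something the design prescribes.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Set


@dataclass
class Node:
    """Hypothetical node record; the field names are assumptions."""
    name: str
    parent: Optional["Node"] = None
    children: List["Node"] = field(default_factory=list)
    owned: Set[str] = field(default_factory=set)
    records: Dict[str, str] = field(default_factory=dict)


def abdicate(root: Node, merit: Dict[str, float]) -> Node:
    """Hand the root role to one child and detach the old root."""
    if not root.children:
        raise ValueError("a root with no children has no one to abdicate to")
    # 1. Children re-parent where possible until a single child remains.
    #    Assumption: the highest-merit child is the one left over.
    apparent = max(root.children, key=lambda n: merit[n.name])
    for child in root.children:
        if child is not apparent:
            child.parent = apparent
            apparent.children.append(child)
    # 2. Delegate all responsibility and records to the root apparent.
    apparent.owned |= root.owned
    apparent.records.update(root.records)
    # 3. Complete the abdication: the old root disconnects.
    root.children, root.owned, root.records = [], set(), {}
    apparent.parent = None
    return apparent
```

To rejoin afterwards, the old root would be treated like any other newcomer: it picks a parent (the new root, for example) and starts again with no owned objects.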
So root failure isn't going to bring the system down, although, potentially there may be some noticeable loss if there's insufficient duplication near the root.
Figure 5 - Partition, Isolation of a group of nodes
6. Partition, multiple simultaneous disconnection. Partition is one of the more disastrous failure modes that we can look forward to. This is where a group of nodes becomes disconnected from the rest of the hierarchy. This group of nodes isn't necessarily interrelated. So one could end up with several detached sub-hierarchies, some with peer relationships between them, and some quite isolated. However, these partitioned nodes are still potentially mutually reachable. You might imagine someone driving a jackhammer through the optical fiber that connected a college campus to the rest of the Internet (ignore the backups). Not all the computers on the campus may have had peer or hierarchical relationships with each other, but they can still reach each other.
All parentless nodes on both sides of the partition will attempt to find parents via the usual routes: their continuously evaluated alternatives, a cache of known nodes, a list of well known nodes, broadcast channels, directory services, etc. What we'll end up with is that nodes on the same side of the partition as the root should eventually rejoin the same hierarchy. Those nodes on the other side of the partition will probably coalesce into a single hierarchy. Ultimately we end up in a situation where one or more senior nodes are unable to contact any non-descendant node. These nodes can then either inform their descendants that they are in an unconnected state (please inform your players of a break in service), or they can opt to carry on regardless, knowing that if the usurping root ever re-parents itself when the partition ends, the intervening state will be overridden (if it has changed due to modeling by nodes on the other side of the partition).
It is conceivable that some games may cater to a partition, presenting it as an opportunity for a group of players to explore strategy, i.e. with a limited sub-set of the virtual world, the partitioned players can play for a while knowing that what they do is likely to be without consequence, so that upon reconnection, if their part of the world has not significantly changed, they can replay their actions as if the partition hadn't happened. Note that this merge can't really be done automatically as subtle things may have changed whose significance can only be appreciated by the intelligent player, e.g. I might jump off a building during a partition, but decide against doing it when the partition ends, because the lorry full of hay has moved.
Partitions can be minor (a LAN disconnection) or major (a rampant worm causing widespread failure). Perhaps a term for such a major partition would be a "shattering" (see figure six).
Figure 6 - Shattering, Multiple Partition
Other Failures
So now, perhaps it's easier to see how, just as the system allocates responsibility for arbitrating over object state, so it is also organized into a meritocratic hierarchy in order to provide arbitration over transfer of responsibility in the event of failure. However, there are many other kinds of failure. Those covered so far are largely of the expected sort. Unexpected failures are another kettle of fish and can largely be lumped together with misbehavior or failures due to sabotage. These are discussed in the next article on security.
Further Reading
Here are a few papers (PDF) I came across that provide a more rigorous exposition of distributed systems:
What's Coming Next?
Once we have a system that can cope under a diverse range of expected stresses (although some are hopefully rare), then we're in a better position to see how we could cope with unexpected stresses, along with deliberate and cunning attempts to break the system. Of course, it's not an easy task trying to cater for the unexpected, but let's give it a go in the next article.
Copyright © 2003 CMP Media Inc. All rights reserved.