It's free to join Gamasutra!|Have a question? Want to know who runs this site? Here you go.|Targeting the game development market with your product or service? Get info on advertising here.||For altering your contact information or changing email subscription preferences.
Registered members can log in here.Back to the home page.

Search articles, jobs, buyers guide, and more.

By Crosbie Fitch
Gamasutra
[Author's Bio]
December 26, 2001

Introduction

Moving towrds Open Source middleware based platforms

Heuristics

Fault Tolerance

Printer Friendly Version
   
Letters to the Editor:
Write a letter
View all letters


Features

Cyberspace in the 21st Century:
Stability Before Security

Fault Tolerance
Alongside equitable balancing of a minimal load, we have the need to tolerate faults. Stability is not only maintained by being balanced internally, it is also maintained by avoiding becoming unbalanced by external forces.

If it's possible for the system to work around a failure of some sort, then we should design the system to do that. We don't want an unstable system that comes crashing to a halt at the first hint of a divide by zero error somewhere. We want a stable system that has a good operational life expectancy given the environment it has to work in. Our environment consists of the following expected failures including local (abrupt shutdown, storage failure, corruption), messaging (message loss/re-ordering/duplication/delay/corruption/spuriosity) and connectivity (abrupt disconnection/isolation, network partition).

This means we must tolerate or route around expected failure. Withstanding unexpected failure, or concerted attack is not in our remit (that's a job for security).

  • Local failure. I expect you know how to go about monitoring the local system and maintaining its integrity. It can be as crude as a re-install upon any error, or as sophisticated as a transactional database running on a RAID drive. Suffice it to say, I am expecting that it is difficult for each node to maintain confidence in its local integrity. However, for the period that a node has lost integrity or operational viability it must be considered failed, and should be abruptly disconnected from the system. This is precisely the same situation where someone cutting the wire disconnects the node (discussed later).
  • Messaging failure. The system will have a communications module that should be designed to sit on top of bare bones, no frills, networking layer, e.g. a UDP. This is so we can make our own choice as to the reliability we require for a particular communication. If other protocols are available, such as TCP, it's possible that they will come in handy upon occasion, either for their specific features, or simply as alternatives (it may be that TCP can work where UDP does not).

Therefore, in general we make no assumptions concerning the usability of any protocol we use. It's simply a case of send a message to a recipient. However, we at least need measures to let us know whether a message we've received is intact, e.g. a CRC. It may also be useful for quality monitoring purposes to add information (sequence numbers, timestamps, etc.) to messages. This lets us select the best quality communication channel we're using at any one time — another case where we'll have a heuristics based decision to make.

The communications module should support the creation of recipient groups. This allows multicast and broadcast communications channels to be used where appropriate.

Most messages will be state updates, and so it doesn't really matter where they come from or whether we received them unnecessarily (perhaps via broadcast). Even so, it's still useful to know the sender of a message in order to monitor the performance of an ongoing communications relationship.

NB You will probably notice that nothing so far impedes the ability to tamper with, remove or manufacture messages. That's discussed in the security section later on.
Because most messages are state updates, it's not particularly significant if a message is lost. Such loss can almost be considered a bandwidth reduction. This is why we need to monitor the rate of loss, and take only remedial measures, such as selecting a better quality channel, if it exceeds a certain limit.

There is always a negotiation between two nodes as to the respective ages of objects they have for a particular interest, e.g. "For Interest X, my oldest object is 1,900ms — send me any younger updates". Each node also keeps track of what it has previously sent, so it won't send an update twice if it hasn't changed in the interim. Bear in mind that updates are only concerning arbitrated values rather than locally computed values. A node also keeps track of where the update is from. This stops node A regurgitating back to node B any updates it had acquired from node B. However, it's possible that node A and B can redundantly update each other with what they're both receiving from node C. Where this duplication is a significant consumption of bandwidth, it can be prevented by having nodes A and B tell each other "I'm in touch with node C, so only tell me about stuff you learnt from someone other than C". This assumes that the trip C->B->A is longer than C->A. In general, however, it should be expected that if you express the same general interest to N peers, that a proportion of each update will be duplicated up to N times. Another option is to express a specific interest, but with the qualification that only details concerning objects owned by the peer should be supplied (or at a higher priority).

Balancing peer connections depends upon the heuristics discussed earlier: the greater coverage versus duplication of update versus the greater chance of timelines versus bandwidth consumption. All we need to recognize here is that lost updates are highly liable to be re-sent, superseded or duplicated.

We are not distributing events, so the loss or reordering of that kind of critical message will not affect us. Remote method calls are expected, or perhaps encouraged, to be non-critical (to tolerate no call or duplicated calls), but features can be provided in the virtual machine to support reliable, transactional behavior if this is required. However, this has to be used sparingly, or one might be better off using JavaSpaces.

Therefore, in general, the system is happy to cope with an unreliable messaging layer. However, techniques to improve the quality of messaging that don't significantly impinge on bandwidth or latency are fine and should be exploited, but they should still not be performed transparently, as we need to monitor all aspects of quality for other purposes.

  • Connectivity. I'll now move on to the bigger failures that we need to cope with in distributed systems. More specifically, what happens when nodes or groups of nodes have communication difficulties? I'll be using some diagrams to sketch out each type of failure. The relatively arbitrary diagramming conventions I've used are shown in figure one. Connectivity failures in this system can be classified as peer failure (disconnection from peer), Node failure, complete isolationLeaf node disconnection from parent, branch node disconnection from parent, root node failure, disconnection from children and partition (multiple, simultaneous disconnection).

    Disconnection has to be defined qualitatively (and probably heuristically as well) as the point at which communications performance is insufficient for a timely and graceful closure of the connection between two nodes, i.e. there has been no intervening period where quality has reduced to a tolerable but unsatisfactory level. Both nodes are expected to recognize this condition at the same time (though one or both of them may simply have crashed). Once a disconnection has occurred, the relationship between nodes is irrevocably changed, and even if communications performance resumes immediately, any future relationship should be determined according to the normal selection procedures.

1. Peer failure, disconnection from peer. The least severe connection problem is one that affects the quality of communication between two nodes that have a peer relationship.

Using a heuristic, its possible that a node will have decided that it is worth making a peer subscription to a particular node in order to obtain fresher updates to objects it's interested in. This will only be maintained if the quality of the connection is actually sufficient for this purpose. As the quality deteriorates it becomes more likely that the costs of the peer subscription outweigh the benefit and the peer connection will then be abandoned.

There's little handshaking required, so if it's a complete disconnection then each peer will simply abandon any subscriptions to the other and perhaps consider alternative nodes.

Generally, it should be unusual for a peer connection to break completely if the peer has not become totally isolated, i.e. if it still has good communication with its parent, then that channel should also be available for a peer connection. Thus, peer disconnections are likely to be peer failures.

In any case, from one node's perspective, it's not possible to tell whether a peer crashed or simply became incommunicado. Therefore, this distinction doesn't really affect matters. Of course, things might be different from the other node's perspective.

2. Node failure, complete isolation. When a node crashes, or completely loses its connection with the network, we can consider that node isolated (see figure two).

Figure 2 - An isolated node

In an isolated state, a node can continue operating, modeling the virtual environment and presenting its attendant players with a reasonable experience. However, from this point on, the node has automatically lost ownership of all owned objects, and therefore no changes that occur from this point on will have a lasting effect. One can though, conceive of hypothetical scenarios where the node and descendants decide to excommunicate themselves and continue as though the node were a new root, in which case persistence could resume (requiring isolation forever after). This is because, in practice, an isolated node then inevitably diverges from consensual reality, and upon reconnection, the divergent state will be overwritten by authoritative updates.

I'd suggest that the game designer should indicate to players when there is a connection problem, either overtly, or perhaps by bringing a fog into the scene, i.e. something that has the same effect as isolating the player from their ability to affect or observe a virtual world — that is rapidly becoming an alternate reality.

If a node simply crashes (or has lost confidence in its local integrity), it should be brought to the player's attention, and remedial measures be taken to resume normal operation, e.g. a restart; there shouldn't be much client-side information (player identification, rendering preferences, etc.). Therefore, if the computer's a write-off, the player can still relocate their avatar. Remember, as it's a distributed system any player can use any connected computer they like, the only advantage to using the same computer is that the cached object store is more likely to be useful.

Now these are just the node's perspectives. The perspective from the rest of the nodes can be much more involved. The other consequences of a node becoming isolated can be seen in the following, more specific failures.

3. Leaf-node disconnection from parent. Still referring to figure two, let's consider what the repercussions would be if the node were a leaf-node, i.e. a player's computer that had no children. In this case the computer continues to operate, and network access is maintained, but contact with the parent ceases.

Ownership of any objects this node owned (a player avatar perhaps) will automatically (without any negotiation required) revert to its parent (if the parent has crashed, ownership will further revert to its parent). Why? Well, because that's the way of responsibility, it reverts up the chain of command, up the hierarchy. The nodes toward the root are more reliable and thus are the best to which ownership should revert. Another way of looking at it is to consider that if the disconnection is accidental, it doesn't really matter, but if it was deliberate then we err on the side of the 'more responsible' node. The repercussions of a disconnection should certainly not benefit the player. If they did, there'd be a hell of a lot of it going on.

Any recent state of owned objects that hadn't quite managed to be passed up to the parent will be overwritten upon incoming updates concerning those objects (if it has changed in the interim). If ownership of an object can be regained prior to anything else changing it, then the state can be retained. However, it's quite likely that an object will be modeled, and thus changed, by the parent (or its parent) prior to possession being regained.

Any peers that are connected to the node may soon decide that it is no longer advantageous to subscribe to this node, given it no longer owns anything. However, this node is still likely to be just as interested (if not more so) in its current peer subscriptions.

The node's primary task now is to locate a parent. Given that heuristic parent determination is always going on, suitable alternative candidates are already known. Now that the cost of changing parent has reduced to zero, the best alternative candidate can be selected and made the new parent. It may be that the original parent becomes available once more (its reliability statistics suitably adjusted) and can be reconsidered for parenthood.

Once a parent is re-established, the node will express its interests, and ownership of particularly interesting objects may be regained (possibly, possibly not). It's worth mentioning again that ownership is not essential for a player to affect the virtual world, it is simply a responsibility that is given to nodes most suitable to model those objects. It can also be considered a burden as much as a benefit.

Note that even without a parent, a node may still viably present a reasonably accurate view of the virtual world and allow the player to affect it. Remote method invocation is used to affect unowned objects, so it doesn't really matter to the player which node arbitrates over the objects they affect, just so long as the node that does own the object is reachable. This tends to be assured given that unreachable nodes lose ownership to their parents.

If for some reason a node becomes reachable only via intermediaries, then there's a yucky network problem going on. However, I'm relying on the fact that either there'll be a highly connective P2P communications protocol, or the communications module of this system will have to implement its own, which it may have to anyway, simply to provide a backup in case the P2P protocol has an Achilles heel like Napster and gets clobbered.

Even so, given games' need to be designed to tolerate lost messages, and especially not reliant on transactional behavior, it is not too grave a situation for nodes to be unable to contact a few other nodes simply because the routes don't work (it's equivalent to the messages always being lost). It may also be that occurrences of "unreachability" can be reported to a senior of the "unreachable" node and will affect the weighting of factors influencing the ability of that node to own objects. After all, it may be that some computers have firewalls that block certain connections. Note that I'm just referring to "reachability" here, as opposed to any decision by nodes not to trust each other.

So finally, we can see that if, for whatever reason, a parent becomes unreachable, that there is a short-lived moment while the node transitions to a new parent, but that otherwise, apart from lost recent state updates (which may manifest as a discontinuity in some objects), it is not a particularly traumatic event. It's certainly not something that upsets the node involved, let alone the entire system.

What should be done about the discontinuities? I'd recommend accepting the updates immediately. However, this is one of those endlessly debatable issues. The alternative is to present a convergent reality. But this depends upon whether players prefer a believable interim fiction to an unpleasant jolt back to truth (a consensual one though it is).

Figure 3 - A branch node disconnection

4. Branch-node disconnection from parent. What happens if the node that loses its parent has children? This would be a branch node. It may have an attendant player, but not necessarily. It just happens to be of sufficient goodness as a parent for some other nodes to select it. It may be a gateway node on a home LAN that serves a few kids' wireless handheld computers (leaf nodes). It might even be a large box hosted by an ISP.

The child nodes of a node that's lost a parent also lose any ownership that they may have had. They'll be informed about this by their parent at the highest priority so that any peer nodes subscribing to them won't continue to be misinformed regarding the arbitrated state of objects. Updates from the new owner will, simply overwrite any misinformation that does sneak out.

A child node that's just seen its parent lose its parent, will in the course of its continuous evaluation of its parent's suitability versus alternative candidates, decide that another node may be a better parent. However, here, there's likely to be slightly less of a clear-cut decision to be made than being parentless. It may take a bit longer for the weighting to build up in terms of lack of timely updates before the child decides to change parent. This may be enough of a hiatus in which the parent could re-establish another parent and thus retain its worthiness in the eyes of its children.

Figure 4 - Disconnection of root node

5. Root node failure, disconnection from children. If the root node becomes disconnected, it is as if each of its children lost their parent. They will most likely re-parent to each other (in an orderly but non-deterministic way). Ultimately we will end up with one of these children remaining without a parent (because the only candidates are now its descendants). Given it knows that its previous parent was the root (and parentless) this is the one clear case in which this node is automatically justified in considering itself effectively excommunicated from its parent, i.e. to become root regent, assuming ownership of everything from now on, never parenting to another node. Therefore, this node is now the new root and is responsible for everything. The previous root has effectively performed an ungraceful abdication.

Given that collectively its children are likely to know just as much, if not more than the root, the loss of the root is not going to greatly impact the system. Of course, if the root had a 100-terabyte store, and its children collectively amounted to only a terabyte, then we might expect a significant risk of losing persistence. However, if the children had a collection of about 1,000 terabytes then the loss may be negligible.

The system's heuristics will try to select the best root node (and "near root" nodes), but if you ask, "What happens in the event of a disaster?" well, there's no magic wand to prevent it from being a disaster. Well, you might then ask, "Why not add a hot fail-over node in case the root goes?" and I'd reply "By all means! Please add all the fully-loaded nodes you can afford". Every one of them can be considered a fail-over node. This is a distributed system. The load is shared, but state is also duplicated. Even if the best root node disappears, an equal or the second best node will take over if it fails. Therefore, all children of a root can be considered hot fail-overs. There should be enough state duplication that nothing is lost through the root's disappearance. We only have to ensure that there's nothing special about the root going down compared with any other node.

It may be worth mentioning the process of a graceful abdication at this point. In order for the root to be changed gracefully the existing root has to abdicate. It first informs its children as to its intention to abdicate in order that they then re-parent where possible (until only one child remains). The remaining child is the root apparent. The root abdicant then delegates all its responsibility and records to this root apparent. The root abdicant then completes its abdication (disconnects), and the system now has a new root. The previous root may rejoin as an ordinary node, but would have to parent itself to a node, e.g. the new root. It would then attract children and increase its likelihood of obtaining responsibility. In due course, it may become root presumptive.

So root failure isn't going to bring the system down, although, potentially there may be some noticeable loss if there's insufficient duplication near the root.

Figure 5 - Partition, Isolation of a group of nodes

6. Partition — multiple, simultaneous disconnection. Partition is one of the more disastrous failure modes that we can look forward to. This is where a group of nodes become disconnected from the rest of the hierarchy. This group of nodes isn't necessarily interrelated. So one could end up with several detached sub-hierarchies, some with peer relationships between them, and some quite isolated. However, these partitioned nodes are still potentially mutually reachable. You might imagine someone driving a jackhammer through the optical fiber that connected a college campus to the rest of the Internet (ignore the backups). Not all the computers on the campus may have had peer or hierarchical relationships with each other, but they can still reach each other.

All parentless nodes on both sides of the partition will attempt to find parents via the usual routes: their continuously evaluated alternatives, a cache of known nodes, a list of well known nodes, broadcast channels, directory services, etc. What we'll end up with is that nodes on the same side of the partition as the root, should eventually rejoin the same hierarchy. Those nodes on the other side of the partition will probably coalesce into a single hierarchy. Ultimately we end up in a situation where one or more senior nodes are unable to contact any non-descendant node. These nodes can then either inform their descendants that they are in an unconnected state (please inform your players of a break in service), or they can opt to carry on regardless, knowing that if the usurping root ever re-parents itself when the partition ends, that intervening state will be overridden (if it has changed due to modeling by nodes on the other side of the partition).

It is conceivable that some games may cater to a partition, presenting it as an opportunity for a group of players to explore strategy, i.e. with a limited sub-set of the virtual world, the partitioned players can play for a while knowing that what they do is likely to be without consequence, so that upon reconnection, if their part of the world has not significantly changed, they can replay their actions as if the partition hadn't happened. Note that this merge can't really be done automatically as subtle things may have changed whose significance can only be appreciated by the intelligent player, e.g. I might jump off a building during a partition, but decide against doing it when the partition ends, because the lorry full of hay has moved.

Partitions can be minor (a LAN disconnection) or major (a rampant worm), in the event of widespread failure. Perhaps a term for such a major partition would be a shattering — see figure six.

Figure 6 - Shattering, Multiple Partition

Other Failures

So now, perhaps it's easier to see how just as the system allocates responsibility for arbitrating over object state, so it is also organized into a meritocratic hierarchy in order to provide arbitration over transfer of responsibility in the event of failure. However, there are many other kinds of failure. Those covered so far are largely of the expected sort. Unexpected failures are another kettle of fish and can largely be lumped together with misbehavior or failures due to sabotage. These are discussed in the next article on security.

Further Reading

Here are a few papers (PDF) I came across that provide a more rigorous exposition of distributed systems:

What's Coming Next?

Once we have a system that can cope under a diverse range of expected stresses (although some are hopefully rare), then we're in a better position to see how we could cope with unexpected stresses, along with deliberate and cunning attempts to break the system. Of course, it's not an easy task trying to cater for the unexpected, but let's give it a go in the next article.

______________________________________________________

Back to [Introduction]


join | contact us | advertise | write | my profile
news | features | companies | jobs | resumes | education | product guide | projects | store



Copyright © 2003 CMP Media LLC

privacy policy
| terms of service