|
Features

Cyberspace
in the 21st Century:
Stability Before Security
Fault
Tolerance
Alongside
equitable balancing of a minimal load, we have the need to tolerate faults.
Stability is not only maintained by being balanced internally, it is also
maintained by avoiding becoming unbalanced by external forces.
If it's
possible for the system to work around a failure of some sort, then we
should design the system to do that. We don't want an unstable system
that comes crashing to a halt at the first hint of a divide by zero error
somewhere. We want a stable system that has a good operational life expectancy
given the environment it has to work in. Our environment consists of the
following expected failures including local (abrupt shutdown, storage
failure, corruption), messaging (message loss/re-ordering/duplication/delay/corruption/spuriosity)
and connectivity (abrupt disconnection/isolation, network partition).
This means
we must tolerate or route around expected failure. Withstanding unexpected
failure, or concerted attack is not in our remit (that's a job for security).
- Local
failure. I expect you know how to go about monitoring the local
system and maintaining its integrity. It can be as crude as a re-install
upon any error, or as sophisticated as a transactional database running
on a RAID drive. Suffice it to say, I am expecting that it is difficult
for each node to maintain confidence in its local integrity. However,
for the period that a node has lost integrity or operational viability
it must be considered failed, and should be abruptly disconnected from
the system. This is precisely the same situation where someone cutting
the wire disconnects the node (discussed later).
- Messaging
failure. The system will have a communications module that should
be designed to sit on top of bare bones, no frills, networking layer,
e.g. a UDP. This is so we can make our own choice as to the reliability
we require for a particular communication. If other protocols are available,
such as TCP, it's possible that they will come in handy upon occasion,
either for their specific features, or simply as alternatives (it may
be that TCP can work where UDP does not).
Therefore,
in general we make no assumptions concerning the usability of any protocol
we use. It's simply a case of send a message to a recipient. However,
we at least need measures to let us know whether a message we've received
is intact, e.g. a CRC. It may also be useful for quality monitoring
purposes to add information (sequence numbers, timestamps, etc.) to
messages. This lets us select the best quality communication channel
we're using at any one time another case where we'll have a heuristics
based decision to make.
The communications
module should support the creation of recipient groups. This allows
multicast and broadcast communications channels to be used where appropriate.
Most messages
will be state updates, and so it doesn't really matter where they come
from or whether we received them unnecessarily (perhaps via broadcast).
Even so, it's still useful to know the sender of a message in order
to monitor the performance of an ongoing communications relationship.
NB You
will probably notice that nothing so far impedes the ability to tamper
with, remove or manufacture messages. That's discussed in the security
section later on.
Because most messages are state updates, it's not particularly significant
if a message is lost. Such loss can almost be considered a bandwidth
reduction. This is why we need to monitor the rate of loss, and take
only remedial measures, such as selecting a better quality channel,
if it exceeds a certain limit.
There
is always a negotiation between two nodes as to the respective ages
of objects they have for a particular interest, e.g. "For Interest
X, my oldest object is 1,900ms send me any younger updates".
Each node also keeps track of what it has previously sent, so it won't
send an update twice if it hasn't changed in the interim. Bear in mind
that updates are only concerning arbitrated values rather than locally
computed values. A node also keeps track of where the update is from.
This stops node A regurgitating back to node B any updates it had acquired
from node B. However, it's possible that node A and B can redundantly
update each other with what they're both receiving from node C. Where
this duplication is a significant consumption of bandwidth, it can be
prevented by having nodes A and B tell each other "I'm in touch
with node C, so only tell me about stuff you learnt from someone other
than C". This assumes that the trip C->B->A is longer than
C->A. In general, however, it should be expected that if you express
the same general interest to N peers, that a proportion of each update
will be duplicated up to N times. Another option is to express a specific
interest, but with the qualification that only details concerning objects
owned by the peer should be supplied (or at a higher priority).
Balancing
peer connections depends upon the heuristics discussed earlier: the
greater coverage versus duplication of update versus the greater chance
of timelines versus bandwidth consumption. All we need to recognize
here is that lost updates are highly liable to be re-sent, superseded
or duplicated.
We are
not distributing events, so the loss or reordering of that kind of critical
message will not affect us. Remote method calls are expected, or perhaps
encouraged, to be non-critical (to tolerate no call or duplicated calls),
but features can be provided in the virtual machine to support reliable,
transactional behavior if this is required. However, this has to be
used sparingly, or one might be better off using JavaSpaces.
Therefore,
in general, the system is happy to cope with an unreliable messaging
layer. However, techniques to improve the quality of messaging that
don't significantly impinge on bandwidth or latency are fine and should
be exploited, but they should still not be performed transparently,
as we need to monitor all aspects of quality for other purposes.
- Connectivity.
I'll now move on to the bigger failures that we need to cope with in
distributed systems. More specifically, what happens when nodes or groups
of nodes have communication difficulties? I'll be using some diagrams
to sketch out each type of failure. The relatively arbitrary diagramming
conventions I've used are shown in figure
one. Connectivity failures in this system can be classified as peer
failure (disconnection from peer), Node failure, complete isolationLeaf
node disconnection from parent, branch node disconnection from parent,
root node failure, disconnection from children and partition (multiple,
simultaneous disconnection).
Disconnection
has to be defined qualitatively (and probably heuristically as well)
as the point at which communications performance is insufficient for
a timely and graceful closure of the connection between two nodes,
i.e. there has been no intervening period where quality has reduced
to a tolerable but unsatisfactory level. Both nodes are expected to
recognize this condition at the same time (though one or both of them
may simply have crashed). Once a disconnection has occurred, the relationship
between nodes is irrevocably changed, and even if communications performance
resumes immediately, any future relationship should be determined
according to the normal selection procedures.
1.
Peer failure, disconnection from peer. The least severe connection
problem is one that affects the quality of communication between two
nodes that have a peer relationship.
Using
a heuristic, its possible that a node will have decided that it is
worth making a peer subscription to a particular node in order to
obtain fresher updates to objects it's interested in. This will only
be maintained if the quality of the connection is actually sufficient
for this purpose. As the quality deteriorates it becomes more likely
that the costs of the peer subscription outweigh the benefit and the
peer connection will then be abandoned.
There's
little handshaking required, so if it's a complete disconnection then
each peer will simply abandon any subscriptions to the other and perhaps
consider alternative nodes.
Generally,
it should be unusual for a peer connection to break completely if
the peer has not become totally isolated, i.e. if it still has good
communication with its parent, then that channel should also be available
for a peer connection. Thus, peer disconnections are likely to be
peer failures.
In any
case, from one node's perspective, it's not possible to tell whether
a peer crashed or simply became incommunicado. Therefore, this distinction
doesn't really affect matters. Of course, things might be different
from the other node's perspective.
2.
Node failure, complete isolation. When a node crashes, or completely
loses its connection with the network, we can consider that node isolated
(see figure two).
In
an isolated state, a node can continue operating, modeling the virtual
environment and presenting its attendant players with a reasonable
experience. However, from this point on, the node has automatically
lost ownership of all owned objects, and therefore no changes that
occur from this point on will have a lasting effect. One can though,
conceive of hypothetical scenarios where the node and descendants
decide to excommunicate themselves and continue as though the node
were a new root, in which case persistence could resume (requiring
isolation forever after). This is because, in practice, an isolated
node then inevitably diverges from consensual reality, and upon reconnection,
the divergent state will be overwritten by authoritative updates.
I'd
suggest that the game designer should indicate to players when there
is a connection problem, either overtly, or perhaps by bringing a
fog into the scene, i.e. something that has the same effect as isolating
the player from their ability to affect or observe a virtual world
that is rapidly becoming an alternate reality.
If a
node simply crashes (or has lost confidence in its local integrity),
it should be brought to the player's attention, and remedial measures
be taken to resume normal operation, e.g. a restart; there shouldn't
be much client-side information (player identification, rendering
preferences, etc.). Therefore, if the computer's a write-off, the
player can still relocate their avatar. Remember, as it's a distributed
system any player can use any connected computer they like, the only
advantage to using the same computer is that the cached object store
is more likely to be useful.
Now
these are just the node's perspectives. The perspective from the rest
of the nodes can be much more involved. The other consequences of
a node becoming isolated can be seen in the following, more specific
failures.
3.
Leaf-node disconnection from parent.
Still referring to figure two, let's consider what the repercussions
would be if the node were a leaf-node, i.e. a player's computer that
had no children. In this case the computer continues to operate, and
network access is maintained, but contact with the parent ceases.
Ownership
of any objects this node owned (a player avatar perhaps) will automatically
(without any negotiation required) revert to its parent (if the parent
has crashed, ownership will further revert to its parent). Why? Well,
because that's the way of responsibility, it reverts up the chain
of command, up the hierarchy. The nodes toward the root are more reliable
and thus are the best to which ownership should revert. Another way
of looking at it is to consider that if the disconnection is accidental,
it doesn't really matter, but if it was deliberate then we err on
the side of the 'more responsible' node. The repercussions of a disconnection
should certainly not benefit the player. If they did, there'd be a
hell of a lot of it going on.
Any
recent state of owned objects that hadn't quite managed to be passed
up to the parent will be overwritten upon incoming updates concerning
those objects (if it has changed in the interim). If ownership of
an object can be regained prior to anything else changing it, then
the state can be retained. However, it's quite likely that an object
will be modeled, and thus changed, by the parent (or its parent) prior
to possession being regained.
Any
peers that are connected to the node may soon decide that it is no
longer advantageous to subscribe to this node, given it no longer
owns anything. However, this node is still likely to be just as interested
(if not more so) in its current peer subscriptions.
The
node's primary task now is to locate a parent. Given that heuristic
parent determination is always going on, suitable alternative candidates
are already known. Now that the cost of changing parent has reduced
to zero, the best alternative candidate can be selected and made the
new parent. It may be that the original parent becomes available once
more (its reliability statistics suitably adjusted) and can be reconsidered
for parenthood.
Once
a parent is re-established, the node will express its interests, and
ownership of particularly interesting objects may be regained (possibly,
possibly not). It's worth mentioning again that ownership is not essential
for a player to affect the virtual world, it is simply a responsibility
that is given to nodes most suitable to model those objects. It can
also be considered a burden as much as a benefit.
Note
that even without a parent, a node may still viably present a reasonably
accurate view of the virtual world and allow the player to affect
it. Remote method invocation is used to affect unowned objects, so
it doesn't really matter to the player which node arbitrates over
the objects they affect, just so long as the node that does own the
object is reachable. This tends to be assured given that unreachable
nodes lose ownership to their parents.
If for
some reason a node becomes reachable only via intermediaries, then
there's a yucky network problem going on. However, I'm relying on
the fact that either there'll be a highly connective P2P communications
protocol, or the communications module of this system will have to
implement its own, which it may have to anyway, simply to provide
a backup in case the P2P protocol has an Achilles heel like Napster
and gets clobbered.
Even
so, given games' need to be designed to tolerate lost messages, and
especially not reliant on transactional behavior, it is not too grave
a situation for nodes to be unable to contact a few other nodes simply
because the routes don't work (it's equivalent to the messages always
being lost). It may also be that occurrences of "unreachability"
can be reported to a senior of the "unreachable" node and
will affect the weighting of factors influencing the ability of that
node to own objects. After all, it may be that some computers have
firewalls that block certain connections. Note that I'm just referring
to "reachability" here, as opposed to any decision by nodes
not to trust each other.
So finally,
we can see that if, for whatever reason, a parent becomes unreachable,
that there is a short-lived moment while the node transitions to a
new parent, but that otherwise, apart from lost recent state updates
(which may manifest as a discontinuity in some objects), it is not
a particularly traumatic event. It's certainly not something that
upsets the node involved, let alone the entire system.
What
should be done about the discontinuities? I'd recommend accepting
the updates immediately. However, this is one of those endlessly debatable
issues. The alternative is to present a convergent reality. But this
depends upon whether players prefer a believable interim fiction to
an unpleasant jolt back to truth (a consensual one though it is).
4.
Branch-node disconnection from parent. What happens if the node
that loses its parent has children? This would be a branch node. It
may have an attendant player, but not necessarily. It just happens
to be of sufficient goodness as a parent for some other nodes to select
it. It may be a gateway node on a home LAN that serves a few kids'
wireless handheld computers (leaf nodes). It might even be a large
box hosted by an ISP.
The
child nodes of a node that's lost a parent also lose any ownership
that they may have had. They'll be informed about this by their parent
at the highest priority so that any peer nodes subscribing to them
won't continue to be misinformed regarding the arbitrated state of
objects. Updates from the new owner will, simply overwrite any misinformation
that does sneak out.
A child
node that's just seen its parent lose its parent, will in the course
of its continuous evaluation of its parent's suitability versus alternative
candidates, decide that another node may be a better parent. However,
here, there's likely to be slightly less of a clear-cut decision to
be made than being parentless. It may take a bit longer for the weighting
to build up in terms of lack of timely updates before the child decides
to change parent. This may be enough of a hiatus in which the parent
could re-establish another parent and thus retain its worthiness in
the eyes of its children.
5.
Root node failure, disconnection from children. If the root node
becomes disconnected, it is as if each of its children lost their
parent. They will most likely re-parent to each other (in an orderly
but non-deterministic way). Ultimately we will end up with one of
these children remaining without a parent (because the only candidates
are now its descendants). Given it knows that its previous parent
was the root (and parentless) this is the one clear case in which
this node is automatically justified in considering itself effectively
excommunicated from its parent, i.e. to become root regent, assuming
ownership of everything from now on, never parenting to another node.
Therefore, this node is now the new root and is responsible for everything.
The previous root has effectively performed an ungraceful abdication.
Given
that collectively its children are likely to know just as much, if
not more than the root, the loss of the root is not going to greatly
impact the system. Of course, if the root had a 100-terabyte store,
and its children collectively amounted to only a terabyte, then we
might expect a significant risk of losing persistence. However, if
the children had a collection of about 1,000 terabytes then the loss
may be negligible.
The
system's heuristics will try to select the best root node (and "near
root" nodes), but if you ask, "What happens in the event
of a disaster?" well, there's no magic wand to prevent it from
being a disaster. Well, you might then ask, "Why not add a hot
fail-over node in case the root goes?" and I'd reply "By
all means! Please add all the fully-loaded nodes you can afford".
Every one of them can be considered a fail-over node. This is a distributed
system. The load is shared, but state is also duplicated. Even if
the best root node disappears, an equal or the second best node will
take over if it fails. Therefore, all children of a root can be considered
hot fail-overs. There should be enough state duplication that nothing
is lost through the root's disappearance. We only have to ensure that
there's nothing special about the root going down compared with any
other node.
It may
be worth mentioning the process of a graceful abdication at this point.
In order for the root to be changed gracefully the existing root has
to abdicate. It first informs its children as to its intention to
abdicate in order that they then re-parent where possible (until only
one child remains). The remaining child is the root apparent. The
root abdicant then delegates all its responsibility and records to
this root apparent. The root abdicant then completes its abdication
(disconnects), and the system now has a new root. The previous root
may rejoin as an ordinary node, but would have to parent itself to
a node, e.g. the new root. It would then attract children and increase
its likelihood of obtaining responsibility. In due course, it may
become root presumptive.
So root
failure isn't going to bring the system down, although, potentially
there may be some noticeable loss if there's insufficient duplication
near the root.
6.
Partition multiple, simultaneous disconnection. Partition
is one of the more disastrous failure modes that we can look forward
to. This is where a group of nodes become disconnected from the rest
of the hierarchy. This group of nodes isn't necessarily interrelated.
So one could end up with several detached sub-hierarchies, some with
peer relationships between them, and some quite isolated. However,
these partitioned nodes are still potentially mutually reachable.
You might imagine someone driving a jackhammer through the optical
fiber that connected a college campus to the rest of the Internet
(ignore the backups). Not all the computers on the campus may have
had peer or hierarchical relationships with each other, but they can
still reach each other.
All
parentless nodes on both sides of the partition will attempt to find
parents via the usual routes: their continuously evaluated alternatives,
a cache of known nodes, a list of well known nodes, broadcast channels,
directory services, etc. What we'll end up with is that nodes on the
same side of the partition as the root, should eventually rejoin the
same hierarchy. Those nodes on the other side of the partition will
probably coalesce into a single hierarchy. Ultimately we end up in
a situation where one or more senior nodes are unable to contact any
non-descendant node. These nodes can then either inform their descendants
that they are in an unconnected state (please inform your players
of a break in service), or they can opt to carry on regardless, knowing
that if the usurping root ever re-parents itself when the partition
ends, that intervening state will be overridden (if it has changed
due to modeling by nodes on the other side of the partition).
It is
conceivable that some games may cater to a partition, presenting it
as an opportunity for a group of players to explore strategy, i.e.
with a limited sub-set of the virtual world, the partitioned players
can play for a while knowing that what they do is likely to be without
consequence, so that upon reconnection, if their part of the world
has not significantly changed, they can replay their actions as if
the partition hadn't happened. Note that this merge can't really be
done automatically as subtle things may have changed whose significance
can only be appreciated by the intelligent player, e.g. I might jump
off a building during a partition, but decide against doing it when
the partition ends, because the lorry full of hay has moved.
Partitions
can be minor (a LAN disconnection) or major (a rampant worm), in the
event of widespread failure. Perhaps a term for such a major partition
would be a shattering see figure six.
Other Failures
So now,
perhaps it's easier to see how just as the system allocates responsibility
for arbitrating over object state, so it is also organized into a meritocratic
hierarchy in order to provide arbitration over transfer of responsibility
in the event of failure. However, there are many other kinds of failure.
Those covered so far are largely of the expected sort. Unexpected failures
are another kettle of fish and can largely be lumped together with misbehavior
or failures due to sabotage. These are discussed in the next article on
security.
Further Reading
Here are
a few papers (PDF) I came across that provide a more rigorous exposition
of distributed systems:
What's Coming Next?
Once we
have a system that can cope under a diverse range of expected stresses
(although some are hopefully rare), then we're in a better position to
see how we could cope with unexpected stresses, along with deliberate
and cunning attempts to break the system. Of course, it's not an easy
task trying to cater for the unexpected, but let's give it a go in the
next article.
______________________________________________________
|