Subject Re: RFC: Clustering
Author Roman Rokytskyy
> This architecture is fraught with many problems, each of which must
> be solved.

:) Academics call it "state machine replication". Interestingly
enough, it is widely used in distributed systems, but I have never
heard of it being used in databases. If I remember correctly, one
of the issues was performance.

> The hardest is how to make sure the servers execute all in
> the same precise order. Without this, the scheme can't work.
> I don't have any idea of how you might do this, but perhaps
> you have a solution.

This is solvable, at least on a LAN. One can use a reliable
multicast protocol stack with total ordering, which guarantees
that messages are delivered to all nodes in the same order. WANs
have higher latencies, so reliable multicast is not that realistic
there.
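
As a rough illustration of what total ordering buys you, here is a
minimal sketch of one common way to get it: a single sequencer numbers
every message, and each replica holds back messages that arrive early.
The Sequencer and Replica classes are hypothetical names made up for
this example, not part of any real multicast stack, and the sketch
assumes no message is ever lost.

```python
import heapq
import itertools


class Sequencer:
    """Hypothetical central sequencer: stamps each message with a global number."""

    def __init__(self):
        self._counter = itertools.count()

    def stamp(self, message):
        return (next(self._counter), message)


class Replica:
    """Delivers stamped messages strictly in sequence-number order."""

    def __init__(self):
        self.next_seq = 0
        self.holdback = []  # min-heap of (seq, message) that arrived early
        self.log = []       # messages handed to the state machine so far

    def receive(self, stamped):
        heapq.heappush(self.holdback, stamped)
        # Deliver everything whose turn has come; the rest keeps waiting.
        while self.holdback and self.holdback[0][0] <= self.next_seq:
            seq, message = heapq.heappop(self.holdback)
            if seq == self.next_seq:  # smaller numbers are duplicates, drop them
                self.log.append(message)
                self.next_seq += 1


def demo():
    sequencer = Sequencer()
    stamped = [sequencer.stamp(m) for m in ("INSERT a", "UPDATE a", "DELETE a")]
    r1, r2 = Replica(), Replica()
    for s in stamped:            # the network delivers in order to r1
        r1.receive(s)
    for s in reversed(stamped):  # and in reverse order to r2
        r2.receive(s)
    return r1.log, r2.log
```

Both replicas end up with the same log even though the network
delivered the messages in different orders. A real protocol stack
also has to handle message loss, retransmission and failure of the
sequencer itself, which this sketch ignores.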

> Second, there is a problem of non-deterministic behaviors such as a
> random number generator or translation of the manifest constant
> "now". Each will yield different results on different system and the
> servers will diverge.

If such code is used, state machine replication cannot be used,
because execution becomes nondeterministic. As of 2002-2003, when I
finished researching that area, there was no solution and only one
workaround: "do the update on one site and replicate the modified
records/pages to the other nodes".
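
To make the difference concrete, here is a small sketch that contrasts
re-executing a statement containing 'now' on every node with executing
it once and shipping the resulting record. The tables are plain
dictionaries and the clock values are invented for the example; none
of this reflects how an actual engine stores or ships records.

```python
def execute(table, key, now):
    """Stand-in for a statement like: INSERT ... VALUES (key, 'now')."""
    table[key] = now


def replicate_statement(tables, clocks, key):
    """State machine replication: every node re-executes the statement
    and evaluates 'now' against its own local clock."""
    for table, clock in zip(tables, clocks):
        execute(table, key, clock)


def replicate_record(tables, clocks, key):
    """Workaround: execute on one site only, then ship the stored record."""
    execute(tables[0], key, clocks[0])
    for table in tables[1:]:
        table[key] = tables[0][key]  # copy the value, do not re-evaluate 'now'


def demo():
    clocks = [1000, 1003]  # hypothetical, slightly skewed node clocks
    diverged = [{}, {}]
    replicate_statement(diverged, clocks, "k")
    consistent = [{}, {}]
    replicate_record(consistent, clocks, "k")
    return diverged, consistent
```

In the first case the two nodes store different timestamps and have
diverged; in the second both hold the value computed on the first
node.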

> Finally, I don't see any way for a node to rejoin the
> cluster without taking down the database. This rather defeats the
> scheme, doesn't it?

Well, in theory, the surviving nodes can detect the crash (a typical
property of reliable multicast protocols, take it for granted). At
that point they can start writing a difference file like nbackup
does. When the node rejoins the group, only the difference must be
replicated. So the only issue is to rejoin the group soon, otherwise
the difference might grow bigger than the database itself.
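
The rejoin idea can be sketched roughly as follows. This is only an
illustration under simplifying assumptions: the database is a list of
pages, the survivor remembers changed page numbers in memory instead
of writing a real difference file, and no writes arrive while the
rejoining node is catching up. Node, Survivor and their methods are
hypothetical names and say nothing about nbackup's actual format.

```python
class Node:
    """Toy database node: a list of pages."""

    def __init__(self, page_count):
        self.pages = [b""] * page_count


class Survivor(Node):
    """A node that stays up and records what changed while a peer is down."""

    def __init__(self, page_count):
        super().__init__(page_count)
        self.dirty = None  # None means no peer is down, nothing to track

    def peer_crashed(self):
        self.dirty = set()  # start recording the difference

    def write(self, page_no, data):
        self.pages[page_no] = data
        if self.dirty is not None:
            self.dirty.add(page_no)

    def catch_up(self, rejoining):
        """Ship only the pages changed since the crash; return how many."""
        sent = 0
        for page_no in sorted(self.dirty or ()):
            rejoining.pages[page_no] = self.pages[page_no]
            sent += 1
        self.dirty = None  # the peer is back, stop tracking
        return sent


def demo():
    survivor, peer = Survivor(8), Node(8)
    survivor.write(1, b"old")      # replicated normally before the crash
    peer.pages[1] = b"old"
    survivor.peer_crashed()
    survivor.write(2, b"x")
    survivor.write(5, b"y")
    survivor.write(2, b"z")        # same page again, still one page to ship
    sent = survivor.catch_up(peer)
    return sent, survivor.pages == peer.pages
```

Here only two of the eight pages cross the wire when the peer
rejoins. The longer the peer stays away, the more pages end up in the
difference.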

Roman