Subject Re: [Firebird-Architect] Re: RFC: Clustering
Author Jim Starkey
Roman Rokytskyy wrote:
>
>> The hardest is how to make sure the servers execute all in
>> the same precise order. Without this, the scheme can't work.
>> I don't have any idea of how you might do this, but perhaps
>> you have a solution.
>>
>
> This is solvable, at least in the case of a LAN. One can use reliable
> multicasting with a total ordering protocol stack, which would guarantee
> that the messages are delivered to all nodes in the same order. WANs
> have bigger latencies, so the reliable multicast is not that realistic.
>
Huh? Total ordering protocol stack? That takes a little explaining,
particularly with a network made out of store and forward switches
(about the only thing on the market today). Do you really think that
the order of receipt of two packets from clients at opposite ends of a
LAN is deterministic? How could it be? Any collision on ethernet
causes each node to do a random wait and a retry. One collision
anywhere on the LAN and the whole scheme collapses.
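
To make the ordering concern concrete, here is a minimal sketch in Python. The two updates, the starting value, and the sequence-number stamping are all assumptions made up for illustration; nothing here is Firebird code or a real multicast stack. It shows two replicas diverging when they apply the same updates in arrival order, and converging only if something outside the raw network (here a hypothetical sequencer that stamps each message) imposes one order.

```python
# Hypothetical illustration: two replicas receive the same two updates,
# but the network delivers them in a different order to each replica.

def apply_in_order(updates, initial=10):
    """Apply updates one after another and return the final value."""
    value = initial
    for update in updates:
        value = update(value)
    return value

def add_five(v):
    return v + 5

def double(v):
    return v * 2

# Arrival order as seen by each replica (assumed, for illustration).
replica_a = apply_in_order([add_five, double])   # (10 + 5) * 2
replica_b = apply_in_order([double, add_five])   # 10 * 2 + 5

def deliver_by_sequence(stamped):
    """Sort (sequence_number, update) pairs so every replica uses one order."""
    return [update for _, update in sorted(stamped, key=lambda pair: pair[0])]

# The same messages, stamped by a hypothetical sequencer, still arrive
# in different orders but are applied in sequence-number order.
seen_by_a = [(1, add_five), (2, double)]
seen_by_b = [(2, double), (1, add_five)]
ordered_a = apply_in_order(deliver_by_sequence(seen_by_a))
ordered_b = apply_in_order(deliver_by_sequence(seen_by_b))
```

The sketch says nothing about how the stamps get assigned or what that costs in round trips, which is the part that needs explaining.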
>
>> Second, there is a problem of non-deterministic behaviors such as a
>> random number generator or translation of the manifest constant
>> "now". Each will yield different results on different systems and the
>> servers will diverge.
>>
>
> IF such code is used, the state machine replication cannot be used -
> the execution becomes nondeterministic. So far (I'd say, years
> 2002-2003, when I finished researching that area) there was no
> solution and only one workaround - "do update on one site and
> replicate the modified records/pages to other nodes".
>
The same problem exists in statement based replication. There are a
number of solutions, all bad. One is for the system to have knowledge
of all possible non-deterministic functions and pass "current" values
along with the query. That said, a function that generates UIDs is
pretty much guaranteed to fry one architecture or another.
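
A rough sketch of the "pass current values along with the query" approach, in Python. The function names, the `:now` and `:uuid` placeholders, and the dictionary layout are all hypothetical, and the string substitution stands in for real parameter binding. The originating node evaluates the non-deterministic values once and every replica reuses them instead of recomputing.

```python
import datetime
import uuid

def capture_context():
    """Evaluate the non-deterministic values once, on the originating node."""
    return {
        "now": datetime.datetime.now(datetime.timezone.utc).isoformat(),
        "uuid": str(uuid.uuid4()),
    }

def make_replicated_statement(sql):
    """Bundle the statement text with the values every replica must reuse."""
    return {"sql": sql, "context": capture_context()}

def execute_on_replica(statement):
    """Stand-in for a replica: substitute shipped values, never recompute.

    Plain string replacement is for illustration only; a real system
    would bind parameters rather than splice text into SQL.
    """
    text = statement["sql"]
    for name, value in statement["context"].items():
        text = text.replace(":" + name, "'" + value + "'")
    return text

stmt = make_replicated_statement(
    "INSERT INTO orders (id, created) VALUES (:uuid, :now)"
)
result_a = execute_on_replica(stmt)
result_b = execute_on_replica(stmt)
```

Note that the sketch only covers the two functions it knows about. Anything non-deterministic that is missing from the captured context is recomputed per replica, which is exactly the weakness described above.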
>
>> Finally, I don't see any way for a node to rejoin the
>> cluster without taking down the database. This rather defeats the
>> scheme, doesn't it?
>>
>
> Well, in theory, live nodes can detect the crash (typical property of
> the reliable multicast protocol, take it for granted). At that point
> they can start writing the difference file like nbackup does. When the
> node rejoins the group, only the difference must be replicated. So the
> only issue is to rejoin the group soon, otherwise the difference might
> get bigger than the database size.
>
>
How do clients find out that a new node has been added? If you tell the
client too soon, the entire system hangs until the database clone is
complete. If you tell them too late, the new clone misses updates
and the systems diverge.
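
The timing problem can be sketched with a toy model in Python. Everything here is an assumption for illustration: the database is a single running total, updates are integers, and `snapshot_at` and `announced_at` are made-up names for "how many updates the clone contains" and "the first update clients also send to the new node".

```python
def rejoin(updates, snapshot_at, announced_at):
    """Return the rejoining node's state after cloning and then going live.

    updates      -- full ordered list of increments applied by the cluster
    snapshot_at  -- number of updates contained in the clone
    announced_at -- index of the first update clients also send to the new node
    """
    state = sum(updates[:snapshot_at])      # contents of the clone
    state += sum(updates[announced_at:])    # updates received once live
    return state

updates = [1, 2, 3, 4, 5]
cluster_state = sum(updates)

# Announced exactly at the snapshot point: nothing is lost, but every
# client has to wait for the clone to finish before the third update
# can be applied on all nodes.
in_sync = rejoin(updates, snapshot_at=2, announced_at=2)

# Announced two updates late: the third and fourth updates never reach
# the new node, so it diverges from the rest of the cluster.
too_late = rejoin(updates, snapshot_at=2, announced_at=4)
```

The model leaves out the waiting itself; it only shows that the announcement has to line up exactly with the snapshot, and that any gap between the two is lost data.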