Subject Re: RFC: Clustering
Author Roman Rokytskyy
> Huh? Total ordering protocol stack? That takes a little
> explaining, particularly with a network made out of store and
> forward switches (about the only thing on the market today). Do you
> really think that the order of receipt of two packets from clients
> at opposite ends of a LAN is deterministic? How could it be? Any
> collision on ethernet causes each node to do a random rate and a
> retry. One collision anywhere on the LAN and the whole scheme
> collapses.

We are talking about reliable multicast (which in fact can run on
either UDP or TCP), a mode in which a message is reliably delivered to
all nodes in a group. The total ordering protocol stack is built on
top of that and uses totally ordered sequence numbers (or some other
approach of a similar nature). Out-of-sequence messages are buffered
until the preceding ones arrive. A good overview of various
implementations can be found in "Distributed Systems: Concepts and
Design" by G. Coulouris et al., Chapter 11.
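
To make the buffering step concrete, here is a minimal sketch in
Python of a receiver-side hold-back queue. It assumes that some
sequencer has already stamped every message with a group-wide sequence
number starting at 0; the class and method names are hypothetical and
are not JGroups API.

```python
class HoldBackQueue:
    """Delivers messages strictly in sequence-number order (illustrative only)."""

    def __init__(self, deliver):
        self.next_seq = 0        # sequence number we are waiting for
        self.pending = {}        # out-of-sequence messages, keyed by number
        self.deliver = deliver   # callback that hands a message to the application

    def receive(self, seq, msg):
        # Ignore duplicates (already delivered or already buffered).
        if seq < self.next_seq or seq in self.pending:
            return
        self.pending[seq] = msg
        # Deliver every message that is now contiguous with what came before.
        while self.next_seq in self.pending:
            self.deliver(self.pending.pop(self.next_seq))
            self.next_seq += 1
```

Every node that runs the same queue delivers the same messages in the
same order, no matter in which order the packets arrived on the wire.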

The JGroups toolkit has achieved ~55 MBps in an 8-node cluster with a
Gigabit switch (however, the total order stack was not tested; see
more at http://www.jgroups.org/javagroupsnew/perf/Report.html).
Considering that this is written in Java (data are copied back and
forth, no zero-copy sockets are used, etc.), that is not a bad result.

> How do clients find out that a new node has been added? If you tell
> the client too soon, the entire system hangs until the database
> clone is complete. If you tell them too late, the new clone misses
> an update and the systems diverge.

Assuming that the clone receives updates faster than the differences
are produced (replication of the differences must also go via the same
totally ordered stack), at some point it will bring its database up to
date. It then multicasts a message to all nodes, using the same
protocol stack, saying that the replication is complete. At that point
the nodes stop writing a difference file and simply multicast the
changes. That is the simplest scheme that came to my mind; one can
definitely invent something faster.
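
The join sequence described above can be sketched as a toy,
single-process simulation in Python. Everything here is an assumption
made for illustration: the Donor and Clone classes, the dict standing
in for the database, and the drain rate of two differences per step.
The totally ordered multicast itself is not modelled; the loop simply
processes events in one fixed order.

```python
class Donor:
    """Existing node: applies updates and logs them while a join is running."""

    def __init__(self, data):
        self.data = dict(data)
        self.diff_log = None          # None means no join is in progress

    def begin_join(self):
        self.diff_log = []            # start writing the difference file
        return dict(self.data)        # snapshot handed to the new clone

    def apply(self, key, value):
        self.data[key] = value
        if self.diff_log is not None:
            self.diff_log.append((key, value))

    def take_diffs(self, limit):
        batch = self.diff_log[:limit]
        del self.diff_log[:limit]
        return batch

    def end_join(self):
        self.diff_log = None          # "replication complete" was received


class Clone:
    """New node: starts from a snapshot and replays differences in order."""

    def __init__(self, snapshot):
        self.data = dict(snapshot)

    def apply(self, key, value):
        self.data[key] = value


def run_join_demo():
    donor = Donor({"a": 1})
    clone = Clone(donor.begin_join())
    for key, value in [("b", 2), ("a", 3), ("c", 4)]:
        donor.apply(key, value)                  # one new difference per step
        for k, v in donor.take_diffs(2):         # clone drains up to two per step
            clone.apply(k, v)
    if not donor.diff_log:                       # clone has caught up
        donor.end_join()                         # stands in for the multicast
    donor.apply("d", 5)                          # later changes go to both
    clone.apply("d", 5)                          # nodes directly
    return donor, clone
```

Calling run_join_demo() returns a donor and a clone whose data are
identical. The clone catches up only because it drains differences at
least as fast as they are produced, which is exactly the assumption
stated above.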

That was the theory; now comes the practice.

Bela Ban (the main author of JGroups) was talking to the HSQLDB
developers about adding replication using this scheme (we exchanged a
few emails at that time), and something was even implemented. But then
things changed: JBoss became a commercial entity and JGroups became
the clustering mechanism for JBoss. The clustered HSQLDB project is
now either postponed or abandoned.

I know that there is a working database replication system based on
reliable multicast for PostgreSQL. It has been developed at the Swiss
Federal Institute of Technology, Zurich, under the name Postgres-R
(see also
http://gborg.postgresql.org/project/pgreplication/projdisplay.php).

Roman