Subject: Re: [Firebird-Architect] Kemme Paper
Author: Paul Ruizendaal
> My point is that the Kemme scheme is not serializable, which was its
> claim to fame.

Why are you (of all people) so hung up on "classic serializable"? Snapshot
isolation isn't serializable either, and serializability is a red herring
in this discussion. From the earlier conversation in this thread, I think
everybody agrees on what differences between nodes are perfectly okay.
Let's just stick with the four rules:
1. Stable view of data within a transaction
2. Cannot modify a record which has been modified in a committed concurrent
transaction
3. Obeys consistency constraints within and across transactions
4. Ann's rule: same state on all nodes when quiesced
So far I believe both your design and the Kemme design comply with the
four rules. So a cloud database has to be ACID and SCOA :^)
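
To make rules 1 and 2 concrete, here is a toy sketch (in Python) of
snapshot isolation's first-committer-wins check. The names are purely
illustrative, not Firebird or Nimbus code:

    # Toy snapshot-isolation sketch, illustrating rules 1 and 2 only.
    class ConflictError(Exception):
        pass

    class Store:
        def __init__(self):
            self.committed = {}   # record -> last committed version
            self.clock = 0        # global commit counter

        def begin(self):
            # Rule 1: the transaction reads from a snapshot frozen at
            # its start; later commits by others are invisible to it.
            return Txn(snapshot=self.clock)

        def commit(self, txn):
            # Rule 2: first committer wins. Abort if any record we
            # wrote was committed by a concurrent transaction after
            # our snapshot was taken.
            for rec in txn.writes:
                if self.committed.get(rec, 0) > txn.snapshot:
                    raise ConflictError(rec)
            self.clock += 1
            for rec in txn.writes:
                self.committed[rec] = self.clock

    class Txn:
        def __init__(self, snapshot):
            self.snapshot = snapshot
            self.writes = set()

        def write(self, rec):
            self.writes.add(rec)

Rule 2 here is just the write-write conflict check: of two concurrent
transactions writing the same record, the second one to commit loses.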

> This does not say that it isn't consistent, but it's an
> unnecessarily expensive way to implement consistency. That said, I'm
> not prepared to vouch for its consistency, but I don't have a
> counter-example either.

So we are back to the best way to minimize communication and maximize
parallelism. This is indeed the interesting core of the discussion and I
prefer to discuss it based on facts, figures and reasoned arguments.

>>> The paper states the total ordering can deliver hundreds of messages
>>> per second. Since each commit requires two messages (one to send the
>>> commit and another, ordered, to reapply the commit), [...]

In my understanding that is not entirely correct. There is the multicast
commit request sent out, which is delivered in total order to all nodes,
including the sender. That is one logical message, although ordering and
reliable delivery will require more messages (some at the TCP level, some
at the GCS level). Then there is the commit ok/abort message, which
requires unordered but still reliable delivery. At the IP level there may
be as many as ten messages; I really don't know: with, e.g.,
acknowledgements piggybacked on payload, it gets really tough to arrive at
a deterministic number. Benchmarking is the only way.
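
For concreteness, here is the per-commit flow as I picture it. 'gcs'
stands for a group communication system such as Spread, and the method
names are placeholders, not a real API:

    # Sketch of the per-commit message flow in a Kemme-style protocol.
    # multicast_total_order / multicast_reliable are illustrative names
    # for the GCS's totally-ordered and reliable-unordered services.

    def request_commit(gcs, txn):
        # Logical message 1: the writeset, multicast in total order to
        # all nodes, including the sending node itself.
        gcs.multicast_total_order(("COMMIT_REQ", txn.id, txn.writeset))

    def on_total_order_delivery(gcs, node, msg):
        kind, txn_id, writeset = msg
        if kind == "COMMIT_REQ":
            # Every node certifies writesets in the same delivery
            # order, so every node reaches the same verdict.
            ok = node.certify_and_apply(writeset)
            if node.is_origin(txn_id):
                # Logical message 2: the commit ok/abort, which needs
                # reliable but unordered delivery.
                verdict = "COMMIT_OK" if ok else "ABORT"
                gcs.multicast_reliable((verdict, txn_id))

How many IP packets hide behind those two logical messages is exactly the
part only benchmarking can settle.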

All is relative, though. How many messages does an alternative design
need? This is an open question for me, as I haven't closed my mind on any
approach. Note that so far we have only discussed the simple core of each
protocol: in real life we have to deal with broken connections, failing
nodes, lost messages, multiple deliveries of a message, slow nodes, etc.
My gut feel is that 90% of the messaging code will be about reliable
delivery and failure cases, not the core algorithm.

> It's a stupid idea. Its performance is directly related to the highest
> latency in the network. OK when everything goes through a Gigabit
> switch, but pretty dreadful otherwise.

I wasn't aware that Nimbus was being optimized for operation over a WAN or
the Internet. Cool! What's your rationale for that?

The use case that I have in mind is something that works comfortably for
10 users, but also for a wannabe salesforce.com or netsuite.com. The
hardware that I have in mind is a LAN with 10 or 20 or 100 SQL processing
nodes (not that I even have 10 nodes in my home network :^). As discussed
several times before on this list, salesforce.com scales to ~30,000
concurrent users on a single instance (they have ~10 instances) using
homebrew J2EE middleware and Oracle RAC at a TCO of $15 per concurrent
user, so that is the state of the art today.

In a 10-year-old test on old hardware (Pentium III, 100 Mbps LAN), Spread
needed 2-3 ms to totally order messages between 20 nodes. Note that this
includes dealing with failure cases, monitoring group membership changes
and guaranteeing virtual synchrony.

What throughputs are you achieving in the current builds of Nimbus? On
what hardware? LAN/WAN? Is that code also dealing with failure cases?

Paul