| Subject | Re: [Firebird-Architect] Kemme Paper |
|---|---|
| Author | Jim Starkey |
| Post date | 2010-02-06T17:45:48Z |
Paul, when a system that purports to maintain serializability for
consistency turns out not to be serializable, just about all other
claims need to be reconsidered. For example, does it depend on
serializability to keep replicas from diverging? I can't point to a
smoking gun since I stopped reading when I realized that it wasn't, in
fact, serializable. But it's based on the idea that replicas begin
identical, so transactions applied to each replica will yield
identical results. If that isn't true, the replicas completely diverge...
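To make the divergence risk concrete, a minimal illustration (mine, not drawn from the paper): two non-commuting updates applied in different orders on two replicas leave them in different states, which is exactly what a total commit order, or serializability, is meant to rule out.

```python
# Illustrative only: non-commuting updates applied in different orders
# on two replicas leave them with different state.

def t1(state):               # T1: x = x + 1
    return {"x": state["x"] + 1}

def t2(state):               # T2: x = x * 2
    return {"x": state["x"] * 2}

replica_a = {"x": 10}
replica_b = {"x": 10}

for txn in (t1, t2):         # replica A applies T1 then T2
    replica_a = txn(replica_a)
for txn in (t2, t1):         # replica B applies T2 then T1
    replica_b = txn(replica_b)

print(replica_a)             # {'x': 22}
print(replica_b)             # {'x': 21} -- the replicas have diverged
```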
Paul Ruizendaal wrote:
>> My point is that the Kemme scheme is not serializable, which was its
>> claim to fame.
>>
>
> Why are you (of all people) so hung up on "classic serializable"? Snapshot
> isolation isn't serializable either and serializable is a red herring in
> this discussion. From earlier conversation in this thread, I think
> everybody agrees on what differences between nodes are perfectly okay.
> Let's just stick with the four rules:
> 1. Stable view of data within a transaction
> 2. Cannot modify a record which has been modified in a committed concurrent
> transaction (see the sketch below)
> 3. Obeys consistency constraints within and across transactions
> 4. Ann's rule: same state on all nodes when quiesced
> So far I believe both your design and the Kemme design comply with the
> four rules. So a cloud database has to be ACID and SCOA :^)
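For reference, rule 2 above amounts to a first-committer-wins check against the transaction's snapshot. A minimal sketch, assuming a per-record commit-version counter; the names are illustrative and taken from neither design:

```python
# Minimal sketch of rule 2 (first committer wins), assuming each record
# carries a commit-version counter; illustrative names only.

class ConflictError(Exception):
    pass

committed_version = {}                    # record id -> last committed version

def begin(record_ids):
    """Remember the versions visible when the transaction starts."""
    return {r: committed_version.get(r, 0) for r in record_ids}

def commit(snapshot, written_record_ids):
    """Abort if any written record was committed by a concurrent transaction."""
    for r in written_record_ids:
        if committed_version.get(r, 0) != snapshot.get(r, 0):
            raise ConflictError(f"record {r} modified by a concurrent commit")
    for r in written_record_ids:
        committed_version[r] = committed_version.get(r, 0) + 1

s1 = begin(["r1"])                        # T1 starts
s2 = begin(["r1"])                        # T2 starts concurrently
commit(s1, ["r1"])                        # T1 commits first
try:
    commit(s2, ["r1"])                    # T2 must abort under rule 2
except ConflictError as err:
    print("abort:", err)
```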
>
>
>> This does not say that it isn't consistent, but it's an
>> unnecessarily expensive way to implement consistency. That said, I'm
>> not prepared to vouch for its consistency, but I don't have a
>> counter-example either.
>>
>
> So we are back to the best way to minimize communication and maximize
> parallelism. This is indeed the interesting core of the discussion and I
> prefer to discuss it based on facts, figures and reasoned arguments.
>
>
>>>> The paper states the total ordering can deliver hundreds of messages
>>>> per second. Since each commit requires two messages (one to send the
>>>> commit and another, ordered, to reapply the commit), [...]
>>>>
>
> In my understanding that is not entirely correct. There is the multicast
> commit request sent out, which is delivered ordered to all nodes including
> the sender. That is one logical message, although the ordering and reliable
> delivery will require more messages (some at the TCP level, some at the GCS
> level). Then there is the commit ok/abort message which requires unordered
> but still reliable delivery. At the IP level there may be as many as 10
> messages; I don't really know: with e.g. acknowledgements piggybacked on
> payload it gets really tough to arrive at a deterministic number.
> Benchmarking is the only way.
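To make the "one logical message" point concrete, here is a rough sketch of that commit path: one totally ordered multicast of the writeset, the same deterministic certification test at every node, then an unordered commit/abort notification. The classes below are stand-ins for a group communication system such as Spread, not its actual API.

```python
from dataclasses import dataclass, field

@dataclass
class WriteSet:
    txn_id: int
    origin: int                            # node that executed the transaction
    snapshot: dict                         # record id -> version seen at start
    writes: dict                           # record id -> new value

@dataclass
class Node:
    node_id: int
    data: dict = field(default_factory=dict)
    versions: dict = field(default_factory=dict)

    def certify(self, ws):
        # Deterministic: every node sees the same ordered writesets, so every
        # node reaches the same decision without exchanging votes.
        return all(self.versions.get(r, 0) == v for r, v in ws.snapshot.items())

    def apply(self, ws):
        for r, value in ws.writes.items():
            self.data[r] = value
            self.versions[r] = self.versions.get(r, 0) + 1

def deliver_in_total_order(nodes, ordered_writesets):
    for ws in ordered_writesets:           # same delivery order at every node
        for node in nodes:
            if node.certify(ws):
                node.apply(ws)
                outcome = "COMMIT"
            else:
                outcome = "ABORT"
            if node.node_id == ws.origin:
                # In the real protocol this is the reliable, unordered
                # commit ok/abort message.
                print(f"txn {ws.txn_id}: {outcome}")

nodes = [Node(0), Node(1)]
w1 = WriteSet(1, origin=0, snapshot={"r1": 0}, writes={"r1": "a"})
w2 = WriteSet(2, origin=1, snapshot={"r1": 0}, writes={"r1": "b"})  # conflicts
deliver_in_total_order(nodes, [w1, w2])    # txn 1 commits, txn 2 aborts
```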
>
> All is relative. How many messages in an alternative design? This is an
> open question for me, as I haven't closed my mind on any approach. Note
> that we so far discussed the simple core of each protocol: in real life we
> have to deal with broken connections, failing nodes, lost messages,
> multiple deliveries of a message, slow nodes, etc. My gut feeling is that
> 90% of the messaging code will be about reliable delivery and failure
> cases, not the core algorithm.
>
>
>> It's a stupid idea. Its performance is directly related to the highest
>> latency in the network. OK when everything goes through a Gigabit
>> switch, but pretty dreadful otherwise.
>>
>
> I wasn't aware that Nimbus was being optimized for operation over a WAN or
> the Internet. Cool! What's your rationale for that?
>
> The use case that I have in mind is something that works comfortably for
> 10 users, but also for a wannabe salesforce.com or netsuite.com. The
> hardware that I have in mind is a LAN with 10 or 20 or 100 SQL processing
> nodes (not that I even have 10 nodes in my home network :^). As discussed
> several times before on this list, salesforce.com scales to ~30,000
> concurrent users on a single instance (they have ~10 instances) using
> homebrew J2EE middleware and Oracle RAC at a TCO of $15 per concurrent
> user, so that is the state of the art today.
>
> In a 10-year-old test on old hardware (PIII, 100 Mbps LAN), Spread needed
> 2-3 ms to totally order messages between 20 nodes. Note that this includes
> dealing with failure cases, monitoring group membership changes and
> guaranteeing virtual synchrony.
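As a back-of-envelope check against the "hundreds of messages per second" figure quoted earlier, using those 2-3 ms ordering latencies and ignoring pipelining of independent transactions (which would raise the numbers):

```python
# Back-of-envelope only: if ordered commits were processed strictly one at
# a time, throughput is bounded by 1 / ordering latency.

for latency_ms in (2.0, 3.0):
    per_second = 1000.0 / latency_ms
    print(f"{latency_ms:.0f} ms ordering latency -> "
          f"~{per_second:.0f} ordered commits/s if fully serialized")
```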
>
> What throughputs are you achieving in the current builds of Nimbus? On
> what hardware? LAN/WAN? Is that code also dealing with failure cases?
>
Jim Starkey
NimbusDB, Inc.
978 526-1376