Subject Re: [Firebird-Architect] Kemme Paper
Author Jim Starkey
For god's sakes, Paul, I waded into that swamp at your insistence. Here
is a review of the bidding:

Me (Jim):

> The information required for the two mechanisms is radically different.
> A commit sequencer needs to know the full read and write sets, so
> if a transaction counted a million records, the commit sequencer needs
> to know not only which million records but which records weren't read
> (e.g. hadn't arrived at that node yet).


You (Paul):

Have you actually read and pondered the paper?? A one million record
update would first occur locally, a single, short commit request would be
sent out and when it is received back again a local commit is attempted. If
it fails, we're done. If it commits, a second commit confirmed message
would go out to the other nodes, where we can be certain it will succeed.
Sending write sets as data is an optimisation for simple updates; complex
updates are sent as the (compiled) statement. So it is 1..2 short messages
per write transaction.


No, I hadn't read and pondered the paper. I was responding to what it
would take to implement a Kemme scheme that was serializable.

So, I read and pondered the paper. It doesn't work and it isn't
serializable. It's just wrong. Actually, shallow and wrong.
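
To make that concrete: the certification step in these write-set schemes
comes down to something like the sketch below (the names are mine, invented
for illustration; this is neither the paper's pseudocode nor NimbusDB code).
Every node gets the write sets in the same total order and can run the same
check, so, given the same order, every node reaches the same verdict. Notice
what the check looks at: the write set only. The rows a transaction read
never enter into it, which is exactly why the result is first-committer-wins
snapshot behavior and not serializability.

    import java.util.HashMap;
    import java.util.Map;
    import java.util.Set;

    final class WriteSet {
        final long snapshotVersion;   // commit version the transaction read from
        final Set<String> rowIds;     // rows the transaction wrote (values omitted)
        WriteSet(long snapshotVersion, Set<String> rowIds) {
            this.snapshotVersion = snapshotVersion;
            this.rowIds = rowIds;
        }
    }

    final class Certifier {
        // rowId -> commit version of the last transaction that wrote it
        private final Map<String, Long> lastCommittedVersion = new HashMap<>();
        private long commitCounter = 0;

        // Called on every node, in the same (total) order, for every delivered
        // write set. Identical input order + identical state = identical verdict.
        synchronized boolean certifyAndApply(WriteSet ws) {
            for (String rowId : ws.rowIds) {
                Long last = lastCommittedVersion.get(rowId);
                if (last != null && last > ws.snapshotVersion) {
                    return false;     // a concurrent, already-committed writer got there first
                }
            }
            long commitVersion = ++commitCounter;
            for (String rowId : ws.rowIds) {
                lastCommittedVersion.put(rowId, commitVersion);
            }
            // ... install the new row values into local storage here ...
            return true;
        }
    }

In the flow Paul describes above, the originating node attempts its local
commit when its own multicast comes back, and only if that succeeds does a
commit-confirmed message go out to the other nodes.
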

My original position is that a database system can't be distributed,
serializable, and scalable. I stand by that. Serializability is a
sufficient condition for consistency, but it isn't a necessary one.

But going back to the Kemme paper, if the scheme isn't serializable,
what is the point of the message total ordering and multi-site commit?
It's just complex overhead that mucks up the system without any added
value. And what it doesn't do is let the database system tell the poor
application that an update failed so it can take corrective action. In
short, it can't implement SQL semantics for error reporting. But
mostly, it doesn't do what it purports to do.
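
To spell out the error contract I'm talking about: a SQL application learns
that an update or a commit failed synchronously, on its own connection, and
takes its corrective action right there. Anything decided later, somewhere
else, is too late for it. A plain JDBC illustration (the table and the
amounts are invented):

    import java.sql.Connection;
    import java.sql.PreparedStatement;
    import java.sql.SQLException;

    public class DebitExample {
        static void debit(Connection conn, long accountId, long cents) throws SQLException {
            conn.setAutoCommit(false);
            try (PreparedStatement stmt = conn.prepareStatement(
                    "UPDATE accounts SET balance = balance - ? WHERE id = ?")) {
                stmt.setLong(1, cents);
                stmt.setLong(2, accountId);
                stmt.executeUpdate();
                conn.commit();    // the verdict has to be known here, not later on some other node
            } catch (SQLException e) {
                conn.rollback();  // corrective action: retry, report, compensate
                throw e;
            }
        }
    }
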




Paul Ruizendaal wrote:
>> My point is that the Kemme scheme is not serializable, which was its
>> claim to fame.
>>
>
> Why are you (of all people) so hung up on "classic serializable"? Snapshot
> isolation isn't serializable either and serializable is a red herring in
> this discussion. From earlier conversation in this thread, I think everybody
> agrees on what differences between nodes are perfectly okay. Let's just
> stick with the four rules:
> 1. Stable view of data within a transaction
> 2. Cannot modify a record which has been modified in a committed concurrent
> transaction
> 3. Obeys consistency constraints within and across transactions
> 4. Ann's rule: same state on all nodes when quiesced
> So far I believe both your design and the Kemme design comply with the
> four rules. So a cloud database has to be ACID and SCOA :^)
>

Oh, god, another FLA. OK, what is SCOA?
(http://en.wikipedia.org/wiki/Shorecrest_High_School)

>
>> This does not say that it isn't consistent, but it's an
>> unnecessarily expensive way to implement consistency. That said, I'm
>> not prepared to vouch for its consistency, but I don't have a
>> counter-example either.
>>
>
> So we are back to the best way to minimize communication and maximize
> parallelism. This is indeed the interesting core of the discussion and I
> prefer to discuss it based on facts, figures and reasoned arguments.
>
>
>>>> The paper states the total ordering can deliver hundreds of messages
>>>> per second. Since each commit requires two messages (one to send the
>>>> commit and another, ordered, to reapply the commit), [...]
>>>>
>
> In my understanding that is not entirely correct. There is the multicast
> commit request sent out, which is delivered ordered to all nodes including
> the sender. That is one logical message, although the ordering and reliable
> delivery will require more messages (some at the TCP level, some at the GCS
> level). Then there is the commit ok/abort message, which requires unordered
> but still reliable delivery. At the IP level there may be as many as 10
> messages; I don't really know: with e.g. acknowledgements piggybacked on
> payload it gets real tough to arrive at a deterministic number.
> Benchmarking is the only way.
>
> All is relative. How many messages in an alternative design? This is an
> open question for me, as I haven't closed my mind on any approach. Note
> that we have so far discussed only the simple core of each protocol: in
> real life we have to deal with broken connections, failing nodes, lost
> messages, multiple deliveries of a message, slow nodes, etc. My gut feeling
> is that 90% of the messaging code will be about reliable delivery and
> failure cases, not the core algorithm.
>
>
>> It's a stupid idea. Its performance is directly related to the highest
>> latency in the network. OK when everything goes through a Gigabit
>> switch, but pretty dreadful otherwise.
>>
>
> I wasn't aware that Nimbus was being optimized for operation over a WAN or
> the Internet. Cool! What's your rationale for that?
>
Mostly discovering that we could. I rather like the idea of a NimbusDB
ATM database half in Tokyo and half in New York. Most of the time Tokyo
data stays in Tokyo and New York data stays in New York. But when a New
Yorker with a craving for sushi runs out of money in Tokyo, well, Bob's
your uncle.

The serious point is that NimbusDB can straddle data centers or let a
company with a NimbusDB database in a public cloud keep an archive node
locally just in case something happens to the public cloud provider.

Another benefit is keeping an archive node in a bunker under the state
of Nebraska, just in case.

Incidentally, I've invented a new term. Some definitions:

* NimbusDB /database:/ a set of related tables, indexes, access
permissions, etc.
* NimbusDB /chorus:/ a set of computers serving a NimbusDB database
    * NimbusDB /cloud:/ a set of computers hosting an arbitrary number
of NimbusDB choruses / databases

Now we have NimbusDB /duets,/ which are the minimum chorus configuration
of one transactional node and one archive node, and NimbusDB /quartets,/
which are four-node, fully redundant choruses.

Ergo, NimbusDB sings.
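
If it helps to pin the terms down, a toy model (the role names and the Java
are purely illustrative, not NimbusDB code; the exact role mix of a quartet
isn't spelled out here):

    import java.util.List;

    enum NodeRole { TRANSACTIONAL, ARCHIVE }

    // A chorus: the set of nodes serving one NimbusDB database.
    final class Chorus {
        final List<NodeRole> nodes;
        Chorus(NodeRole... roles) { this.nodes = List.of(roles); }

        long count(NodeRole role) {
            return nodes.stream().filter(r -> r == role).count();
        }

        // Duet: the minimum configuration, one transactional node plus one archive node.
        boolean isDuet() {
            return count(NodeRole.TRANSACTIONAL) == 1 && count(NodeRole.ARCHIVE) == 1;
        }

        // Quartet: four nodes, fully redundant.
        boolean isQuartet() {
            return nodes.size() == 4;
        }
    }
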
> The use case that I have in mind is something that works comfortably for
> 10 users, but also for a wannabe salesforce.com or netsuite.com. The
> hardware that I have in mind is a LAN with 10 or 20 or 100 SQL processing
> nodes (not that I even have 10 nodes in my home network :^). As discussed
> several times before on this list, salesforce.com scales to ~30,000
> concurrent users on a single instance (they have ~10 instances) using
> homebrew J2EE middleware and Oracle RAC at a TCO of $15 per concurrent
> user, so that is the state of the art today.
>
> In a 10 year old test on old hardware (PIII, 100Mbps LAN), Spread needed
> 2-3 ms to totally order messages between 20 nodes. Note that this includes
> dealing with failure cases, monitoring group membership changes and
> guaranteeing virtual synchrony.
>
> What throughputs are you achieving in the current builds of Nimbus? On
> what hardware? LAN/WAN? Is that code also dealing with failure cases?
>
>

I have to duck those questions. Sorry.

--
Jim Starkey
NimbusDB, Inc.
978 526-1376


