Subject: Re: [Firebird-Architect] Re: Special Relativity and the Problem of Database Scalability
Author: Paul Ruizendaal
> I'm not sure that this description is 100% correct.

I may indeed be mistaken and the discussion in this list may bring us all
to a higher level of understanding.

> In my understanding (though I admit that I did not finish reading that
> particular paper), the write set is not "persisted" before commit. In
> other words, no single local write lock is acquired before commit. Only
> on commit are all update statements collected together and sent over the
> total order messaging system to all nodes, where they are applied
> locally.
>
> Otherwise you could have a situation where T1 updates a record on node A,
> T2 updates the same record on node B, then both decide to commit, and
> the messages sent on commit are ordered so that the write set from T2
> comes before the write set of T1. In that case the local commit of T1 on
> node A will succeed, but it will fail on node B, and vice versa.
>
> So, if this thinking is correct, there must exist some "buffer" between
> the actual database pages and the rest of the engine that temporarily
> "simulates" writes within the lifetime of a particular local transaction,
> and then sends the write set on commit over the GCS. This buffer must
> also support reads, since nobody prohibits me from reading a just
> modified record within my local transaction.

As I understand it, there was indeed such a buffer in Kemme's earlier
work from 1998-2000, when she was still at the venerable ETH in Zurich.
It was based on locks rather than MVCC:
http://www.cs.mcgill.ca/~kemme/papers/vldb00.pdf

The way I understand it, in the 2005 refinement the buffer role is taken
on by the snapshot. Both T1 and T2 will see their own update, but not the
other's, before commit. When T1 and T2 want to commit, they broadcast
their update message (also to themselves), but do not yet commit locally.
Only upon receipt of that message back does the local commit happen.
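
In a very rough sketch (the class and method names below are mine and
purely illustrative, not Kemme's or Firebird's), that buffer is simply
the transaction's private write set layered over its snapshot:

    class Transaction:
        def __init__(self, snapshot):
            self.snapshot = dict(snapshot)  # committed versions visible at start
            self.write_set = {}             # private, uncommitted updates

        def read(self, key):
            # A transaction sees its own updates, but nobody else's
            # uncommitted work: check the write set first, then the snapshot.
            if key in self.write_set:
                return self.write_set[key]
            return self.snapshot.get(key)

        def write(self, key, value):
            self.write_set[key] = value     # nothing is persisted or locked yet

        def request_commit(self):
            # On commit the write set goes to the group communication layer
            # for total order broadcast (also back to this node); the local
            # commit happens only when the message is delivered back.
            return self.write_set

    # T1 and T2 each see their own update, but not the other's:
    snapshot = {"rec": "v0"}
    t1, t2 = Transaction(snapshot), Transaction(snapshot)
    t1.write("rec", "v1")
    print(t1.read("rec"))   # 'v1'
    print(t2.read("rec"))   # 'v0' -- T1's update is invisible to T2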

Nodes A and B receive T2 first. Node A will do nothing yet, just queue the
message. Node B, seeing that T2 is a local transaction waiting to commit,
will now commit locally, and it will succeed. It broadcasts an (unordered)
"commit ok" message. Upon receipt of that message, Node A will also apply
and commit the pending T2 write set (which will always succeed).

T1 is delivered next on both nodes A and B. Node B will do nothing yet,
just as Node A did for T2 above. Node A again sees that T1 is a pending
local transaction and attempts to commit. This will fail, as the already
committed T2 has updated the record after T1 began. It broadcasts an
(unordered) "commit abort" message. Upon receipt of that message, Node B
will discard the queued T1 write set.
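
To make the two-node walkthrough concrete, here is a much-simplified,
single-process sketch of the delivery logic as I read it. The version
counters stand in for whatever mechanism the real engine uses to detect
that T2 committed after T1 took its snapshot, and all names are again
mine, not from the papers:

    class Node:
        def __init__(self, name):
            self.name = name
            self.peers = []                 # all nodes in the group, including self
            self.db = {"rec": ("v0", 0)}    # key -> (value, commit version)
            self.version = 0                # last commit version on this node
            self.local = {}                 # local txn id -> (start version, write set)
            self.queued = {}                # remote write sets awaiting an outcome

        def begin(self, txn_id, write_set):
            # Remember the version at which the local transaction took its snapshot.
            self.local[txn_id] = (self.version, write_set)

        def on_deliver(self, txn_id, origin, write_set):
            # Write sets are delivered in the same (total) order on every node.
            if origin is self:
                start, _ = self.local.pop(txn_id)
                ok = all(self.db.get(k, (None, 0))[1] <= start for k in write_set)
                if ok:
                    self.apply(write_set)
                print(self.name, "local", txn_id, "committed" if ok else "aborted")
                for peer in self.peers:     # the outcome message needs no ordering
                    if peer is not self:
                        peer.on_outcome(txn_id, ok)
            else:
                self.queued[txn_id] = write_set   # do nothing yet, just queue it

        def on_outcome(self, txn_id, ok):
            write_set = self.queued.pop(txn_id)
            if ok:
                self.apply(write_set)       # always succeeds, as argued above
            print(self.name, "remote", txn_id, "applied" if ok else "discarded")

        def apply(self, write_set):
            self.version += 1
            for k, v in write_set.items():
                self.db[k] = (v, self.version)

    # The scenario above: T1 updates the record on A, T2 on B, and the
    # total order delivers T2's write set before T1's on both nodes.
    a, b = Node("A"), Node("B")
    a.peers = b.peers = [a, b]
    a.begin("T1", {"rec": "T1"})
    b.begin("T2", {"rec": "T2"})
    a.on_deliver("T2", b, {"rec": "T2"})    # A just queues the remote write set
    b.on_deliver("T2", b, {"rec": "T2"})    # B commits T2; its "commit ok" lets A apply it
    b.on_deliver("T1", a, {"rec": "T1"})    # B just queues T1
    a.on_deliver("T1", a, {"rec": "T1"})    # A aborts T1; its "abort" makes B discard it

In the real protocol the "commit ok" / "commit abort" message of course
arrives asynchronously; the synchronous calls above simply mimic the
delivery order described in the walkthrough.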

> Regarding the serialization: the GCS does not have that many communication
> primitives - total order, causal order, FIFO order. I am not sure that
> causal order is enough to satisfy the serializability requirement
> defined by Ann.

You are right: consistent order is not a GCS primitive, just a
mathematical concept.

I think you are right that total order is required, but I'm not 100% sure
yet that Lamport's (partial) causal order cannot be made to work. Whether
this is worth considering depends on the additional communication needed to
move from causal to total order: if that extra network overhead is low,
relaxing to causal order is not worth doing -- assuming causal order would
work at all.

It is my understanding that Spread can totally order 300 to 500 messages
per second on a 100 Mbps LAN, and perhaps 3,000 on a 1,000 Mbps LAN.
Assuming that the average business app user generates one update
*transaction* per minute, the total order design can support nearly
200,000 concurrent users (assuming no other capacity constraints exist).
This is well in excess of the load on a single Salesforce.com instance,
so perhaps relaxing the ordering level is not worth thinking about.
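
For whoever wants to check the arithmetic behind that estimate (the
throughput figure is the assumed 3,000 ordered messages per second on a
gigabit LAN):

    msgs_per_second = 3000            # assumed Spread total order throughput, gigabit LAN
    txns_per_user_per_minute = 1      # one update transaction per business user per minute
    users = msgs_per_second * 60 // txns_per_user_per_minute
    print(users)                      # 180000 -- on the order of 200,000 concurrent users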

Note that the Kemme proof-of-concept maxes out on the limited ability of
Postgres to keep up with the updates: persisting it all to disk simply
takes too long. In a design with distributed, memory-based storage this
bottleneck disappears. The 'write everywhere' burden also becomes much
lighter: only the local cache needs updating.

Paul