Subject Re: Cloud databases
Author paulruizendaal
> So, rather than think of the cloud as meaning that all of the engine
> functions are performed by each machine, it is better to think of
> de-constructing the engine into its raw pieces and seeing how to
> distribute those functions.

Agreed. My understanding from this forum is that we have the
following deconstructions on the whiteboard:

1. Distributed object stores (memcached, Coherence, etc.). A separate
layer of machines keeping the data in memory. Proven to work and to
scale out to 100+ machines. Problem: it is not relational, and it is
not obvious how to address that.
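For what it's worth, here is a toy sketch of the client-side hashing
such stores rely on (the node count and the modulo scheme are
placeholders of mine, not any particular product's API). It shows why
scale-out is easy and why relational queries are not:

import hashlib

# Hypothetical cache layer: four in-memory "nodes"; a real deployment
# would be 100+ separate machines running memcached, Coherence, etc.
NODES = [dict() for _ in range(4)]

def node_for(key):
    # Hash the key and pick a node; adding machines just changes the modulo.
    h = int(hashlib.md5(key.encode()).hexdigest(), 16)
    return NODES[h % len(NODES)]

def put(key, value):
    node_for(key)[key] = value

def get(key):
    return node_for(key).get(key)

put("customer:42", {"name": "Smith", "balance": 100})
print(get("customer:42"))
# There is no efficient way to ask "all customers with balance > 50";
# that is the "not relational" problem noted above.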

2. Middleware (e.g. Sequoia, some of the Kemme research, sharding). A
separate layer at the SQL level which distributes and coordinates
queries across plain-vanilla database servers. Seems to work for some
usage patterns, but our "whiteboard" is still a bit fuzzy on how it
is best implemented and how it scales under a heavy write load.
Another issue is how it deals with queries that do not follow the
partitioning.

Although I have no inside knowledge, I would not be surprised if the
Salesforce architecture included some of this approach. It would work
for them, as everything is partitionable by tenant and cross-tenant
queries do not happen.
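To make approach 2 a bit more tangible, a minimal sketch of a shard
router (my own illustration, not Sequoia's design; the shards are
stand-ins for plain-vanilla database servers):

# Hypothetical SQL-level shard router; a sketch of approach 2, not
# Sequoia's actual design. Each "shard" stands in for a plain-vanilla
# database server, modelled here as an in-memory list of rows.
SHARDS = {"db0": [], "db1": [], "db2": []}

def shard_for(tenant_id):
    # Partitioning rule: all rows of one tenant live on one shard.
    return list(SHARDS)[tenant_id % len(SHARDS)]

def insert(tenant_id, row):
    SHARDS[shard_for(tenant_id)].append(row)

def select(predicate, tenant_id=None):
    if tenant_id is not None:
        # The easy case: the query follows the partitioning.
        return [r for r in SHARDS[shard_for(tenant_id)] if predicate(r)]
    # The awkward case noted above: the query ignores the partitioning,
    # so the middleware must fan out to every shard and merge results.
    return [r for rows in SHARDS.values() for r in rows if predicate(r)]

insert(7, {"tenant": 7, "amount": 10})
print(select(lambda r: r["amount"] > 5, tenant_id=7))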

3. SQL on top of a distributed datastore (Nimbus, Kemme, even RAC).
This approach is currently filling up most of our whiteboard. To put
a phrase on it, let's call this distributing at the transaction
level. In essence we separate storage from relational processing.
This has two subtopics: (i) coordinating transactions across a
network, and (ii) storing data across a network.

Under (i) we have: don't do it (e.g. Google), Jim's careful
bookkeeping, Kemme's total ordering, and fused caches (RAC). Under
(ii) we have: don't do it (Jim, Kemme) and partitioning (MySQL NDB
Cluster?).
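As a toy illustration of what total ordering under (i) buys us (a
sketch of the general idea only; the central sequence counter is my
simplification, not Kemme's group-communication protocol and not
RAC's cache fusion):

import itertools

# Every replica applies write sets in one globally agreed order, so
# replicas converge without locks spanning the network.
sequence = itertools.count(1)
REPLICAS = [dict() for _ in range(3)]    # each replica's key/value state

def commit(writeset):
    # Assigning the global sequence number is the coordination step; in
    # a real system it costs a network round trip per commit.
    seq = next(sequence)
    for replica in REPLICAS:
        replica.update(writeset)         # applied in the same order everywhere
    return seq

commit({"x": 1})
commit({"x": 2, "y": 3})
print(REPLICAS[0] == REPLICAS[1] == REPLICAS[2])    # True: replicas converge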

On item 3(i) we had come to the 'conclusion' that coordinating
transactions would max out (on current network hardware) at some 1 to
5 thousand write tps across 25..100 machines. [Dissenting views:
please speak up.]
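One back-of-envelope way to arrive at numbers in that range (the
round-trip latencies below are my own assumptions about current LAN
hardware, not figures established in this thread):

# If every commit needs one ordered coordination step (sequencing,
# certification or a cache-fusion message) and that step is
# effectively serialised, throughput is roughly 1 / round-trip time,
# largely independent of whether there are 25 or 100 machines.
for rtt_ms in (1.0, 0.2):                 # assumed LAN round trips
    print("%.1f ms round trip -> ~%d write tps" % (rtt_ms, 1000 / rtt_ms))
# -> ~1000 and ~5000 write tps, the 1 to 5 thousand ceiling quoted above.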

Item 3(ii) is still an open debate. Under the "don't do it" approach
each storage node has to write everything. With current hardware,
that will only work if we use mostly streaming writes (100 MB/s
speeds), not random writes (100 tps or so). Jim seems to have chosen
streaming writes; Kemme does not address the issue. Partitioning
might work, but so far it looks like a lot of bookkeeping at both the
SQL level and the storage level.
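A quick calculation behind the streaming-vs-random contrast (the 1 KB
record size is my assumption for illustration; the 100 MB/s and ~10 ms
seek figures are the conventional disk characteristics implied above):

# Rough disk arithmetic for the "don't do it" storage approach.
streaming_mb_per_s = 100        # sequential write bandwidth from the text
seek_ms = 10                    # a random write costs roughly one seek
record_kb = 1                   # assumed record size

streaming_tps = streaming_mb_per_s * 1024 // record_kb   # ~100,000 writes/s
random_tps = 1000 // seek_ms                              # ~100 writes/s
print(streaming_tps, random_tps)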

Better disk hardware would push the bottleneck out to the 1 to 5
thousand write tps level, at which point we hit the network
bottleneck as well. If affordable flash SSDs with 0.1 ms random-write
latency exist, the 3(ii) issue becomes moot in practical and economic
terms, and we might as well use simple, classical algorithms with
random writes.
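The same quick arithmetic for the SSD case:

# A 0.1 ms random write means a single device sustains about 10,000
# random write tps, comfortably above the 1 to 5 thousand tps network
# ceiling from 3(i), so storage stops being the limiting factor.
ssd_write_latency_ms = 0.1
print(int(1000 / ssd_write_latency_ms))    # ~10,000 random writes/s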


Any bits of deconstruction that I missed?

Paul