Subject: Re: Well, here we go again
Author: paulruizendaal
Hi Jim,

"It's a classic cluster design like Rdb, Interbase, or Firebird:
shared disk synchronized by cluster lock manager"

Allow me a nerdy little pun: the brain is like a huge L2 cache on
reality, and like any cache it has its issues with stale data. The
classic design moved on in the late 90s. Let's talk about Oracle
first, and then move to the more interesting discussion of the
design pattern.

Since 9i, Oracle has implemented "cache fusion", where caches are
updated over a high-speed interconnect between the nodes. This
avoids the "disk ping" for cache updates that crippled the earlier
implementations. The 9i implementation was a bit basic and maxed out
at around 8 nodes. The 11g implementation is much better and scales
much further, especially when combined with (REF) hash-partitioned
tables. Google for "cache fusion 11" to get tons of material.

Now let's take a step back and think it through from first
principles. A lot of people recognise that a distributed memory
store is a great way to scale out. It is like a huge L2 cache on
data, to use your own words. Many people see that a hash-based node
allocation algorithm makes sense with real-world data. Memcached is
a straightforward implementation of those ideas. Some recognise the
benefits of using a stable hash when adding/removing nodes.
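
To make the "stable hash" point concrete, here is a minimal Python
sketch of a consistent-hash ring, which is what I mean by a stable
hash. The names are made up and it is not lifted from any real
memcached client; the point is that adding or removing a node only
remaps the keys in the arc next to that node.

    import bisect
    import hashlib

    def _hash(key):
        # md5 gives a placement that is stable across runs and machines
        return int(hashlib.md5(key.encode()).hexdigest(), 16)

    class HashRing:
        """Consistent-hash ring: adding or removing a node only remaps
        the keys between that node and its predecessor on the ring."""
        def __init__(self, nodes=(), replicas=100):
            self.replicas = replicas   # virtual nodes per physical node
            self._points = []          # sorted hash points
            self._owner = {}           # hash point -> node name
            for n in nodes:
                self.add_node(n)

        def add_node(self, node):
            for i in range(self.replicas):
                p = _hash("%s#%d" % (node, i))
                bisect.insort(self._points, p)
                self._owner[p] = node

        def remove_node(self, node):
            self._points = [p for p in self._points if self._owner[p] != node]
            self._owner = {p: n for p, n in self._owner.items() if n != node}

        def node_for(self, key):
            # the first ring point clockwise from the key's hash owns the key
            i = bisect.bisect(self._points, _hash(key)) % len(self._points)
            return self._owner[self._points[i]]

    ring = HashRing(["node-a", "node-b", "node-c"])
    print(ring.node_for("customer:42"))
    ring.add_node("node-d")            # most keys keep their old owner
    print(ring.node_for("customer:42"))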

Next one would like to add some structure, such as tables and
transactions. One approach could be to divide the memory into pages
and build a classic database structure on top, with dirty pages
written to disk on commit. This would be a simple RAC design.
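
As a toy illustration of that structure (names invented, nothing to
do with the real RAC code): pages live in memory, and commit simply
forces every dirty page to disk.

    import os

    PAGE_SIZE = 8192

    class PageStore:
        """Toy buffer cache: dirty pages are forced to disk at commit."""
        def __init__(self, path):
            self.path = path
            self.cache = {}     # page_no -> bytearray
            self.dirty = set()  # pages modified since the last commit

        def write(self, page_no, offset, data):
            page = self.cache.setdefault(page_no, bytearray(PAGE_SIZE))
            page[offset:offset + len(data)] = data
            self.dirty.add(page_no)

        def commit(self):
            # durability comes from forcing the dirty pages to disk
            mode = "r+b" if os.path.exists(self.path) else "w+b"
            with open(self.path, mode) as f:
                for page_no in sorted(self.dirty):
                    f.seek(page_no * PAGE_SIZE)
                    f.write(self.cache[page_no])
                f.flush()
                os.fsync(f.fileno())
            self.dirty.clear()

    store = PageStore("demo.db")
    store.write(0, 0, b"hello")
    store.commit()              # the change is now durable on disk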

But why commit to disk? One could also commit by replicating the
dirty pages to another node and lazily writing to disk. This scales
much better, but it has some issues as well. See the following link
for some interesting analysis. I will not copy it into this post,
but many of his points are worthy of comment, agreement or
disagreement.
http://www.openlinksw.com/weblog/oerling/?id=1229
(I like this sentence: "This would require some understanding of
first principles which is scarce out there" -- could be a Jim
quote :^)

If you agree, perhaps we can take the view of a "cloud" as a
distributed memory store, with durability provided by confirmed
memory replication to 2+ nodes, as a given, and build on that base.
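
A rough sketch of what I have in mind, with the network left out and
all names invented: the commit point is the moment enough peers
confirm they hold the dirty pages in memory, and the disk write
happens later from a backfill queue.

    import queue
    import threading

    class ReplicaNode:
        """Stand-in for a peer that acks once the pages are in its memory."""
        def __init__(self, name):
            self.name = name
            self.pages = {}

        def receive(self, pages):
            self.pages.update(pages)
            return True                    # ack: held in memory

    class ReplicatingStore:
        def __init__(self, replicas, min_acks=2):
            self.replicas = replicas
            self.min_acks = min_acks
            self.cache = {}                # page_no -> data
            self.dirty = {}                # pages touched since last commit
            self.disk = {}                 # stands in for the on-disk pages
            self.backfill = queue.Queue()  # committed pages awaiting disk
            threading.Thread(target=self._backfill_loop, daemon=True).start()

        def write(self, page_no, data):
            self.cache[page_no] = data
            self.dirty[page_no] = data

        def commit(self):
            # commit = enough replicas confirm they hold the dirty pages
            acks = sum(1 for r in self.replicas if r.receive(dict(self.dirty)))
            if acks < self.min_acks:
                raise RuntimeError("not enough replicas reachable, abort")
            self.backfill.put(dict(self.dirty))   # disk write is deferred
            self.dirty.clear()

        def _backfill_loop(self):
            while True:
                pages = self.backfill.get()
                self.disk.update(pages)    # lazy write-behind to "disk"
                self.backfill.task_done()

    store = ReplicatingStore([ReplicaNode("n1"), ReplicaNode("n2")])
    store.write(0, b"hello")
    store.commit()   # durable once two peers hold the page in memory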

We could discuss how to organise the data so that the number of
inter-node messages is minimised under a stateless, web-app-like
workload, and how the disk backfill can be guaranteed to keep up
with committed changes.
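
For the first point, the rough idea in code (again just a sketch
with made-up names): if every table a web request touches is
partitioned on the same key, the whole request resolves to one node
and needs no inter-node traffic.

    import hashlib

    NODES = ["node-a", "node-b", "node-c"]

    def owner(partition_key):
        # a real implementation would use the stable hash ring above,
        # not a plain modulo, but the co-location argument is the same
        h = int(hashlib.md5(partition_key.encode()).hexdigest(), 16)
        return NODES[h % len(NODES)]

    # orders, invoices and session state all partitioned on customer id:
    # the whole request for customer 42 is served by a single node
    print(owner("customer:42"))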

Would that work, or does that come too close to the (undoubtedly
confidential) core of your ideas?

Paul