Subject Re: [Firebird-Architect] Re: Well, here we go again
Author Jim Starkey
paulruizendaal wrote:
> Hi Jim,
>
> It would be interesting to widen this thread to a discussion of
> highly scalable web apps and the databases underpinning them. I agree
> with some of your observations, whilst others seem out of place. As
> you opened this thread, I assume you are interested in having such a
> chat.
>
Sounds like fun. Not all open source database projects like kicking
ideas around.

> Perhaps first a few remarks for the benefit of the rest of the
> community who may want to participate. There is little magic in
> scaling web apps out, although a lot of engineering skill is
> required. For instance, the web log "high scalability" has
> descriptions of the architecture of many high volume web apps and
> links to blogs and presentations by their key engineers (see
> http://highscalability.com/).
>
> In very broad strokes, being highly scalable boils down to being as
> stateless/RESTful as possible and assigning individual bits of
> workload across a broad group of machines. Caching CPU intensive
> workload steps is also often part of the design. As said, little
> magic, but lots of skill involved.
>
I have to disagree here. There are two different types of problems out
there and, not surprisingly, they require different solutions.

Very large, hard problems are solvable in the manner you describe,
usually with a variation of MapReduce (Hadoop). Web-wide search
certainly falls into this category. These are important applications,
but they are also rare. Google Web search is based on MapReduce (though
not Hadoop), and Google's implementation is cloud based.

Another type of application is Amazon, where every generated page
contains data from probably a dozen separate applications with little
cross talk. Firing off requests to each of the applications and
integrating the results makes a lot of sense. This is an example of what
is generally called grid computing, where specific servers handle
specialized tasks.

But most of the remaining 99.999% of Web applications aren't like Google
or Amazon. They may have a large amount of data, but the data tends to
cluster naturally. And, of course, if they're successful, they have a
ferocious hit rate.

The metric for Web applications is page latency at a given load. If a
page takes the better part of a second to deliver, the users are going
to get bored and go someplace else, the hit rate drops, and somebody's
in trouble.

The sources of page latency are:

* The time of a network round trip between the app server and the
database server times the number of trips required to generate a page.
* Network bandwidth (usually negligible)
* The aggregate database processing time
* Process context switches
* Thread context switches

A decent dynamic Web page takes one to two dozen queries. That many
round trips between the app server and the database is untenable, so
almost all Web applications pre-compute boilerplate, etc. Not
desirable, but generally recognized as unavoidable.
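
To put rough numbers on that (the figures below are illustrative
assumptions, not measurements), even a handful of live queries is
dominated by round-trip time:

// Back-of-the-envelope page latency. The RTT, query count, and per-query
// processing time are assumptions for illustration, not measurements.
public class PageLatency {
    public static void main(String[] args) {
        double rttMs = 1.0;          // assumed app-server <-> db-server round trip
        double dbPerQueryMs = 0.5;   // assumed server-side time per query
        int naiveQueries = 18;       // "one to two dozen queries"
        int liveQueries = 5;         // assumed remainder after pre-computing boilerplate

        System.out.printf("all queries live:   %.1f ms%n",
                naiveQueries * (rttMs + dbPerQueryMs));
        System.out.printf("boilerplate cached: %.1f ms%n",
                liveQueries * (rttMs + dbPerQueryMs));
    }
}

Shaving the per-query database time barely moves the total; cutting
round trips does.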

That still leaves a modest number of critical queries to generate a
useful Web page. Given that general-purpose database systems
(presently) run on a single computer, something has to give to sustain a
high hit rate. Here are some of the solutions:

* Read replication, where one master replicates to hundreds of slaves
that do most of the work.
* Memcached, where queries and associated result sets are cached in
memory on the app servers.
* Sharding, where "shards" of related data are distributed to
logically independent database servers (a rough sketch follows the
list). This generally requires either replication or data redundancy.
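
For concreteness, here is a minimal sketch of the sharding idea: route
each customer's rows to one of N independent database servers by hashing
a stable key. The class, key scheme, and node list are invented for
illustration; real deployments usually use consistent hashing so that
adding a node does not reshuffle every key.

import java.nio.charset.StandardCharsets;
import java.util.List;
import java.util.zip.CRC32;

// Minimal sharding sketch: pick one of N logically independent database
// servers by hashing a stable customer key. Illustrative only.
public class ShardRouter {
    private final List<String> nodes;   // e.g. JDBC URLs of the shard servers

    public ShardRouter(List<String> nodes) {
        this.nodes = nodes;
    }

    public String nodeFor(String customerKey) {
        CRC32 crc = new CRC32();
        crc.update(customerKey.getBytes(StandardCharsets.UTF_8));
        int index = (int) (crc.getValue() % nodes.size());
        return nodes.get(index);
    }
}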

These solutions are rife with problems, however:

* None are ACID. Stale data abounds.
* The latency for an update to propagate grows with the size of the
network.
* These put the burden of scalability on the application programmer.

The actual problem is simpler: Modern database systems aren't up to the
demands put on them.

What has to give at the really big guys is the idea of a complex, ACID
database. The single table, complex row model is a compromise between
application-programmer friendliness and what the cloud community knows
about building database systems. The model is not pretty, but it is
easy to replicate for both performance and availability while
minimizing network traffic. Oh, and with only single record updates
supported, concurrency and consistency control aren't that hard. It
also works very well with the multiple separate application model for
Web page generation.
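
From the application side, the single table, complex record model boils
down to something like the following. This is a generic sketch in the
spirit of BigTable or SimpleDB, not any vendor's actual API; the names
are invented.

import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Generic sketch of the "single table, complex record" model: one logical
// table, records addressed by key, each record a bag of attributes, and
// only single-record put/get -- no joins, no multi-record transactions.
public class ComplexRecordStore {
    private final Map<String, Map<String, String>> table = new ConcurrentHashMap<>();

    // The only write operation: replace one record atomically.
    public void put(String key, Map<String, String> record) {
        table.put(key, new HashMap<>(record));
    }

    // The only read operation: fetch one record by key.
    public Map<String, String> get(String key) {
        Map<String, String> record = table.get(key);
        return record == null ? Map.of() : new HashMap<>(record);
    }
}

Because no update ever spans two records, each record can be partitioned
and replicated independently, which is exactly what makes the model easy
to scale and awkward to program against.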

There are other less popular but interesting models. One is
Netfrastructure, an example of an aggregating interface. In the
Netfrastructure architecture, the application code runs in a JVM inside
the database server, allowing a page to be generated with a single round
trip between the Web server and the database server (the database
essentially becomes the application server, skipping a whole process).
The advantage of this architecture is that the overhead of a JDBC
database call is on the order of 10 nanoseconds rather than 1 to 10
milliseconds for a round trip between a traditional application server
and a database server. There are lots of other forms of aggregating
interfaces using serialized object streams, XML, or other structures for
returning results of many queries in a single exchange. All, however,
require application logic in the database server.
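
The general shape of an aggregating interface looks roughly like this:
the page logic lives next to the engine, issues its queries over an
in-process connection, and hands back the finished page in one exchange.
This is a hypothetical sketch using plain JDBC; it is not
Netfrastructure's actual API, and the class and table names are invented.

import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;

// Sketch of an aggregating interface: page generation runs inside the
// database server, so each query is a local call, not a network round trip,
// and the Web server gets the finished page in a single exchange.
public class ProductPageGenerator {

    // 'conn' is assumed to be an in-process connection handed to us by the engine.
    public String generate(Connection conn, int productId) throws SQLException {
        StringBuilder page = new StringBuilder("<html><body>");

        try (PreparedStatement ps =
                 conn.prepareStatement("SELECT name FROM products WHERE id = ?")) {
            ps.setInt(1, productId);
            try (ResultSet rs = ps.executeQuery()) {
                if (rs.next()) {
                    page.append("<h1>").append(rs.getString("name")).append("</h1>");
                }
            }
        }

        try (PreparedStatement ps =
                 conn.prepareStatement("SELECT body FROM reviews WHERE product_id = ?")) {
            ps.setInt(1, productId);
            try (ResultSet rs = ps.executeQuery()) {
                while (rs.next()) {
                    page.append("<blockquote>").append(rs.getString("body"))
                        .append("</blockquote>");
                }
            }
        }
        // ...further queries for recommendations, inventory, etc., all local calls.
        return page.append("</body></html>").toString();
    }
}

Each executeQuery() here costs nanoseconds rather than the milliseconds
of a round trip between a traditional app server and database server.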

> Let's turn to the user requirements next. "There is a strong trend in
> cloud computing away from relational databases, incidentally.
> Google's BigTable, Amazon's SimpleDB, and Microsoft's unannounced
> cloud database service all abandon the relational model for different
> single table, complex record models." True, but I disagree with the
> reason you give for this, i.e. that relational stuff won't scale. In
> my view most of the current web apps (a web shop, a facebook, a
> twitter, google, etc.) simply don't need a relational database; the
> needs of the app designer are often better served by the "single
> table, complex record" model.
>
They are forced to use a cumbersome, restrictive data model that
requires redundant data storage because that is the only alternative
available. Nobody even thought of doing those sorts of unnatural acts
until it became clear that existing relational systems could never meet
their throughput and latency requirements.

I believe that the relational model has passed the test of time and is
the organization of choice, other things being equal. The hierarchical,
network (CODASYL), and "object oriented" database models have all died
while relational systems have thrived. The issue is not the model
(though I prefer semantic extensions) but the implementation.
> "Existing relational products are designed around single node
> technology that has neither the reliability nor scalability these
> guys require." For the purposes of this discussion, that statement is
> too absolute. Of the sites that do use or need a relational backend,
> most engineers report that the database was initially a bottleneck,
> but that it could be solved via a set of common techniques (caching,
> sharding, master-slave replication, etc.)
>
Sharding can be ACID, but requires data redundancy. The rest aren't
ACID. ACID is important. There is nobody in his or her right mind who
would forgo ACID if an ACID solution met the performance requirements.
> Take the example of salesforce.com that really needs a true
> relational backend and has solved the scalability issue using off-the-
> shelf technology. Salesforce.com uses an Oracle RAC cluster built
> around regular "lintel" hardware. There is no magic in RAC,
> essentially it is Firebird classic, using a distributed lock manager
> and cache fusion (i.e. when a cached page is written to, the caches
> on other machines are not invalidated but receive the update pushed
> via a high speed network link).
>
Last I looked, Salesforce used a zillion MySQL servers with each
instance a separate MySQL "database". Their limitation is the number of
files in a directory. This is not to say that I like this model, just
that it gets the job done.
> So, the challenge isn't so much in figuring out how to make it work,
> the challenge is in making scaling out so simple that even a school
> leaver can do it right, thus pushing down the TCO of a scalable web
> app to that of a simple LAMP system (i.e. about $15/yr per user seat).
> If NimbusDB scales out to hundreds of standard lintel boxes whilst
> being as simple as Firebird to install and operate, you've cracked
> the problem.
>
That's what I'm trying to do -- make a single ACID database out of
hundreds of replicating commodity boxes. Or maybe two or three
commodity boxes, if that's all it takes. The basic rule is that the
cloud must be elastic -- nodes can be easily added when needed or
removed for maintenance. The database system must be cloud resident
with no single point of failure. This requires different thinking and
different architectures.
> "There will aways be a point of diminishing returns where the delta
> increase in replication cost cancels the benefit of an additional
> node. But if this point happens at a hundred nodes, we will still see
> a 100X gain in performance over a single node system." You probably
> don't mean this literally, but if you do, I would challenge the
> arithmetic.
>
Compare Nimbus and Falcon. Each is designed to operate mostly out of
memory. Falcon, however, has to go to great lengths to synchronize
memory with a disk. The synchronization and I/O overhead are
significant drains on performance. Pure in-memory (i.e. non-persistent)
relational systems are much faster, but with the obvious drawback. If
each node in a cloud could execute at the speed of an in-memory
database, the cloud would scale linearly. But to make this work, we
need replication, and the cost of replication grows with the number of
nodes in the cloud. There are lots of clever things we can do to
minimize the cost of replication, but we can't avoid it.

My best guess is that each Nimbus SQL node will execute at about 3X
Falcon, with a performance crossover at about 100 quad-core nodes. If
those numbers are correct, a cloud of 33 Nimbus SQL nodes should give
approximately 100X Falcon.
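
Spelling out the arithmetic behind that guess (the 3X figure is the
guess above, and the sketch deliberately ignores replication overhead,
which is what eventually produces the crossover):

// Back-of-the-envelope for the numbers above. The per-node 3X figure is a
// guess, and replication overhead is ignored entirely.
public class CloudSpeedup {
    public static void main(String[] args) {
        double perNodeVsFalcon = 3.0;   // guessed per-node speed relative to Falcon
        int nodes = 33;
        System.out.println("aggregate vs. one Falcon: "
                + (nodes * perNodeVsFalcon) + "X");   // prints 99.0X, roughly the 100X above
    }
}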
> Finally, let's turn to database design choices. What I understand
> about the NimbusDB concept comes from these quotes: "No, it isn't
> remotely similar to either. For example, nodes that do SQL don't do
> disk I/O, nodes that do disk I/O don't do SQL, and all files are
> written once and never updated." and "Pages: Just say no."
>
> The first statement is interesting, but seems beside the point of
> scalability. All relational databases have been internally divided
> along those lines since their inception, e.g. in Firebird the code
> above 'looper' is virtually independent from the code below. Of all
> databases I know, SQLite seems to be architected the most cleanly in
> this area, with a well defined virtual machine separating the two
> halves. Still, it is a cheap way of getting from X to ~2X performance.
>
The cost of a relational database engine isn't the SQL engine. Even a
really stupid SQL engine is fast. The problem is synchronizing memory
and the disk while remaining ACID.

Firebird uses careful write. I think that's too slow to be
competitive. Falcon uses a single-write serial log, as do many modern
engines. That works, but the synchronization is still very complex and
expensive.
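
For contrast, the simplest form of a serial log is just an append plus
one sync per commit; everything hard (group commit, torn-write
detection, checkpointing, recovery) sits on top of that. A generic
sketch, not Falcon's code:

import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

// Generic sketch of a single-write serial log: a transaction's changes are
// appended to one log file and made durable with a single force() per commit.
public class SerialLog implements AutoCloseable {
    private final FileChannel channel;

    public SerialLog(Path path) throws IOException {
        channel = FileChannel.open(path,
                StandardOpenOption.CREATE, StandardOpenOption.WRITE,
                StandardOpenOption.APPEND);
    }

    public synchronized void commit(byte[] changes) throws IOException {
        ByteBuffer record = ByteBuffer.allocate(4 + changes.length);
        record.putInt(changes.length).put(changes).flip();
        while (record.hasRemaining()) {
            channel.write(record);   // one sequential append...
        }
        channel.force(false);        // ...and one sync makes the commit durable
    }

    @Override
    public void close() throws IOException {
        channel.close();
    }
}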
> The bit about all files being written once and never updated is more
> interesting. This is a feature that is found in stuff like BigTable
> as well and is a choice driven by caching. The big challenge with
> caching is not the caching itself (just throw memcached at the
> problem); the invalidation of the cache(s) when the underlying data
> changes is the tough problem. Write once & only is a clean solution
> to the invalidation problem.
>
> "Write once & only" perhaps seems weird at first glance, but it is
> not so hard and working GPL'ed code exists:
> http://www.primebase.org/download/ See previous post for a link to
> the architecture whitepaper of PBXT.
>
Even better is having SQL nodes never write at all. The network is
better redundancy than a disk. Disks are bigger than memory, though,
and more persistent, so things ought to get written to disk, but there
is no reason for a SQL node to even bother.

Another thought to ponder is the idea of the cloud as an L2 cache.
> The "pages: just say no" seems to fit in with the above. A write once
> & only system doesn't need pages for storage management. Also, in a
> distributed version of this design it would probably be best
> to "ship" update logs of individual records, not of pages, to the
> target machine. The target machine would be found through some sort
> of stable hash, similar to memcached.
>
> In such a design, indeed it would be possible to scale a system by
> simply adding boxes, with very little scaling engineering skill
> required from the side of the DBA/app designer.
>
> Your other design ideas for NimbusDB seem to be unrelated to the
> scalability issue and a continuation of ideas floating around in
> other systems:
> - semantic data model: seems an extension of concepts in Postgres
> - unbounded types: a slight regression from SQLite's manifest typing
> - utf8/collations: a slight regression from SQLite's design
> - multiple requests per round trip: wasn't that in the HSQLDB http
> protocol?
>
Postgres tried a stupid implementation of the semantic extensions, fell
flat on its face, and gave it up. Various versions of the
"object-relational" model contain the semantic model as a subset. But
I'm just trying to make a useful database system rather than get a PhD,
so I don't need the rest of the baggage.
> All in all, you seem to have a great idea for an easy to use, highly
> scalable database. As I understand it, it is more a continuation of
> ideas & concepts that have been brewing for years and not a total
> break with the past. However, perhaps I'm just not getting it.
>
> Looking forward to your clarification & rebuttal :^) Other community
> members: feel free to join in with views & observations.
>
>

Indeed.

--
James A. Starkey
President, NimbusDB, Inc.
978 526-1376


