Subject: Re[2]: [IB-Architect] Classic vs. superserver (long, very very long)
Author: Nickolay Samofatov
Hello Ann,

Wednesday, October 09, 2002, 4:37:49 AM, you wrote:

> At 12:57 AM 10/9/2002 +0400, Nickolay Samofatov wrote:

>>There are ways of efficient interprocess communication.
>>Have you read the Linux or WinNT kernel specs?

> No, I haven't. The overwhelming problem with communication
> in the classic architecture, at least in my experience, is
> not the cost of IPC, but the cost of making it work on all
> the variants that appear in such disparate operating
> systems as Xenix, VMS, AIX, SCO, the Cobalt subset of Unix
> et al. If we limit ourselves to Windows, Linux, and MacOS,
> maybe that problem goes away.

FB2 has already dropped support for obsolete systems. Windows, Linux,
and Darwin are not the only modern systems out there. AIX supports a rich
IPC API too.

>>I would fix FB2 threading model but now I have a lot of
>>work on my direct projects. I fixed FB2 enough to suit my
>>project basic needs.

> I'd very much like to explore what you've done.

I somewhat reworked the DDL engine to make it reliable (with a single CS
connection, that is; with many connections you will probably not be able to
do anything) and fixed a lot of crashes and database corruption bugs.

>>What I think needs to be done is to start from CS and make
>>it use resources better (and minimize syscalls). Then
>>analyze its code to make it thread-safe and rewrite
>>y-valve using SS code as an example to implement
>>multi-threaded multi-process model.

> Could you explain multi-threaded multi-process a bit more?
> Currently that's an either/or choice. I can imagine a multi-
> threaded version of classic that keeps the process per
> connection but adds separate threads to allow parallel I/O
> operations, sorts, etc. Is that what you mean?

No. I mean something like the way Oracle Threaded Server works.
There is a pool of worker threads distributed among a pool of processes.
A connection context is allocated for each connection. A connection
context is a plain data object which can, for example, be swapped out to
disk; it doesn't consume resources by itself.
There is a listener process (a process pool, in fact) which handles
incoming requests and distributes them among the worker processes.
When a user request arrives, the request and its connection context are
assigned to a worker thread, which does the work and posts the result so
the listener can return it to the client.
I have left out some details - there are optimizations to minimize
context switches and context rebinding. Oracle recommends dynamic context
binding only when the number of connections is really large (>> 100).
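
Roughly, in code, the dispatch idea looks like this. This is just an
illustrative sketch in C++ - all names are invented, it is not Oracle's or
Firebird's actual code. A listener posts requests (each carrying its
connection context) to a queue, and a small pool of worker threads serves
them, so many connections share few threads:

// Illustrative sketch only - invented names, not Oracle's or Firebird's code.
#include <condition_variable>
#include <mutex>
#include <queue>
#include <string>
#include <thread>
#include <vector>

struct ConnectionContext {        // plain data: can be swapped out, owns no threads
    int connection_id;
    std::string session_state;    // stand-in for transaction/session state
};

struct Request {
    ConnectionContext* ctx;
    std::string sql;
};

class Dispatcher {
public:
    void post(Request r) {                    // called by the listener
        std::lock_guard<std::mutex> g(m_);
        queue_.push(std::move(r));
        cv_.notify_one();
    }
    void run_worker() {                       // body of each pooled worker thread
        for (;;) {
            std::unique_lock<std::mutex> l(m_);
            cv_.wait(l, [this] { return !queue_.empty(); });
            Request r = std::move(queue_.front());
            queue_.pop();
            l.unlock();
            execute(r);                       // bind the context, do the work,
        }                                     // post the result back to the listener
    }
private:
    void execute(const Request&) { /* engine work would go here */ }
    std::queue<Request> queue_;
    std::mutex m_;
    std::condition_variable cv_;
};

int main() {
    Dispatcher d;
    std::vector<std::thread> pool;
    for (int i = 0; i < 8; ++i)               // 8 workers serving many connections
        pool.emplace_back([&d] { d.run_worker(); });
    // a real listener loop would accept network requests and call d.post(...)
    for (std::thread& t : pool) t.join();     // workers run forever in this sketch
}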

>>It scales. I tested it hard. The FB1 server I configured
>>handles load from 3000 concurrent users via ~100 active pooled
>>connections, 10 hours a day, 7 days a week, in a production environment,
>>for more than a year on an 8-way SMP machine.

> That sounds great... How do you manage security using connection
> pools? Does that generalize? Clearly, maintaining open
> connections pretty much eliminates the need for a shared
> metadata cache. That's only important if you've got processes
> (connections) that exit and restart.

Firebird is an essentially insecure database (it still has a lot of bugs
in its security code) and nobody can rely on its security.
I moved security and connection pooling one level up: they are managed by
my application server. It also works with Oracle and MSSQL (and hides
most of the differences between them). You do object-oriented Java
programming, and the application server does most of the relational work.
By default it automatically maps objects to the data schema and to XML.

>>I haven't tested FB2 SS, but FB1 SS and IB6 SS die and
>>corrupt the database five times a day under similar load.

> Interesting. I thought there were sites using about that
> load - meaning about 100 connections shared among many more
> users - with SS. What sort of data corruption did you get?

I really don't think so. SS doesn't scale at all right now.
Our tests have shown that ~5 concurrent users is its limit for our
tasks. And there was an error that showed up when you had a complex schema
with different table record sizes: a wrongly sized (and already deallocated)
buffer was used while clearing out unneeded record versions.
It led to memory corruption problems which sometimes also
caused database corruption. I fixed it in FB2.

>>As I remember that configuration - the buffer cache was essentially
>>disabled. The OS file cache was huge (~2 GB, but much less than the
>>database size). No IO bottlenecks were encountered at all, notwithstanding
>>that it was a hybrid OLAP/OLTP database.

> Ah. OK. Were you using asynchronous writes? If not, then
> the OS file cache was serving as a shared page cache - why
> not?? (Except that there are operating systems that don't
> offer file caches, but hey, they're pretty much obsolete...)

Modern OSes go even further: they allow you to mark consistent states
of cached data and to limit the time before cached data is flushed to disk.
In my case there was a script which asked the OS to flush the disk cache
every few minutes.
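
Something as small as this does the job (an illustrative sketch of that
periodic flush; sync() is the standard POSIX call, the three-minute
interval is just an example value):

// Minimal sketch: ask the OS to push dirty file-cache pages to disk
// on a fixed interval.
#include <unistd.h>   // sync(), sleep()

int main() {
    for (;;) {
        sync();       // schedule writeback of dirty buffers for all filesystems
        sleep(180);   // wait a few minutes before the next flush
    }
}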

>> > A catastrophic error in a single connection doesn't
>> > affect other connections, while a server crash brings down
>> > all connections. The server shouldn't crash.
>>
>>It will crash sometimes. Just because shit happens.

> OK. But in your architecture, if a certain kind of shit
> happens (failure between the file system cache and the
> disk) you lose everything.

It is not an issue if you flush the cache regularly.

>>If you kill one process from the Oracle process pool, probably
>>nobody will notice the result. This is the way large DBMSs are
>>designed.

> That's not the way either Sybase or MS-SQL works. And on
> Windows systems, DB2 uses a thread-per-connection model.

>>I repeat, IPC should be _carefully_ designed based on
>>all features modern OS offer - shared memory, pipes, message queues,
>>semaphores, signals, memory protection, etc.

> No question. And as you track through the various bits and pieces,
> you'll find that classic uses shared memory - though there was an
> implementation once for a system that didn't have it - and pipes,
> though pipes have problems when you need to interrupt the guy
> at the other end - and message queues, semaphores, signals and
> mapped files. None of that stuff is new. It's just a lot of
> variants to deal with if the alternative is moving everything
> into one process.

> We agree, I think, that maintaining two architectures as different
> as superserver and classic is a problem. The question is which
> architecture offers the best performance in the general case. And,
> of course, that requires that we agree on what is the general case.

CS is stable and scalable. SS has never even been stable.

>>These features
>>are available on most platforms (just have a look at Oracle's list of
>>supported platforms and at a syscall trace of the features it uses).

> And if we had Oracle's funding - if we even had the funding that
> Oracle has put into its America's Cup Challenge - we too could
> handle two architectures.

That is the real truth. With just a few M$ in development and QA, FB2
could easily be made to compete with the Oracle RDBMS.

>>I think acceptable model is combined multi-process multi-threaded
>>server. This is optimal for some platforms (like Windows)
>>which have optimized threading and is good for stability.

> I'd like that better if it didn't require a pool of connections
> and didn't violate the requirement for careful writes.

Of course.

[...skipped...]

>>2)Allocate write-protected area in the shared memory to store
>>object versions associated with objects.

> Sure, but ... isn't that just a variant of a share metadata
> cache - the larger part is kept in the process, to be sure,
> but the idea is the same. In your case, the cost is lower
> because the connections are stable. Aside from the fact
> that shared memory used to be a scarce resource, why not put
> the whole metadata caches there?

1) Shared memory has to be treated carefully to avoid corrupting it when
a process crashes.
2) If you have a clustered machine that simulates shared memory (over
some kind of network), the cost of writing to shared memory may increase.
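
As an illustration of point 1 and of step 2 from the quote above, here is
a rough sketch using POSIX shared memory. Only the shm_open/ftruncate/
mmap/mprotect calls are real; the segment name and the table layout are
invented for the example:

// Sketch: a write-protected object-version table in shared memory.
#include <fcntl.h>
#include <sys/mman.h>
#include <unistd.h>
#include <cstddef>
#include <cstdint>

static const char*  kSegName   = "/fb_object_versions";   // hypothetical name
static const size_t kSlotCount = 4096;                     // one slot per object

int main() {
    int fd = shm_open(kSegName, O_CREAT | O_RDWR, 0600);
    if (fd < 0) return 1;
    size_t bytes = kSlotCount * sizeof(uint32_t);
    if (ftruncate(fd, bytes) != 0) return 1;

    // Map the table read/write only long enough to initialise it...
    uint32_t* versions = static_cast<uint32_t*>(
        mmap(nullptr, bytes, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0));
    if (versions == MAP_FAILED) return 1;

    // ...then drop to read-only, so a stray pointer in a crashing process
    // cannot scribble over the shared table. A writer re-enables PROT_WRITE
    // around a single slot update and protects the page again afterwards.
    mprotect(versions, bytes, PROT_READ);

    // readers compare versions[object_id] with their locally cached version
    munmap(versions, bytes);
    close(fd);
    return 0;
}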

>>3) During normal operation do lookups and compare local
>>cached object version and version in the shared table.

> That's actually where the problem occurs. Once a request
> is compiled, there's no further checking that the assumptions
> that went into it are still valid. Nor is there any check
> on the validity of structures used by a running request.

> If one process is running a query that depends on an index
> and another deletes the index, the running query is going
> to hit a wall when it reads what it expects to be an index
> page and discovers that it has been released and reallocated
> as a data page.

It doesn't happen now, because prepared queries take existence locks
on indices, and you should never be able to delete a resource that a
compiled request is using. There seems to be a bug in the handling of
procedures here, but other objects should be fine now. In general, you
should receive an "OBJECT IN USE" error when you try to do almost any DDL
while several classic connections are open, because of metadata caching.
I have worked on this code a lot recently to make it stable.
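
The version check in step 3 is itself trivial - something like this
sketch, where the names are invented and the shared table is the
read-only mapping from the sketch above:

// Before using a locally cached, compiled metadata object, compare its
// stamped version with the slot in the shared version table.
#include <cstdint>

static const uint32_t* g_shared_versions = nullptr;  // points at the shared mapping

struct CachedObject {
    uint32_t object_id;
    uint32_t cached_version;     // version at the time this copy was compiled
    // ... parsed/compiled metadata would live here ...
};

// True if the local copy is still current; false means the object changed
// and must be re-read and re-prepared before the request may run.
bool cache_is_current(const CachedObject& obj) {
    return obj.cached_version == g_shared_versions[obj.object_id];
}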

>>I can describe similar methods of working with locks and the buffer cache.
>>Look at my algorithm - it keeps the amount of process-shared data at a
>>minimum and protects it. This makes the system robust and efficient in a
>>clustered environment.

> Clustered? Hmmm. OK. How do you communicate page locks
> in a cluster?

Via the usual IPC methods - shared memory, semaphores, etc. The OS kernel
provides them in a clustered environment too.
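
For illustration, a cross-process lock can be as simple as a POSIX
semaphore placed in a shared memory segment that every process maps. The
segment name and the "one lock per page" idea here are only an example:

// Sketch of a cross-process page lock built from standard POSIX calls.
#include <fcntl.h>
#include <semaphore.h>
#include <sys/mman.h>
#include <unistd.h>

int main() {
    int fd = shm_open("/fb_page_lock_demo", O_CREAT | O_RDWR, 0600);
    if (fd < 0) return 1;
    if (ftruncate(fd, sizeof(sem_t)) != 0) return 1;

    sem_t* lock = static_cast<sem_t*>(
        mmap(nullptr, sizeof(sem_t), PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0));
    if (lock == MAP_FAILED) return 1;

    // In real code only the process that creates the segment would call
    // sem_init; pshared=1 makes the semaphore usable across processes.
    sem_init(lock, /*pshared=*/1, /*value=*/1);

    sem_wait(lock);    // acquire the page lock
    /* ... read or modify the shared page ... */
    sem_post(lock);    // release it

    munmap(lock, sizeof(sem_t));
    close(fd);
    return 0;
}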

>>I designed and implemented several products people have to really rely
>>on (like real-time medical database which works with life-support
>>devices) and I know basic principles of it. They are:
>>1) There should be no single point of failure
>>2) Shit happens

> From what I understand, your reliance on the file cache is a
> single point of failure and endless shit is likely to happen
> if someone drops an index on a running system. What have I
> missed?

I described the solutions above.

>>Your model breaks both principles. It will not be reliable.

> OK, lets work on making a model that is reliable.



--
Best regards,
Nickolay Samofatov mailto:skidder@...