Subject Re: [Firebird-Architect] Re: Remote Shadows (again...)
Author Jim Starkey
Roman Rokytskyy wrote:
>> An excellent question. Yes, something like that could be done, but
>> let's look at the ramifications. First, that would require that
>> every server process maintain a separate connection to the shadow
>> server (not a bad name; somebody will probably want to change to
>> show he's paying attention; you should always pick a bad name
>> first). This complicates the shadow server.
>>
>
> Or you change the way you connect to the server. TCP is currently the
> only option that people start thinking about, but they forgot the full
> family of group communication protocols that use UDP multicast
> (unreliable protocol) to reliably deliver information to multiple
> nodes without requiring separate connection to them.
>
Sure, you could use UDP. But by the time you've built guaranteed
sequential delivery, you've reproduced TCP.
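
To make that concrete, here is a minimal sketch, in Python, of the
receiving half of "guaranteed sequential delivery" over datagrams. The
class and method names are hypothetical, and the sketch assumes every
datagram carries a sequence number starting at 0. It only reorders
arrivals and reports holes; retransmission timers, flow control, and
connection setup are all still missing, and those are most of what TCP
already provides.

```python
class SequencedReceiver:
    """Delivers datagrams in sequence order, buffering out-of-order arrivals."""

    def __init__(self):
        self.next_seq = 0     # next sequence number owed to the application
        self.pending = {}     # out-of-order datagrams, keyed by sequence number
        self.delivered = []   # payloads handed to the application, in order

    def receive(self, seq, payload):
        """Accept one datagram; return the sequence numbers still missing."""
        if seq >= self.next_seq:
            # Ignore duplicates of data that was already delivered.
            self.pending.setdefault(seq, payload)
        # Deliver everything that is now contiguous with what we have.
        while self.next_seq in self.pending:
            self.delivered.append(self.pending.pop(self.next_seq))
            self.next_seq += 1
        if not self.pending:
            return []
        # Any number below the highest buffered one that has not arrived
        # is a hole the receiver would have to NACK.
        highest = max(self.pending)
        return [s for s in range(self.next_seq, highest)
                if s not in self.pending]
```

The list returned by receive() is what a NACK-based protocol would send
back to the sender to ask for retransmission.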

I guess that multi-cast could be used to talk to multiple shadow
servers, but making that reliable would be very, very hard.
>
>> On the other hand, if the shadow server is going to
>> shadow more than one database, maybe that's unavoidable. Second, to
>> guarantee database integrity, the protocol would have to sequence
>> messages.
>>
>
> Automatically guaranteed by the protocol stack - you specify only the
> required properties (FIFO, Total Ordering).
>
Sorry, it isn't. If two processes on one machine do writes on TCP sockets
to a single remote process, there is nothing that guarantees in what
order they will be delivered, even with an explicit flush. Unless you
have some magic up your sleeve, the only way to synchronize two classic
processes talking to a single shadow server is for one to wait for an
acknowledgment before releasing the lock on the page. On a single
socket, TCP guarantees in-sequence delivery. On two or more, you pays your
money and takes your chances...
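
A minimal sketch of that acknowledge-before-release rule, in Python,
with threads standing in for classic processes and a direct method call
standing in for the socket round trip. ShadowServer, Database, and
update_page are hypothetical names for illustration, not anything in
the engine.

```python
import threading


class ShadowServer:
    """Stand-in for the remote shadow: stores pages and acknowledges each."""

    def __init__(self):
        self._mutex = threading.Lock()
        self.pages = {}   # page number -> (version, data)
        self.log = []     # (page number, version) in arrival order

    def write_page(self, page_no, version, data):
        with self._mutex:
            self.pages[page_no] = (version, data)
            self.log.append((page_no, version))
        return True       # the acknowledgment


class Database:
    """Stand-in for the state classic processes share through page locks."""

    def __init__(self, shadow):
        self.shadow = shadow
        self.pages = {}
        self._locks = {}
        self._locks_mutex = threading.Lock()

    def _page_lock(self, page_no):
        with self._locks_mutex:
            return self._locks.setdefault(page_no, threading.Lock())

    def update_page(self, page_no, data):
        with self._page_lock(page_no):
            version = self.pages.get(page_no, (0, None))[0] + 1
            self.pages[page_no] = (version, data)
            # Block until the shadow acknowledges, still holding the lock.
            if not self.shadow.write_page(page_no, version, data):
                raise RuntimeError("shadow did not acknowledge the page")
        # The page lock is released only here, after the acknowledgment.
```

Because the next writer of a page cannot even start until the previous
image has been acknowledged, the shadow sees each page's versions in
order. The price is one round trip per page write while the lock is
held.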
>
>> This would require that the shadow server return an
>> acknowledgment to the database engine that the page message had been
>> accepted before the page lock could be downgraded.
>>
>
> This scheme has performance issues - usually a NACK is used - when a
> node detects a hole in packets, it sends the message back.
>
Yes, that was my point, actually.
>
>> Finally, the database
>> attachment process in classic would require a hand shake with the
>> shadow server, slowing that down too.
>>
>
> Ok, here it is the same - joining the group is expensive. The only
> possible solution - join it and stay connected.
>
That, by definition, doesn't work for classic. But there are lots of
alternatives that could be investigated. For example, a single
forwarding process connected to classic by IPC. Probably OK for classic
but utterly unnecessary for superserver. Again, my point: In a combo
code base, the existence of classic slows down superserver.
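
A minimal sketch of the forwarding-process idea, in Python. A
queue.Queue stands in for the IPC channel and the send callable stands
in for the one connection to the shadow server; both, along with the
function name, are assumptions for illustration. The sketch also
assumes each classic process enqueues its page while still holding the
page lock, so that queue order matches update order.

```python
import queue
import threading


def forwarder(inbox, send):
    """Drain the IPC queue and forward each page on the single connection."""
    seq = 0
    while True:
        item = inbox.get()
        if item is None:
            # Sentinel: shut the forwarder down.
            break
        source, data = item
        # One sender on one connection, so there is one total order.
        send(seq, source, data)
        seq += 1
```

A usage example, with threads standing in for classic processes:

```python
inbox = queue.Queue()
sent = []
fwd = threading.Thread(
    target=forwarder,
    args=(inbox, lambda seq, source, data: sent.append((seq, source, data))),
)
fwd.start()
inbox.put(("classic-1", b"page image"))
inbox.put(None)
fwd.join()
```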
>
>> Yes, it could be done. But doing it in classic would cripple the
>> performance of the superserver.
>>
>
> I tell you one more scheme - the Linda language family / tuple spaces
> to which classics connect via IPC and it then sequentially sends pages
> to shadow server. Also preserves the ordering.
>
>
>
Are you talking about the forwarding process described above or
suggesting that the shadow server could be on the same machine? The
latter, in my usual humble opinion, is no good for server failover,
which is how we got into the discussion in the first place.

For what it's worth, Interbase was born with everything in the engine to
support a shadow server -- everything. Rather than a shadow server,
however, the engine talked to a journal server. The basic scheme was
that on first page update the cache manager allocated a second buffer to
hold deltas. When the page was written, the deltas (or the page itself,
if the accumulated length of the deltas overflowed the secondary buffer)
were blasted to the journal server (gltj, for you Interbase
historians). If the code were still there, all that would be necessary
would be to replace (or augment) gltj to maintain a shadow rather than a
sequence journal file.
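
A minimal sketch of that delta-buffer scheme, in Python. This is not
the Interbase code: the class name, the (offset, bytes) delta format,
the assumed 4-byte per-delta header, and the list standing in for the
journal server are all assumptions made for illustration.

```python
class JournaledPage:
    """A cached page that accumulates deltas in a bounded second buffer."""

    DELTA_HEADER = 4  # assumed per-delta overhead in bytes

    def __init__(self, page_no, data, delta_capacity=64):
        self.page_no = page_no
        self.data = bytearray(data)
        self.delta_capacity = delta_capacity
        self.deltas = None       # second buffer, allocated on first update
        self.delta_bytes = 0
        self.overflowed = False

    def update(self, offset, new_bytes):
        if offset < 0 or offset + len(new_bytes) > len(self.data):
            raise ValueError("update falls outside the page")
        self.data[offset:offset + len(new_bytes)] = new_bytes
        if self.deltas is None:
            self.deltas = []     # first update: allocate the delta buffer
        if self.overflowed:
            return               # already committed to sending the whole page
        cost = len(new_bytes) + self.DELTA_HEADER
        if self.delta_bytes + cost > self.delta_capacity:
            self.overflowed = True
            self.deltas.clear()
        else:
            self.deltas.append((offset, bytes(new_bytes)))
            self.delta_bytes += cost

    def write(self, journal):
        """On page write, send the deltas, or the whole page on overflow."""
        if self.deltas is None:
            return               # page was never updated
        if self.overflowed:
            journal.append(("page", self.page_no, bytes(self.data)))
        else:
            journal.append(("deltas", self.page_no, list(self.deltas)))
        self.deltas = None
        self.delta_bytes = 0
        self.overflowed = False
```

Whatever consumes the journal list here could just as well apply the
records to a shadow copy instead of appending them to a journal file.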

Unfortunately, Borland had this hot idea about write ahead logs and
thought that a write ahead log could be both an after image journal and
write ahead log. In the process of implementing it, they ripped out all
of the original journaling system. When they finally understood that
their architecture introduced a single point of failure (the write ahead
log), they threw their hands in the air and punted the whole thing.
There's a lesson here, gentlemen (Ann isn't a gentleman, but doesn't
need the lesson, either).


--

Jim Starkey, Senior Software Architect
MySQL AB, www.mysql.com
978 526-1376