firebird-architect - Re: [Firebird-Architect] RFC: Clustering

Subject	Re: [Firebird-Architect] RFC: Clustering
Author	= m. Th =
Post date	2006-11-20T11:19Z

Alexandre Benson Smith wrote:

> = m. Th = wrote:
>
>> No, is solved. If we implement this by allowing the fbclient.dll to send
>> /simultaneously/ the commands to all nodes. A intresting thing which I
>> see now though is the time to switch to another node/connection thread
>> and continue fetching data from there when the fbclient kills the main
>> connection thread due to 'unable to connect to host' error. But I wait
>> the opinion from the developers.
>>
>>
> Hi,
>
> I have read your proposal, really don't know how feasible it is (I am
> not arguing against *your* proposal, but the "behind the scenes work"
> that pass my superficial knowledge about FB).
>
> A point I can't understand is how /simultaneously/ would be achieved.
> I understand that the fbclient will have a connection to each node and
> send the same statement (insert, update, delete, DDL, selects don't need
> to go to each one, but just one that will answer the query) , this need
> to be atomic, an statement should only be considered committed once it
> goes to all nodes, if a node fail, how it will be rolled back on the
> others (two-phase commit ?), or once one of the nodes couldn't be
> reached it will just be marked as outdated ?
>
>

Yes. Basically, let's keep in mind that "a cluster is a virtual
computer". In deed, we have three nodes N1, N2, N3. The client will send
the command to each node on a separate thread (with, perhaps, different
execution speeds) but a commit or rollback will be done in sync IOW, all
the other threads will wait for the slowest thread to terminate (either
successfully either with a 'host not found' error). Doing a separate
commits/rollbacks will break the atomicity and the "virtual computer"
will be a image broken in a thousand pieces.

As an aside, also the Selects must _go_ to each node, in order to have
the proper contexts ready for switching, but the actual data is
_returned_ only from one. Let me explain. On our nodes we have a table
T1 with 100 records. We issue a SELECT * FROM T1. My TDBGrid forces the
fbclient to fetch 23 rows to fill it thus the 'cursor' advances on the
server side result by 23 records. At this moment the data server
crashes. To gracefully switch to another data server and fetch the next
23 rows (thing which will happen if I'll press the Page Down key inside
of my DBGrid) we need to have in the new data server the same context
pointing at the same row. If it wasn't clear, I can build a diagram for you.

> Another question, now talking about distributed nodes using a gateway
> (AFAIU this is the way you suggest distributed nodes should be used)
>
> If it will be marked as outdated, how frequent will be all nodes be
> marked as oudated ?
>
>

Once. First time. The 'oudated' status of the failed node is stored in
the RDB$Files table from the db of each node and the clients know to not
use it. Requires DBA intervention in order to bring the node on-line
back again. Because the synchronizing / updating / fixing a (possible)
broken db is very complex the node is activated by overwriting the
problematic db file (this in case in which exists).

> On time T1 host in Brazil could not be reached by the gateway, so it
> will be marked as outdated
> On time T2 host in Canada could not be reached by the gateway, so it
> will be marked as outdated
> On time T3 host in "Your office" could not be reached by the gateway, so
> it will be marked as outdated
>
>

I didn't understand exactly what do you meant. I'm sorry. Perhaps can
you explain in other way?

> Another question will be:
> If I have a node on my Branch in Brazil, my traffic should go to the
> gateway on France, that will connect to my node in Brazil get the
> response back, while I could reach it inside my LAN (the same for the
> other branches), lookslike an overkill to me.
>
>

Of course. Because this model is working (as you stated bellow) _only_
(and _only_) on a self-contained LAN (or high-speed) segment. I think
that this constraint isn't so hard to achieve. You can, for example, put
the computers in different rooms, with different UPS/power sources,
different OS etc. N-way replication between scattered nodes on n-data
paths, this is the thing which makes the clusters so scary, fragile and
complex. I don't try to build a perfect cluster configuration, I try to
propose one easy to implement which with reasonable constraints will
accomplish our goal. And (as Jim stated) there isn't a product which can
do all the things as you also know.

> But for fault tolerance and all hosts on the same LAN segment (or
> connected by extremelly fast and reliable links), I understand it could
> work.
>
>

Exactly.

> Another point:
> Even if the connections are really fast and all hosts on the same LAN
> the order of propagation of the updates by the fbclient to all hosts is
> fixed ?
> Let's take this scenario running simultaneously:
>
> T1 = Transaction 1, T2 Transaction 2, C1 = Client 1, N1 = Node 1 and so on.
>
> C1 - T1 - Insert into Children (ParentID) (1) on N1; (ok)
> C2 - T2 - Delete from Parent where ParentID = 1 on N2; (ok)
> C1 - T1 - Insert into Children (ParentID) (1) on N2; (fail FK dependence)
> C2 - T2 - Delete from Parent where ParentID = 1 on N1; (fail FK dependence)
>
> Let's suppose that for some unknown reason the DELETE statement from
> C2-T2 runs faster on N2 than on N1 the statement only will be considered
> "done" when all nodes send a sucessfull response ?

Yes. Nothing which the others will see will "run faster than". This
breaks the "virtual computer" paradigm. As I stated before, the client
engine will wait the slowest thread. And, also, if one command will "run
faster" on a node then most probably will run faster in all the cases
and imagine what will happen on heavy load (bulk inserts for ex.). The
desynchronization will be bigger an bigger...

> The rollback would be
> handled automagically on the "sucessfull" nodes by the client ? Node 1
> will not send an error to C1 - T1 in INSERT neither Node 2 will send an
> error to C2-T2 on the DELETE.
>
> Did I get it wrong ?
>
>
>

Because each all nodes will be in sync there will be the same response
from all. If a node, for any reason will be out of sync, will be marked
as outated.

>> Usually the Firebird is very fast. Perhaps you can expose your problem
>> (with queries, configuration aso.) in firebird-support group. The main
>> goal of clustering in my implementation isn't to achieve a throughput
>> improvement.
>>
>>
>
> Eduardo is a very experienced FB developer (from the user point of view,
> instead of core FB developer). I don't know the specific performance
> problem they are talking about, but I can guess he is talking about
> distributed branches across the country and a server in the main office
> and all branchs linked by (slow) ADSL connections (it's a very common
> scenario here in Brazil), and he is talking about replication to have a
> server at each branch that does a N-way sync to the others.
>
>

My model will work for him if he'll have a 'cluster room' at his main
office with, let's say, 5 computers, 4 fbserver nodes and one with the
gateway running. (Also he can put the gateway directly on one of the
nodes and connect to the node locally if he wish). The nodes are
connected to a 1Gb switch (also 100Mb will do with very, very good
results for a wide range of applications) and all the ADSL clients will
connect on the 5th computer. This should assure a pretty high
availability to him.
Also, if the application(s) don't deal with heavy data upload (ie. big
blobs, images, sounds, video aso.) then he can use safely the system
without gateway thus eliminating a point of failure (as you know the
biggest command can be only 64k and 64k X 4 = 256k which is send in
no-time over ADSL - of course I don't know what 'slow' ADSL means ... :) ).

> see you !
>
>

Your input much appreciated...

hth,

m. th.