Subject: Re: [Firebird-Architect] RFC: Clustering
Author: Eduardo Jedliczka
Hi,

(I'm only a Brazilian Firebird user with no relationship to FB-Devel,
so ignore me if I talk nonsense, and sorry for my bad English.)

The truth: I have lost a lot of sleep thinking about Firebird's current
structure and how to gain performance in a cluster. My C++ capabilities
are near ZERO.

First of all, I think CLUSTER NODE is fabulous, and I like your
commentaries a lot; you have clearly spent a lot of time thinking about
it. But I believe shadows are not the best way. An internal solution
like the flag used in NBackup (with a page version for each cluster
node) would be much better...

Of course we need the primary database server. But if we turn off a
secondary cluster node for a short time, I think the recovery time with
shadows would be very long (again, it's only my point of view), while
copying only the changed pages is a fast way to do the same thing.
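
Just to illustrate the idea (a minimal sketch, not real NBackup code;
the per-page change counter and the helper names are only assumptions):

# Sketch of "copy only the changed pages" recovery. It assumes every
# page carries a change counter (like the SCN kept by nbackup) and that
# the node remembers the counter value it was last in sync with.

def resync_node(primary_pages, node_pages, node_last_scn):
    """primary_pages: {page_no: (scn, data)} on the data server.
    node_pages: the same map on the node that was offline."""
    copied = 0
    for page_no, (scn, data) in primary_pages.items():
        if scn > node_last_scn:       # page changed while the node was off
            node_pages[page_no] = (scn, data)
            copied += 1
    return copied                     # far less work than a full shadow copy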

I concluded that because I have been studying how to make this possible
outside the server: implementing the feature in an external program
(like ibmonitor) that "monitors" the SQL packets going from the clients
to the primary server, mirrors the DML to the other server nodes, and
distributes the queries to the least occupied server.

My first problem was UDFs (those that access external files, drivers,
O.S. resources or the web), and some "updating" stored procedures...
but IF cluster support is made inside the database, these operations
can be "replicated" correctly on the cluster nodes.

The other problem is USER control (grants, etc.)... but that's another
story...

======================
Eduardo Jedliczka
Apucarana - PR
Brazil.
======================

----- Original Message -----
From: "= m. Th =" <th@...>
To: <Firebird-Architect@yahoogroups.com>
Sent: Friday, November 17, 2006 8:50 AM
Subject: [Firebird-Architect] RFC: Clustering


> Hi,
>
> This is a draft about Firebird clustering; feel free to read and
> comment. This is the hot-swap variant (i.e. true clustering). The
> cold-swap variant (i.e. restart the application in order to go to
> another server with the data in sync) is already done, if Dmitry wants
> to commit the remote shadows feature for Windows, which I tested and
> benchmarked for him. This document is a draft based on our small
> research in the area and on experience with similar products. If
> something isn't clear, please ask. Sorry for the long and convoluted
> text (a Word document is available on request). Also, if someone is
> interested, we can provide some diagrams.
>
> 1. Architecture
>
> The bare thing:
> The client will have several connection strings (separated by ';') and
> will send all the commands to all the connections defined by these
> strings. Only one connection, if it is available, will return the data.
> The other connections are used for data and context synchronization.
> If you want to know how the client gets these connection strings, how
> to deal with slow connections, node failures, rebuilding etc., read on.
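>
> A minimal sketch of this client-side fan-out (the class and method
> names are made up for illustration; this is not the real client API):
>
> from concurrent.futures import ThreadPoolExecutor
>
> class ClusterConnection:
>     def __init__(self, connections):
>         # connections[0] is the data server, the rest are mirrors
>         self.connections = connections
>         self.pool = ThreadPoolExecutor(max_workers=len(connections))
>
>     def execute(self, sql, params=()):
>         # Send the command to every node so data and contexts stay in
>         # sync; collect all replies before going on.
>         futures = [self.pool.submit(c.execute, sql, params)
>                    for c in self.connections]
>         replies = [f.result() for f in futures]
>         # Only the data server returns rows; the mirrors answer 'ok'.
>         return replies[0]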
>
> 2. How the cluster is created
> a. The goal
>
> The goal is to create identical copies of the database on each cluster
> node, each copy containing the list of database files from the other
> nodes, stored in RDB$Files with the appropriate flags.
>
> b. A proposed way (using SQL)
>
> Issuing CREATE CLUSTER NODE <remote file> will use the remote shadows
> engine to create a copy of the database there and to put the appropriate
> flags in the header of the file and records in RDB$Files, both in the
> local file and in the remote one. This seems a little bit tricky, because
> it requires an FB server to be running on that computer… but I don’t
> think that this would be a big request. Also, the isql utility (or the
> equivalent) must be able to connect to the new node in order to update
> the RDB$Files there. For example:
> We are on Server-A:C:\MyDB\DB01.FDB and we issue: CREATE CLUSTER NODE
> Server-B:C:\MyDB\DB01.FDB; and CREATE CLUSTER NODE
> Server-C:C:\MyDB\DB01.FDB;
> On all the servers we’ll have in the RDB$Files (at least) the following
> three records:
> Server-A:C:\MyDB\DB01.FDB
> Server-B:C:\MyDB\DB01.FDB
> Server-C:C:\MyDB\DB01.FDB
> IMHO, in the flags field we can put the node status: Data server,
> on-line, offline/outdated.
>
>
> c. Things to consider
> i. All the clients must be out (perhaps the ‘deny new
> transactions’ shutdown mode is enough)
>
> In other words, ‘secondary attachments cannot create clusters’ – else
> the users already connected to the database which is currently being
> promoted to a cluster will remain attached to ‘Server-A:C:\MyDb…’ while
> the new ones will connect to all the nodes, thus generating a data
> de-synchronization between the nodes.
>
> 3. How it works (an example)
> a. In normal conditions
>
> When a client (let’s name the clients CL1, CL2, ..., CL100) connects
> to one of these databases, which are now nodes in a cluster, the server
> to which it is connected will send back the new ‘connection string’,
> formed from all of the cluster nodes, and then we’ll have, behind the
> scenes, an app with the following connection string:
>
> Server-A:C:\MyDB\DB01.FDB; Server-B:C:\MyDB\DB01.FDB;
> Server-C:C:\MyDB\DB01.FDB
>
> (of course, the developer typed only ‘Server-A:C:\MyDB\DB01.FDB’ into
> TIBODatabase.Databasename)
>
> The client will send all the commands to all the three servers. The DML
> commands are obvious; the SELECTs are sent as well in order to ensure
> the hot-swap capability (in the case of failure we need the contexts to
> be in sync). IMHO, this architecture is better than a “relay” or
> “emitting” server which, after updating its data, replicates it on the
> other nodes, also from a performance and complexity POV: the server NIC
> will have heavy traffic (each incoming data chunk must be sent back out
> to the other nodes), and the server CPU must run the replication
> process, heartbeat the other nodes to learn their state, and replicate
> the contexts over the cluster nodes, in other words the open cursors,
> selects, selectable stored procedures in progress, transaction IDs and
> so on. Also, the replication process tends to be serialized since it is
> the same engine which does it. (Of course you can do it in one thread
> per cluster node, but the simultaneous changes from many clients will
> remain serialized within each thread.) (We have “not so nice”
> experiences in this area with Windows clustering; anyway, a nice try
> and a big amount of work from M$, but the clustering engine was too
> slow and ‘fuzzy’ for our purposes.) Also, using a DLM with a storage
> quorum in the middle will require expensive hardware in order to make
> things fast and also introduces a new point of failure: the shared
> quorum storage. That’s why I propose a Majority Node Set cluster
> implementation in favor of a Single Quorum Device cluster. But this
> implementation doesn’t exclude the DLM approach. Also, the Majority
> Node Set model can be used on geographically dispersed nodes, which
> isn’t the case with the DLM.
>
> The first server from the connection string (Server-A) will be, in the
> normal case, the data server and will return the data (rows from
> SELECT, rows from selectable procedures, output parameters and so on).
> This will be established at login time. The client will supply to each
> cluster node a login structure like:
>
> Username: string;
> Password: string;
> <…many other things…>
> LoginType: integer; //0 – normal mode (data server), 1 – mirror mode
> (normal cluster node)
>
> And will have locally a list with nodes like:
>
> Server
> <…many other things…>
> ServerStatus: integer; //0 – online data server, 1 – online mirror
> cluster server, 2 – offline/failed node
>
> If a client issues a ‘select’ and the LoginType is 0, then the server
> will act normally. If the LoginType is 1, then the server will return
> ‘ok’, begin executing the query (in a separate thread?) and, at each
> ‘fetch’ issued by the client, will advance through the result set,
> updating its buffers and returning ‘ok’, without returning the actual
> data over the wire, for performance reasons. The client will expect
> only the ‘ok’ packet, based on the ServerStatus member of the structure.
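>
> A small sketch of that fetch path on a node (illustration only, not
> engine code; the function and field names are assumptions):
>
> def handle_fetch(cursor, login_type):
>     row = cursor.fetchone()      # every node advances its cursor, so
>                                  # the contexts stay identical
>     if login_type == 0:          # data server: ship the row
>         return ("data", row)
>     return ("ok", None)          # mirror: buffers updated, no payload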
>
> b. Election of data server
>
> At login time the clients will try to connect to all the servers
> provided in the connection string. If the logins are successful, then
> each client will query each server for the CURRENT_TRANSACTION
> environment variable. All the nodes which have CURRENT_TRANSACTION <
> Max(CURRENT_TRANSACTION(<Cluster>)) are marked with ServerStatus=2
> (offline/failed node). If more nodes have CURRENT_TRANSACTION =
> Max(CURRENT_TRANSACTION(<Cluster>)), then check how many clients are
> attached to each node. Now, based on a configuration parameter stored
> in the .config file, we have two policies: ivory-tower and brotherhood
> (sorry for the ‘odd’ naming…). In the ivory-tower policy, the node
> which has the maximum number of clients attached to it will be the data
> server. This is for the case in which we have a very safe / very good /
> very fast server compared to the other nodes. In the brotherhood
> policy, the node which has the minimum number of clients will be the
> data server. This is useful for load balancing, when the servers are
> close in technical specifications, network speed, availability etc. If
> more nodes have the same maximum (or minimum, depending on policy)
> number of clients, then the data server will be the first of these
> nodes which appears in the connection string. (The client will update
> both ServerStatus locally and LoginType on the server.) It is important
> for the client to send the commands to all nodes simultaneously (in
> separate threads) in order to ensure that at any moment in time there
> are no TID discrepancies between the nodes. This is necessary in order
> not to mark as ‘offline’ nodes which are in fact online.
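>
> A minimal sketch of this election (the node helpers
> current_transaction() and attached_clients() are made-up names):
>
> def elect_data_server(nodes, policy="brotherhood"):
>     tids = {n: n.current_transaction() for n in nodes}
>     max_tid = max(tids.values())
>     for n in nodes:
>         # Nodes that fell behind are considered offline/outdated.
>         n.status = 2 if tids[n] < max_tid else 1
>     candidates = [n for n in nodes if n.status == 1]
>     counts = {n: n.attached_clients() for n in candidates}
>     best = (max if policy == "ivory-tower" else min)(counts.values())
>     # Ties are broken by position in the connection string (list order).
>     data_server = next(n for n in candidates if counts[n] == best)
>     data_server.status = 0           # online data server
>     return data_server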
>
> c. Failure of data server
>
> Mark the ServerStatus of the corresponding server with 2 (offline/failed
> node), kill the sending thread to the failed node (the old data server),
> do a new election of the data server, and the client library must be
> able to continue fetching data (rows, parameters) from the new data
> server. The DML commands (if they exist) should continue on the other
> mirror sending threads. A breaking error must be raised only if no more
> nodes are available.
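>
> Sketched as code (reusing the election sketch above; all names are
> illustrative):
>
> def on_data_server_failure(cluster, failed_node):
>     failed_node.status = 2                    # offline/failed node
>     cluster.kill_sending_thread(failed_node)
>     alive = [n for n in cluster.nodes if n.status != 2]
>     if not alive:
>         raise ConnectionError("no cluster node available")
>     cluster.data_server = elect_data_server(alive, cluster.policy)
>     # Contexts were kept in sync, so fetches can simply continue
>     # from the newly elected data server.
>     cluster.resume_fetches(cluster.data_server)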
>
> d. Failure of mirror server
>
> Mark the ServerStatus of the corresponding server with 2 (offline/failed
> node) and kill the corresponding sending thread.
>
> 4. Cluster gateway
>
> Perhaps it would be very good to have an optional program which acts as
> a gateway for the cluster. In fact, it will be a middle tier between
> the clients and the cluster nodes, routing the clients’ requests to
> the active cluster nodes. It will have the same engine as above. How it
> works:
> a. The gateway (located on ‘Server-R’) reads (from a configuration file)
> the listening port (let’s say that it is 3050) and the ‘routing table’,
> in a form something like:
> Server-R:C:\Fdb\DB01.FDB=Srv-A:C:\Fdb\DB01.FDB;Srv-B:C:\Fdb\DB01.FDB
> Server-R:C:\DB02.FDB=Srv-C:C:\DB02.FDB;Srv-B:C:\DB02.FDB
>
> b. When a client connects to Server-R:C:\DB02.FDB, the middleware,
> using the protocol described above, will connect to Srv-C and Srv-B and
> thus will transparently serve the clients. Any cluster node state change
> will be logged somewhere.
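>
> A sketch of how the gateway could parse that routing table (the
> “alias=node;node” lines above are the proposal; the code is only an
> illustration):
>
> def read_routing_table(path):
>     routes = {}
>     with open(path) as f:
>         for line in f:
>             line = line.strip()
>             if not line or line.startswith("#"):
>                 continue                      # skip blanks and comments
>             local, remotes = line.split("=", 1)
>             routes[local.strip()] = [r.strip()
>                                      for r in remotes.split(";")]
>     return routes
>
> # routes["Server-R:C:\DB02.FDB"] -> ["Srv-C:C:\DB02.FDB",
> #                                    "Srv-B:C:\DB02.FDB"]
>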
> Things to consider:
> a. The effective speed of sending commands over the wire is increased,
> because each client sends its commands only once, to the gateway. This
> is useful in the rare cases in which we have a slow connection (read:
> Internet, wireless) and we have to upstream a lot of data (mainly
> BLOBs).
> b. The gateway is a new point of failure. That’s why it should remain
> optional. Only those who want to use it should enable it.
>
> 5. Things to consider
>
> a. The cluster is not to be spread on different segments
>
> Consider the following network configuration (for those of you who
> don't see the 'drawing' clearly, change the font to a fixed-size one,
> e.g. Courier):
>
>
> CL1 Server-A
> | |
> CL2 - Switch 1 ----- Switch 2 – Server B
> | |
> Server-C CL3
>
> Let’s say that the data server is Server-A.
> If a network failure appears between Switch 1 and Switch 2 (the cable
> is broken), then CL1 and CL2 will elect Server-C as the new data server
> while CL3 will remain with Server-A as its data server. Such network
> configurations should generally be avoided, or a gateway should be used
> (see above). This is a general problem with clusters. But, IMHO, I
> would rather have a switch as a single point of failure than an
> application (the gateway, which is, in fact, the middle tier between
> the components). Also, if the switches support (R)STP ((Rapid) Spanning
> Tree Protocol) and are configured properly, then we have no single
> point of failure.
>
> b. Very simple configuration
>
> In other words, there isn’t something like a ‘cluster console’ or a
> centralized cluster administration (though one can easily be built),
> and the cluster configuration can be read (through a select on
> RDB$Files); it isn’t transparent to the developer (it is transparent
> only to the user), thus leaving to the developer the entire
> responsibility (and flexibility) to configure it through the API or
> (please implement) SQL commands.
>
> c. No administration costs, near to 0 down time
>
> On the other hand, the administration costs are (near to) 0 and
> ‘rebuilding’ the cluster means in fact a ‘simple’ file copy (here, of
> course, a variant of CREATE SHADOW is much more indicated, due to
> issues with data corruption, shutting down all the servers etc.). Also,
> if a SQL command (based on the CREATE SHADOW engine) is implemented in
> order to copy the data to the new/restored nodes, then we’ll have (near
> to) 0 down time. If you use a gateway, this can be automated.
>
> d. The developers can build ‘x’ clusters on ‘y’ servers
>
> Because the clusters are tied to databases, the developers can use the
> Firebird servers to build different clusters with as many nodes as they
> want. For example:
>
> In application Erp.exe we’ll have the following connection string (built
> by the client from RDB$Files):
> Srv-A:C:\ERP\2007.FDB;Srv-B:C:\ERP\2007.FDB;Srv-C:C:\ERP\2007.FDB
>
> On the same system in the Chat application we’ll have:
> Srv-B:C:\FDB\Msg.FDB;Srv-C:C:\FDB\Msg.FDB
>
> On other app we’ll have only:
> Srv-C:C:\FDB\ErrorLog.FDB
>
> As you see, a developer/DBA can use the servers he has as he wishes,
> depending on his needs.
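>
> A sketch of how a client could build such a connection string from
> RDB$Files, assuming a DB-API driver (e.g. the fdb module) and assuming
> that the node records are marked by a flag value (the 0x10 used here
> is only a made-up value):
>
> import fdb
>
> CLUSTER_NODE_FLAG = 0x10   # hypothetical flag for cluster-node records
>
> def cluster_connection_string(dsn, user, password):
>     con = fdb.connect(dsn=dsn, user=user, password=password)
>     try:
>         cur = con.cursor()
>         cur.execute("select rdb$file_name, rdb$file_flags"
>                     " from rdb$files")
>         nodes = [name.strip() for name, flags in cur
>                  if flags and (flags & CLUSTER_NODE_FLAG)]
>         return ";".join(nodes)
>     finally:
>         con.close()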
>
> e. Architecture independent
>
> This works for all configurations (super server, classic and embedded)
> and on any platform (and mix of platforms). Anyway, for embedded, I
> don’t see much use…
>
> f. Advanced load balancing
>
> Yes, it is nice to have, but the simple load balancing obtained by
> distributing the clients to the least used server (each client having a
> different data server, going to the most free one) will do a good job.
> Also, on Windows, at least DFS is implemented in this way. I don’t know
> about Linux. Advanced load balancing can (at first glance) be
> implemented, but then you must have a cross-platform engine for reading
> remote performance counters (on Win NT cores, at least XP/2003, it
> exists; not sure about Win 2000; don’t know about Unix/Linux). Or you
> can provide only a way to change the data server temporarily, based on
> an API function. Anyway, even if this isn’t implemented, having a
> failure-tolerant / simple load-balancing cluster is a big gain already.
>
> g. Reliability
>
> Because the transactions are started (and ended) simultaneously (no new
> TID is issued on any node until the other nodes catch up – the client
> must wait for the nodes), the cluster is always in a stable state,
> giving us a very reliable fault-tolerant cluster. On the other hand,
> one of the cluster bottlenecks is the speed of the slowest connection,
> but if this is an issue you can use the gateway. Also, the system has
> only one point of failure: the main switch (which is a piece of
> hardware) and/or the gateway system (i.e. the application itself, the
> computer and so on), if one is used.
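>
> Sketched in code, the “no node runs ahead” rule looks roughly like this
> (illustration only; the per-node start_transaction() is a made-up name):
>
> from concurrent.futures import ThreadPoolExecutor
>
> def start_transaction_everywhere(nodes, timeout=5.0):
>     pool = ThreadPoolExecutor(max_workers=len(nodes))
>     futures = {n: pool.submit(n.start_transaction) for n in nodes}
>     tids = {}
>     for node, fut in futures.items():
>         try:
>             # The client waits for the slowest node before going on.
>             tids[node] = fut.result(timeout=timeout)
>         except Exception:
>             node.status = 2          # offline/failed node
>     pool.shutdown(wait=False)
>     # Every surviving node is now at the same point in the TID sequence.
>     return tids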
>
> Hth,
>
> m. th.
>
>
> ----------
>
> PS: Any comments? :)
>