Subject | RFC: Clustering |
---|---|
Author | = m. Th = |
Post date | 2006-11-17T10:50:08Z |
Hi,
This is a draft about Firebird clustering; feel free to read and comment. This is the hot-swap variant (i.e. true clustering). The cold-swap variant (i.e. restart the application in order to move to another server with the data in sync) is already done, if Dmitry wants to commit the remote shadows feature for Windows, which I tested and benchmarked for him. This document is a draft based on our limited research in the area and on experience with similar products. If something isn't clear, please ask. Sorry for the long and convoluted text (a Word document is available on request). Also, if someone is interested, we can provide some diagrams.
1. Architecture
The bare thing:
The client will have multiple connection strings (separated by ';'), sending all commands to all the connections defined by these strings. Only one connection, if it is available, will return the data. The other connections are used for data and context synchronization.
If you want to know how the client gets these connection strings, how to deal with slow connections, node failures, rebuilding etc., read on.
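As a minimal sketch of this client-side fan-out (the helper names are assumptions made only for illustration; send_command() stands in for the real wire protocol and merely simulates a reply here):

def send_command(node, sql):
    # placeholder for the real remote call; a mirror node would answer just 'ok'
    return {"node": node, "sql": sql, "rows": []}

class ClusterConnection:
    def __init__(self, connection_string):
        # e.g. r"Server-A:C:\MyDB\DB01.FDB;Server-B:C:\MyDB\DB01.FDB"
        self.nodes = [s.strip() for s in connection_string.split(";") if s.strip()]
        self.data_server = self.nodes[0]   # in the normal case, the first node

    def execute(self, sql):
        result = None
        for node in self.nodes:
            # every command is sent to every node (data and context synchronization)
            reply = send_command(node, sql)
            if node == self.data_server:
                result = reply             # only the data server returns the data
        return result

This is only the idea of the fan-out; the real thing would happen inside the client library, behind the application's single connection string.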
2. How the cluster is created
a. The goal
The goal is to create identical copies of the database on each cluster node, each copy containing the list of the database files from the other nodes, stored in RDB$Files with the appropriate flags.
b. A proposed way (using SQL)
Issuing CREATE CLUSTER NODE <remote file> will use the remote shadows engine to create a copy of the database there and to put the appropriate flags in the header of the file and records in RDB$Files, both in the local file and in the remote one. This seems a little bit tricky, because it requires a Firebird server to be running on that computer… but I don’t think that this would be a big request. Also, the isql utility (or the equivalent) must be able to connect to the new node in order to update RDB$Files there. For example:
We are on Server-A:C:\MyDB\DB01.FDB and we issue:
CREATE CLUSTER NODE Server-B:C:\MyDB\DB01.FDB;
CREATE CLUSTER NODE Server-C:C:\MyDB\DB01.FDB;
On all the servers we’ll have in the RDB$Files (at least) the following
three records:
Server-A:C:\MyDB\DB01.FDB
Server-B:C:\MyDB\DB01.FDB
Server-C:C:\MyDB\DB01.FDB
IMHO, in the flags field we can store the node status: data server, online, offline/outdated.
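As a rough sketch of how a client could build the node list from RDB$Files (the flag values 0/1/2 are an assumption; the proposal above only says the node status goes into the flags field):

NODE_STATUS = {0: "data server", 1: "online mirror", 2: "offline/outdated"}

# rows as they might come back from:
#   SELECT RDB$FILE_NAME, RDB$FILE_FLAGS FROM RDB$FILES
rdb_files_rows = [
    (r"Server-A:C:\MyDB\DB01.FDB", 0),
    (r"Server-B:C:\MyDB\DB01.FDB", 1),
    (r"Server-C:C:\MyDB\DB01.FDB", 1),
]

connection_string = ";".join(name for name, _flags in rdb_files_rows)
for name, flags in rdb_files_rows:
    print(name, "->", NODE_STATUS.get(flags, "unknown"))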
c. Things to consider
i. All the clients must be out (perhaps the ‘deny new transactions’ shutdown mode is enough)
In other words, ‘secondary attachments cannot create clusters’: otherwise the users already connected to the database which is being promoted to a cluster will remain attached to ‘Server-A:C:\MyDb…’ while the new ones will connect to all the nodes, thus generating data de-synchronization between the nodes.
3. How it works (an example)
a. In normal conditions
When a client (let us name the clients CL1, CL2, ..., CL100) connects to one of these databases, which are now nodes in a cluster, the server it connected to will send back the new ‘connection string’, formed from all the cluster nodes, and then we’ll have, behind the scenes, an app with the following connection string:
Server-A:C:\MyDB\DB01.FDB; Server-B:C:\MyDB\DB01.FDB;
Server-C:C:\MyDB\DB01.FDB
(of course, the developer typed only ‘Server-A:C:\MyDB\DB01.FDB’ into TIBODatabase.DatabaseName)
The client will send all the commands to all three servers. For the DML commands this is obvious; the SELECTs are sent too in order to ensure the hot-swap capability (in the case of failure we need the contexts to be in sync).
IMHO, this architecture is better than a “relay” or “emitting” server which, after updating its own data, replicates it to the other nodes, also from a performance and complexity point of view: the server NIC will carry heavy traffic (each incoming data chunk must be sent on to the other nodes), the server CPU must run the replication process, heartbeat the other nodes to learn their state, and replicate the contexts over the cluster nodes, in other words the possible cursors, selects and selectable stored procedures in progress, transaction IDs and so on. Also, the replication process tends to be serialized, since it is the same engine which does it. (Of course you can do it with one thread per cluster node, but the simultaneous changes from many clients will remain serialized within each thread.) (We have “not so nice” experiences in this area with Windows clustering; anyway, a nice try and a big amount of work from M$, but the clustering engine was too slow and ‘fuzzy’ for our purposes.) Also, using a DLM with a storage quorum in the middle will require expensive hardware in order to make things fast and also introduces a new point of failure: the shared quorum storage. That’s why I propose a Majority Node Set cluster implementation in favor of a Single Quorum Device cluster. But this implementation doesn’t exclude the DLM approach. Also, the Majority Node Set model can be used on geographically dispersed nodes, which isn’t the case with the DLM.
The first server in the connection string (Server-A) will be, in the normal case, the data server and will return the data (rows from SELECT, rows from selectable procedures, output parameters and so on). This will be established at login time. The client will supply to each cluster node a login structure like:
Username: string;
Password: string;
<…many other things…>
LoginType: integer; //0 – normal mode (data server), 1 – mirror mode
(normal cluster node)
And will keep locally a list of nodes, each entry like:
Server
<…many other things…>
ServerStatus: integer; //0 – online data server, 1 – online mirror
cluster server, 2 – offline/failed node
If a client issues a ‘select’: if the LoginType is 0 then the server will act normally. If the LoginType is 1 then the server will return ‘ok’, begin executing the query (in a separate thread?) and, at each ‘fetch’ issued by the client, will advance through the result set, updating its buffers and returning only ‘ok’, without sending the actual data over the wire, for performance reasons. The client will expect only the ‘ok’ packet, based on the ServerStatus member of its local structure.
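A small sketch of that fetch behaviour (not real server code; the constants mirror the proposed LoginType values):

LOGIN_DATA_SERVER = 0   # normal mode: rows go over the wire
LOGIN_MIRROR = 1        # mirror mode: the cursor advances, only 'ok' is returned

def handle_fetch(cursor_rows, position, login_type):
    # advance the server-side cursor one row and decide what to send back
    if position >= len(cursor_rows):
        return position, "eof"
    row = cursor_rows[position]
    position += 1
    if login_type == LOGIN_DATA_SERVER:
        return position, row        # data server: send the actual row
    return position, "ok"           # mirror: keep buffers in sync, send only 'ok'

# tiny demonstration
rows = [("Ann",), ("Bob",)]
position, reply = handle_fetch(rows, 0, LOGIN_MIRROR)
print(reply)   # -> ok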
b. Election of data server
At login time the clients will try to connect to all the servers provided in the connection string. If the logins are successful, then each client will query each server for the CURRENT_TRANSACTION context variable. All the nodes which have CURRENT_TRANSACTION < Max(CURRENT_TRANSACTION(<Cluster>)) are marked with ServerStatus=2 (offline/failed node). If more nodes have CURRENT_TRANSACTION = Max(CURRENT_TRANSACTION(<Cluster>)), then check how many clients are attached to each node. Now, based on a configuration parameter stored in the .config file, we have two policies: ivory-tower and brotherhood (sorry for the ‘odd’ naming…). In the ivory-tower policy, the node which has the maximum number of clients attached to it will be the data server. This is for the case in which we have one very safe / very good / very fast server compared to the other nodes. In the brotherhood policy, the node which has the minimum number of clients will be the data server. This is useful for load balancing, when the servers are close in technical specifications, network speed, availability etc. If more nodes have the same maximum (or minimum, depending on policy) number of clients, then the data server will be the first of these nodes which appears in the connection string. (The client will update both ServerStatus locally and LoginType on the server.) It is important for the client to send the commands to all nodes simultaneously (in separate threads) in order to ensure that, at any moment in time, there are no TID discrepancies between the nodes. This is necessary in order not to mark as ‘offline’ nodes which are in fact online.
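A sketch of this election rule, under the stated assumptions (node records are plain dicts in connection-string order; the field names are made up for illustration):

def elect_data_server(nodes, policy="brotherhood"):
    # nodes: e.g. {"name": "Server-A:...", "current_transaction": 1042,
    #              "clients": 17, "status": 1}
    max_tid = max(n["current_transaction"] for n in nodes)
    for n in nodes:
        if n["current_transaction"] < max_tid:
            n["status"] = 2                    # offline/failed (outdated) node

    candidates = [n for n in nodes if n["status"] != 2]
    pick = max if policy == "ivory-tower" else min
    best_clients = pick(n["clients"] for n in candidates)

    for n in candidates:                       # first match = connection-string order
        if n["clients"] == best_clients:
            n["status"] = 0                    # online data server
            return n

The brotherhood default here is only for the sake of the example; the real default would come from the configuration parameter mentioned above.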
c. Failure of data server
Mark the ServerStatus of the corresponding server as 2 (offline/failed node), kill the sending thread to the failed node (the old data server), run a new election of the data server, and the client library must be able to continue fetching data (rows, parameters) from the new data server. The DML commands (if they exist) should continue on the other mirror sending threads. A breaking error must be raised only if no more nodes are available.
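A sketch of this failover step, reusing elect_data_server() from the election sketch above (stop_sending_thread() is a hypothetical placeholder):

def stop_sending_thread(node):
    pass   # placeholder: would terminate the per-node sender thread

def on_data_server_failure(nodes, failed, policy="brotherhood"):
    failed["status"] = 2                   # offline/failed node
    stop_sending_thread(failed)
    remaining = [n for n in nodes if n["status"] != 2]
    if not remaining:
        raise RuntimeError("no more cluster nodes available")   # the breaking error
    # re-run the election among the remaining nodes and keep fetching from the winner
    return elect_data_server(remaining, policy)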
d. Failure of mirror server
Mark the ServerStatus of the corresponding server as 2 (offline/failed node) and kill the corresponding sending thread.
4. Cluster gateway
Perhaps it is very good to have an optional program which acts as a gateway for the cluster. It will in fact be a middle tier between the clients and the cluster nodes, routing the clients’ requests to the active cluster nodes. It will have the same engine as above. How it works:
a. The gateway (located on ‘Server-R’) reads from a configuration file the listening port (let’s say that it is 3050) and the ‘routing table’, in a form something like:
Server-R:C:\Fdb\DB01.FDB=Srv-A:C:\Fdb\DB01.FDB;Srv-B:C:\Fdb\DB01.FDB
Server-R:C:\DB02.FDB=Srv-C:C:\DB02.FDB;Srv-B:C:\DB02.FDB
b. When a client connects to Server-R:C:\DB02.FDB, the middleware, using the protocol described above, will connect to Srv-C and Srv-B and thus will transparently serve the clients. Any cluster node state change will be logged somewhere.
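A sketch of parsing such a routing table into a map from the path published by the gateway to the real node list (the '=' / ';' format is the one proposed above; the parser itself is just an illustration):

routing_lines = [
    r"Server-R:C:\Fdb\DB01.FDB=Srv-A:C:\Fdb\DB01.FDB;Srv-B:C:\Fdb\DB01.FDB",
    r"Server-R:C:\DB02.FDB=Srv-C:C:\DB02.FDB;Srv-B:C:\DB02.FDB",
]

routing_table = {}
for line in routing_lines:
    published, _, targets = line.partition("=")
    routing_table[published.strip()] = [t.strip() for t in targets.split(";") if t.strip()]

# a client asking for Server-R:C:\DB02.FDB is routed to Srv-C and Srv-B
print(routing_table[r"Server-R:C:\DB02.FDB"])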
Things to consider:
a. The speed of sending commands over the wire is improved between the clients and the gateway (the client sends each command only once, to the gateway, instead of once per node). This is useful in the rare cases in which we have a slow connection (read: Internet, wireless) and we have to upstream a lot of data (mainly BLOBs).
b. The gateway is a new point of failure. That’s why it should remain optional. Only those who want to use it should enable it.
5. Things to consider
a. The cluster should not be spread across different network segments
Consider the following network configuration (for those of you who don't see the 'drawing' clearly, change the font to a fixed-width one, e.g. Courier):
CL1 Server-A
| |
CL2 - Switch 1 ----- Switch 2 – Server B
| |
Server-C CL3
Let’s say that the data server is Server-A.
If a network failure appears between Switch 1 and Switch 2 (the cable is broken), then CL1 and CL2 will elect Server-C as the new data server, while CL3 will keep Server-A as its data server. Such network configurations should generally be avoided, or a gateway should be used (see above). This is a general problem with clusters. But, IMHO, I would rather have a switch as a single point of failure than an application (the gateway, which is in fact the middle tier between the components). Also, if the switches support (R)STP ((Rapid) Spanning Tree Protocol) and are configured properly, then we have no single point of failure.
b. Very simple configuration
In other words, there isn’t something like a ‘cluster console’ or centralized cluster administration (though it can easily be built), and the cluster configuration can be read through a SELECT on RDB$Files; it isn’t transparent to the developer (it is transparent only to the user), thus leaving the developer the entire responsibility (and flexibility) to configure it through the API or (please implement) SQL commands.
c. No administration costs, near to 0 down time
On the other hand, the administration costs are (near to) zero and ‘rebuilding’ the cluster means in fact a ‘simple’ file copy (here, of course, a variant of CREATE SHADOW would be much more indicated, due to the issues with data corruption, shutting down all the servers etc.). Also, if a SQL command (based on the CREATE SHADOW engine) is implemented in order to copy the data to the new/restored nodes, then we’ll have (near to) zero downtime. If you use a gateway, this can be automated.
d. The developers can build ‘x’ clusters on ‘y’ servers
Because the clusters are tied to databases, the developers can use the Firebird servers to build different clusters with as many nodes as they want. For example:
In application Erp.exe we’ll have the following connection string (built
by the client from RDB$Files):
Srv-A:C:\ERP\2007.FDB;Srv-B:C:\ERP\2007.FDB;Srv-C:C:\ERP\2007.FDB
On the same system in the Chat application we’ll have:
Srv-B:C:\FDB\Msg.FDB;Srv-C:C:\FDB\Msg.FDB
In another app we’ll have only:
Srv-C:C:\FDB\ErrorLog.FDB
As you see, a developer/DBA can use the servers he has as he wishes, depending on his needs.
e. Architecture independent
This works for all configurations (SuperServer, Classic and Embedded) and on any platform (and mix of platforms). Anyway, for Embedded I don’t see much use…
f. Advanced load balancing
Yes, it is nice to have, but the simple load balancing obtained by distributing the clients to the least-used server (each client having its own data server, going to the most free one) will already do a good job. Also, on Windows at least, DFS is implemented in this way; I don’t know about Linux. Advanced load balancing can (at first glance) be implemented, but then you must have a cross-platform engine for retrieving remote performance counters. On Windows NT cores it exists (at least on XP/2003; I’m not sure about Windows 2000 and don’t know about Unix/Linux). Or you can provide only a temporary change of the data server based on an API function. Anyway, even if this isn’t implemented, having a fault-tolerance/simple load-balancing cluster is already a big gain.
g. Reliability
Because the transactions are started (and ended) simultaneously on all nodes (no new TID is issued on any node until the other nodes catch up; the client must wait for the nodes), the cluster is always in a stable state, giving us a very reliable fault-tolerant cluster. On the other hand, one of the cluster bottlenecks is the speed of the slowest connection, but if this is an issue you can use the gateway. Also, the system has only one point of failure: the main switch (which is a piece of hardware) and/or the gateway system (i.e. the application itself, the computer and so on), if one is used.
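A minimal sketch of that rule (start_transaction_on() is a hypothetical placeholder that here only simulates the acknowledgement from a node):

def start_transaction_on(node):
    return {"node": node["name"], "tid_ack": True}   # simulated acknowledgement

def start_cluster_transaction(nodes):
    online = [n for n in nodes if n["status"] != 2]
    acks = [start_transaction_on(n) for n in online]   # the client waits for every node
    if not all(a["tid_ack"] for a in acks):
        raise RuntimeError("cluster nodes out of sync")
    return acks                                        # only now is the new TID in use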
Hth,
m. th.
----------
PS: Any comments? :)