Subject: Re: [Firebird-Java] reuse of connection after failure.
Author: Andrew Goedhart
The real problem is that the connection keeps being returned by the DataSource at the start of subsequent jobs, even though it is no longer usable. When a job tries to use the connection it dies horribly, and this keeps happening. The only way to recover is to kill the server and reload. We have roughly 30 other connections attached to the database from the current and other servers in the cluster, and these seem to be fine.

This also happens on a newly restored database. The database is rather large, 180+ GB and growing rapidly; at last count the readings table had about a billion records, so I guess we are working Firebird hard. I actually don't mind losing one or two readings (I never said that :-) from the incoming streams if I can keep the servers up. By the way, what is the maximum number of records per table for Firebird 2 servers? Firebird 1.5 had a limit of 4 billion records per table. Has this changed, or do I need to start looking for a solution? Only half of my units are reporting into the new system currently.

Since the start of this discussion we have changed the JMS server to kill the worker thread and create a new one. This seems to help; it looks like JBoss is tracking transactions per thread.

We previously had huge problems with how JBoss handles failure on the JMS cluster and between ActiveMQ and Firebird, which is the reason for the XA transactions. It means we load our custom JMS bean server as a SAR in JBoss and do manual XA transaction control. We are most likely the ones initiating the rollback when we realise that something has gone wrong with the current job.

Local transactions seem to be a problem. I don't know why, but if I set the datasource to use only local transactions, after a while Firebird starts deadlocking and not recovering properly. After about an hour it snowballs and everything grinds to a halt, generally around the status table, whose records are constantly being updated. I don't know why XA transactions avoid this, but they seem to: with XA transactions, provided we don't hit the link problem above, I can run the cluster for days before something else kills us. The status and vehicle tables are also among the few tables where we need explicit locking to serialize access to records and avoid continual rollbacks due to simultaneous updates.
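For what it's worth, on those hot tables we combine Firebird's pessimistic `SELECT ... WITH LOCK` with a retry loop around update conflicts. A stripped-down sketch of the retry side (class and method names are made up for illustration; the SQLSTATE check is an assumption, in my experience Firebird reports deadlock/update conflicts as serialization failure 40001, but verify against your server and Jaybird version):

```java
import java.sql.SQLException;
import java.util.concurrent.Callable;

// Hypothetical retry wrapper, not a Jaybird API. The SQLSTATE check is an
// assumption: Firebird reports deadlock/update conflicts as serialization
// failure 40001 in my experience.
class ConflictRetry {

    /** Run op, retrying when it fails with a deadlock/update-conflict
     *  SQLException; any other failure propagates immediately. The caller
     *  must roll back its transaction before each retry. */
    static <T> T withRetry(Callable<T> op, int maxAttempts) throws Exception {
        SQLException last = null;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                return op.call();
            } catch (SQLException e) {
                if (!isConflict(e)) throw e;  // only retry serialization conflicts
                last = e;
            }
        }
        if (last != null) throw last;
        throw new IllegalArgumentException("maxAttempts must be >= 1");
    }

    // SQLSTATE 40001 = serialization failure (assumption, see above)
    private static boolean isConflict(SQLException e) {
        return "40001".equals(e.getSQLState());
    }
}
```

The wrapped callable would do the `WITH LOCK` select and the update inside one transaction.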

We currently run gfix as a cron job in the background. Limbo transactions tend to cause page load failures in the web interface, so we try to catch them as soon as possible. (gfix is run every minute or so to clear limbo transactions; this currently does not seem to have a negative effect, but I welcome comments on the practice.)
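For the record, the cron entry is essentially the following (a hypothetical example; the path, credentials, and database alias are placeholders, and whether you want `-two_phase`, `-commit all`, or `-rollback all` depends on your setup):

```
# hypothetical crontab entry: attempt automated two-phase recovery of
# limbo transactions every minute; -commit all / -rollback all are the
# manual alternatives
* * * * * /opt/firebird/bin/gfix -two_phase all -user SYSDBA -password XXXX dbserver:/data/units.fdb
```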

If I understand you correctly, you are saying that the connection pooling in this case is handled at the JBoss level and not at the Jaybird level? That means I have just been lucky for the last few hours :-) and my changes to the defaults have no effect.

Maybe the thread suicide is doing the job when it discovers an unrecoverable error?

Any idea how to force JBoss to kill a connection and not attempt to reuse or recover it? It seems that the recovery is failing in Firebird.

By the way, how does one turn on tracing with Jaybird? Using the command line system property on JBoss and including org.firebirdsql = DEBUG in the log4j.xml file does not seem to be enough.



Andrew
----- Original Message -----
From: Roman Rokytskyy <rrokytskyy@...>
To: Firebird-Java@yahoogroups.com
Sent: Monday, March 12, 2007 5:47:02 PM GMT+0200 Africa/Harare
Subject: Re: [Firebird-Java] reuse of connection after failure.

Hi,

> Maybe a red herring, but we have noticed one or two SIGSEGVs in the
> Firebird logs. We are running CS 2.01 RC2 on a separate machine. The
> problem seems to happen more frequently when we are under heavy load:
> approximately 50,000+ transactions (1,000,000+ queries/inserts) per
> hour. The system may run normally for hours/days and then trigger this
> on one machine in the cluster. Under heavy load it may trigger in 30 or
> so minutes. I am guessing that when Firebird sigsegvs we lose the
> connection and the connection pool is not recovering properly.

Yes, this is what happens.

> After changing the
> defaults on the connection pool (hard-coded, see below) so that it tends
> to throw away connections more often (maxIdleTime = 15) and pings
> connections, we seem to recover more easily.
>
> org.firebirdsql.pool.FBPoolingDefaults
> public static final int DEFAULT_IDLE_TIMEOUT = 10 * 1000;
> public static final int DEFAULT_PING_INTERVAL = 5 * 1000;
>
> org.firebirdsql.pool.BasicAbstractConnectionPool
> pingStatement = "SELECT CAST(1 AS INTEGER) FROM RDB$DATABASE";
>
>
> My real question, however, is how to get these hard-coded defaults into
> the JBoss firebird-ds.xml data source configuration file, so that I can
> use standard releases in the future.

This can't work. The issue is that in your configuration you're using the
JCA connections, which have nothing to do with our pool.
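The pool settings that do apply here live in JBoss's own deployment descriptor, not in Jaybird. A hedged sketch in the plain <xa-datasource> style (element names are from the JBoss 4.x *-ds.xml format; if you deploy Jaybird as a RAR you would use <tx-connection-factory> instead, where not all of these apply):

```xml
<!-- Hypothetical firebird-ds.xml fragment; adapt to your JBoss version. -->
<datasources>
  <xa-datasource>
    <jndi-name>FirebirdDS</jndi-name>
    <!-- validate a connection before handing it out of the pool -->
    <check-valid-connection-sql>SELECT CAST(1 AS INTEGER) FROM RDB$DATABASE</check-valid-connection-sql>
    <!-- retire idle connections quickly, similar in spirit to the
         hard-coded Jaybird pool defaults you patched -->
    <idle-timeout-minutes>1</idle-timeout-minutes>
  </xa-datasource>
</datasources>
```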

The exceptions that you see show the following:

1. JBoss tries to start a transaction. Since Firebird crashed and the
current socket was already closed, Jaybird throws an exception (not shown
in your stack trace, but it is something like "Cannot connect to server"
or "Error reading data from connection").

2. JBoss creates a new connection and the system runs on.

3. The Transaction Coordinator in JBoss notices that some XA transaction
is not yet finished (the one during which Firebird crashed) and tries to
roll it back. It uses the corresponding mechanisms of XA-enabled
connections (the rollback method with an Xid but no managed connection).

4. Jaybird connects to the restarted Firebird and tries to end the
in-limbo transaction.
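In outline, steps 3 and 4 are the standard XA recovery scan. A hypothetical helper (not JBoss's actual recovery code) using only the javax.transaction.xa API:

```java
import javax.transaction.xa.XAException;
import javax.transaction.xa.XAResource;
import javax.transaction.xa.Xid;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper showing the shape of steps 3-4: the transaction
// manager asks the resource for in-doubt (in-limbo) branches and rolls
// each one back by Xid alone, with no managed connection in sight.
class XaRecoverySketch {

    /** Roll back every in-doubt branch the resource reports; returns the Xids handled. */
    static List<Xid> rollbackInLimbo(XAResource xa) throws XAException {
        List<Xid> handled = new ArrayList<>();
        // TMSTARTRSCAN | TMENDRSCAN = one complete recovery scan
        Xid[] inDoubt = xa.recover(XAResource.TMSTARTRSCAN | XAResource.TMENDRSCAN);
        if (inDoubt != null) {
            for (Xid xid : inDoubt) {
                xa.rollback(xid);  // rollback by Xid only, as in step 3
                handled.add(xid);
            }
        }
        return handled;
    }
}
```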

Now the interesting part happens. Here's the code:

try {
    if (tempLocalTx != null && tempLocalTx.inTransaction())
        tempLocalTx.commit();
} finally {
    if (tempMc != null)
        tempMc.destroy();
}

The error you quoted happens in tempMc.destroy(). This managed connection
was used to end the in-limbo transaction. For that it also started a new
transaction (it runs a database query to find in-limbo transactions),
which it should close before destroying the managed connection. Since
that did not happen, it looks like something went wrong earlier.

Now the question is: what? You can check whether the same exceptions
happen when you manually kill Firebird during low load. If this turns out
to be reproducible, we can track it down.

In general, what you see is not too dangerous. The worst thing that can
happen is in-limbo transactions (which inhibit garbage collection), but
you can fix those with gfix. The second stack trace shows only that JBoss
was unable to complete the in-limbo transactions automatically.

And finally, if you don't really need XA transactions, switch to local
transactions. They are also more lightweight for Firebird (and for JBoss).
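If you do switch, it is mostly a matter of the datasource style in firebird-ds.xml. A minimal hedged skeleton (JNDI name, URL, and credentials are placeholders):

```xml
<datasources>
  <local-tx-datasource>
    <jndi-name>FirebirdLocalDS</jndi-name>
    <connection-url>jdbc:firebirdsql://dbhost:3050/data/db.fdb</connection-url>
    <driver-class>org.firebirdsql.jdbc.FBDriver</driver-class>
    <user-name>SYSDBA</user-name>
    <password>XXXX</password>
  </local-tx-datasource>
</datasources>
```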

Roman