firebird-support - Weird behaviour

Subject	Weird behaviour
Author
Post date	2017-02-13T18:21:51Z

Hi All,

I have encountered something that I would describe as weird behaviour, although I must admit that I'm not an expert on Firebird and cannot say whether this should be expected or not. So... first some basic info:
We have a system that relies on Firebird for persistent storage. The system is fairly mature, but new functions are added as needed.

The system uses Firebird 2.5.4 SuperServer, IBProvider v2 and MDAC 2.7. While researching the issue I updated our reference system to Firebird 2.5.6, since source code and PDBs were available for that version; the issue still occurs with 2.5.6. We are using WAIT transactions (i.e. the default transaction mode).

A thread is trying to perform an SQL select because a client wants to display detailed information for a certain row.
Basically: Select * from TABLE where PKEY = PKEY
The thread acquires a lock in our server, then works its way through GDS32.dll (fbclient.dll) and then times out on the call to the WinSock2 select (I guess it retries if there is no success, because the call never returns).
MON$STATEMENTS shows a stalled SQL statement corresponding to the SQL statement above.
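
For anyone who wants to look at the same monitoring data, here is a minimal sketch of a query that joins MON$STATEMENTS to MON$TRANSACTIONS, so each non-idle statement is shown together with the isolation mode and lock timeout of its transaction. It assumes a Python DB-API style cursor from some Firebird driver; the helper name non_idle_statements and the dictionary keys are made up for illustration and are not part of our system.

```python
# Sketch only: assumes `cursor` is a DB-API cursor from a Firebird driver.
# MON$STATE = 0 means idle, so "<> 0" keeps active and stalled statements.
NON_IDLE_QUERY = """
SELECT s.MON$STATEMENT_ID, s.MON$ATTACHMENT_ID, s.MON$STATE,
       s.MON$TIMESTAMP, t.MON$ISOLATION_MODE, t.MON$LOCK_TIMEOUT,
       s.MON$SQL_TEXT
FROM MON$STATEMENTS s
LEFT JOIN MON$TRANSACTIONS t
       ON t.MON$TRANSACTION_ID = s.MON$TRANSACTION_ID
WHERE s.MON$STATE <> 0
"""

# Hypothetical key names, one per selected column, in the same order.
COLUMNS = (
    "statement_id", "attachment_id", "state", "started",
    "isolation_mode", "lock_timeout", "sql_text",
)


def non_idle_statements(cursor):
    """Return one dict per non-idle statement (hypothetical helper)."""
    cursor.execute(NON_IDLE_QUERY)
    # MON$SQL_TEXT is a BLOB; depending on the driver it may need to be
    # read explicitly before it can be printed.
    return [dict(zip(COLUMNS, row)) for row in cursor.fetchall()]
```

A lock timeout of -1 in MON$TRANSACTIONS corresponds to an unlimited WAIT, which is worth checking against what the provider is actually asking for.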

So it all boils down to this: is this behaviour due to bad transaction management on our part, or is it an issue with Firebird?

Background + additional info:
I have been looking into an issue reported by one customer where the system seems to deadlock when working with a specific table. In the configuration used, the system has four clients operating against a server which in turn utilizes Firebird.
The clients basically display the table as a grid. The user can "open" a row in order to display more detailed information. When this occurs, the server changes the state of the row to open (we have a column in the database for this). The user then chooses to perform an operation on the row (changing the same column) and closes the detailed view, again changing the status of the row. For each status change the clients reload the row in order to display updated information.

Now everything seems to work OK when doing work from only one client. When working with two clients at a moderate pace, things also work OK. But at a high (to frantic) pace a deadlock will occur within 15 minutes, which might suggest some kind of race condition. The thing is that the status changes I mentioned when describing the "work flow" can be done with the keyboard using a command sequence similar to [Ctrl+O], [Ctrl+O], [Enter], allowing the user to change the status 3 times in less than a second. When I write high pace, that's what I mean. So in order to reproduce the issue (at the office) I need two users hammering away at their keyboards for 15 minutes. Please note that the table in this case contains fewer than 10000 rows, sometimes as few as 1000 rows. However, at the customer's site the problem is easier to reproduce. There the issue seems to occur with one user working at high pace and one doing the work flow at, let's say, 15 seconds per row. The hardware is the same in both cases (model, manufacturer etc.) and the network should have the same topology; the users are different though, so maybe that is the reason why it's harder to reproduce at the office.
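
The manual reproduction above could in principle be automated. The following is only a sketch under assumptions: change_status stands for whatever call a client uses to change the status of a row (hypothetical here), and the status names "open", "processed" and "closed" are placeholders for the three keyboard-driven changes. Each simulated client cycles through the rows as fast as it can, and any thread still alive after a grace period is reported as hung.

```python
import threading
import time


def hammer(change_status, row_ids, stop_at, errors):
    """Cycle every row through three status changes until the deadline."""
    while time.monotonic() < stop_at:
        for row_id in row_ids:
            # Placeholder status names, not the real column values.
            for status in ("open", "processed", "closed"):
                try:
                    change_status(row_id, status)
                except Exception as exc:
                    errors.append(exc)
                    return


def run_stress(change_status, row_ids, clients=2, seconds=1.0, grace=5.0):
    """Run `clients` hammering threads; return (hung thread names, errors)."""
    stop_at = time.monotonic() + seconds
    errors = []
    threads = [
        threading.Thread(
            target=hammer,
            args=(change_status, row_ids, stop_at, errors),
            daemon=True,  # a hung worker must not keep the process alive
        )
        for _ in range(clients)
    ]
    for t in threads:
        t.start()
    for t in threads:
        # Wait for the run plus a grace period, then give up on the thread.
        t.join(seconds + grace)
    hung = [t.name for t in threads if t.is_alive()]
    return hung, errors
```

Running it with seconds=900 and clients=2 would roughly correspond to the two users working for 15 minutes; a non-empty list of hung threads is the signal to take a memory dump.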

So, since this seemed to be a deadlock which I presumed occurred within the scope of our server, I created a memory dump. It showed a deadlock where one server-side thread had acquired a lock in our server and then seemed to be stuck on the winsock call 'select', but I could not get a decent stack trace due to missing symbol files. I have since changed Firebird to version 2.5.6, downloaded the symbol files and also inserted some very basic trace logging around the suspected culprit (the select call). I have also set the network timeout values in firebird.conf, since a code review of fbclient showed these might be used as the timeout parameter to the winsock2 select call.
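
To illustrate why a select call with a timeout can still look like a hang to the caller, here is a small sketch using Python's select module rather than the actual fbclient code (which this does not reproduce). If the caller simply retries after each timeout and the peer never answers, the wait never ends; the hypothetical wait_readable helper below adds an overall deadline on top of the per-call timeout so the wait is guaranteed to return.

```python
import select
import socket
import time


def wait_readable(sock, timeout_s, max_wait_s):
    """Poll `sock` with a per-call timeout, giving up after max_wait_s.

    Returns (readable, attempts). Without the overall deadline, a loop
    that retries after every timeout never returns if the peer is silent.
    """
    deadline = time.monotonic() + max_wait_s
    attempts = 0
    while True:
        remaining = deadline - time.monotonic()
        if remaining <= 0:
            return False, attempts
        attempts += 1
        readable, _, _ = select.select(
            [sock], [], [], min(timeout_s, remaining)
        )
        if readable:
            return True, attempts


if __name__ == "__main__":
    a, b = socket.socketpair()
    # The peer is silent, so this returns False once the deadline passes.
    print(wait_readable(a, timeout_s=0.5, max_wait_s=2.0))
    b.sendall(b"reply")
    # Data is pending now, so the first select call already succeeds.
    print(wait_readable(a, timeout_s=0.5, max_wait_s=2.0))
    a.close()
    b.close()
```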
I have also added some code to make sure a blocking call to select doesn't occur (the timeout is always set if not already set). However, the timeout used when reproducing the issue comes from the firebird.conf file. After these modifications I have deduced the following:

Info from the memory dump, my trace logs etc. is available upon request.

Thank you in advance!

/John Karlsson