Subject Re: [firebird-support] connection refused, fbserver processes rising
Author Mario Balfanz
Hello Helen,

thank you very much for your detailed thoughts. See my answers below...

Helen Borrie wrote:

> At 01:43 AM 21/03/2006, you wrote:
> >Hello all,
> >
> >last week we experienced some strange problems with our Firebird
> >server. All connections were refused while the number of fbserver
> >processes kept rising. Details:
> >
> >- FirebirdSS-1.5.2.4731-0 running on SuSE Standard Server 8 with
> >kernel k_smp-2.4.21-273
> >- Clients are various selfmade applications written in Borland Delphi
> >
> >
> >No connection possible. So again I did "rcfirebird stop" and "killall
> >fbserver". Here's the part of firebird.log from startup on Saturday to
> >"killall" on Monday morning:
> >
> >samba1 (Client) Sat Mar 18 17:31:48 2006
> > /opt/firebird/bin/fbguard: guardian starting bin/fbserver
> >samba1 (Client) Mon Mar 20 07:59:54 2006
> > INET/inet_error: read errno = 104
> >samba1 (Client) Mon Mar 20 07:59:54 2006
> > INET/inet_error: receive in try_connect errno = 104
> >samba1 (Client) Mon Mar 20 08:00:53 2006
> > /opt/firebird/bin/fbguard: bin/fbserver terminated abnormally (-1)
> >
> >Nothing in /var/log/messages or /var/log/warn.
> >
> >We have some processes on different clients that try to connect to the
> >database every few minutes: nagios and two of our own Delphi apps.
> >Something seems to go wrong in a way that every db connection gets
> >refused and each connection attempt leaves an fbserver process in memory.
> >
> >My questions:
> >- What does the error code 104 in the log mean? I also find some
> >errno=4 and 9 in the logs. On firebird.sourceforge.net I could only
> >find the SQL error table. Where's the documentation for the error
> >codes in the logfile?
>
> The INET errors are TCP/IP errors. 104 means that either the server
> crashed or a client crashed. Given the number of fbserver threads
> that are running when problems start occurring, and the fact that the
> problems started only recently, I would suspect a Denial of Service
> attack. You're starting to see network errors occurring once
> network, resource and Fbserver limits are reached. ~1000 concurrent
> connections is beyond the practical limits for SS.

OK. 104 = something crashed... That's definitely true ;-) On Saturday I
know that very few clients were turned on. They showed messages
that the server was not responding, but none of them crashed as far as I
know. All clients resumed working, without being restarted, once the
fbserver was restarted.
I think I can rule out a DoS attack. The DB cannot be reached from
outside, and on Saturday I was the only one in the company.
Is there any documentation of all the error codes that may come from fb?
Or do you have to read the source code for that?

> The single-digit error codes are coming from SuSE. You should be
> able to find what's causing them by studying the manual.

Thanks, I will try to find something.
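
A side note on the numbers: the errno values in firebird.log appear to
be ordinary operating-system error numbers. On Linux, 104 is ECONNRESET
("Connection reset by peer"), 9 is EBADF and 4 is EINTR. A minimal
Python sketch for looking them up on the server itself (the helper name
describe_errno is made up for illustration; the numbering is
platform-specific, so the result is only meaningful on the machine that
wrote the log):

```python
import errno
import os


def describe_errno(code):
    """Return (symbolic name, OS message) for a numeric errno value."""
    name = errno.errorcode.get(code, "UNKNOWN")
    return name, os.strerror(code)


if __name__ == "__main__":
    # The three values mentioned in this thread.
    for code in (104, 9, 4):
        name, message = describe_errno(code)
        print(f"errno {code}: {name} ({message})")
```

This only translates the number; it does not say which side reset the
connection or why.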

> Let's suppose that it's not your Delphi client applications
> themselves that are causing this problem...considering that you
> didn't have such problems before, throughout 2 years of usage.
>
> Because you are using TCP/IP, not all of those fbserver threads are
> necessarily client attachments. The server creates threads for
> various tasks of its own: garbage collection and some other
> stuff. And the server can allocate new attachments to a thread that
> is still "alive" and has nothing to do. However, if the number of
> fbserver threads is a lot higher than the actual number of connected
> users then *something* naughty is going on.

Hmm, interesting. Is it possible to turn on a higher log level? Maybe
to see incoming/closing connections with their source IP,
creation/destruction of worker threads and so on? I didn't find
anything in firebird.conf.

> If it is malicious, think about a recently-fired employee, especially
> if you didn't change the passwords when s/he left. Or an unhappy
> person in-house.

I don't think so.

> If it is accidental, then look for recently-hired users who have
> access to the database to run ad hoc queries or network hardware that
> is subject to intermittent faults (allowing a user to connect but
> losing the connection intermittently...Maybe a user who just started
> using a wireless LAN connection? or one who "cures" a slow query by
> crashing out of whatever query tool he is using? Several users with
> new machines/recent Windows upgrades, where faulty TCP/IP setup is
> dropping the LAN connection when the user is accessing the Internet?

Very few people here even know what a database is, or that it may be
possible to connect with something other than the Delphi apps we give
them. Our company is a WLAN-free zone. All PCs have 100 Mbit wired LAN.
No network trouble known so far. We have a kind of branch office that is
connected by VPN over an inexpensive DSL line that has no fixed IP.
The provider cuts the line once a day to give it a new address. At the
moment this occurs around 11:45 (I checked this), so there is no
correlation with what we observed on Wednesday morning and Saturday
evening. And, as I mentioned before, on Saturday there was no one in the
company but me.

> Basically, abnormally dropped connections take 2 hours to time
> out. Worse, if a connection is dropped while the request-processing
> subsystem is waiting to receive the balance of a packet that is
> larger than network packet size, e.g. a blob or a long SQL query,
> then the whole system will seem to freeze and will stay that way
> until keepalive has kicked in and the (timeout period - time of last
> keepalive packet received) becomes zero.

Oops, that sounds dangerous, especially with our VPN/DSL connection,
which may cause this kind of dropped connection. Do you really mean that
the whole system freezes? Not only the fb server? Then we have luckily
never had this effect. And the period for which the system seems to
freeze in this case is two hours? Or did I misunderstand you?
When a connection is dropped, but without that worse freezing effect,
will this be logged? And how? Is it this errno=104?
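
To put a number on the two hours: on Linux the TCP keepalive timers are
kernel settings under /proc/sys/net/ipv4, and the stock defaults are
7200 s of idle time, then 9 probes at 75 s intervals. The sketch below
reads them and adds them up. It assumes that the timeout Helen describes
is this OS-level keepalive and that fbserver enables it on its sockets;
neither is confirmed here. The function names are made up for the
example:

```python
from pathlib import Path

# Stock Linux defaults, used when /proc is not readable.
DEFAULTS = {
    "tcp_keepalive_time": 7200,    # idle seconds before the first probe
    "tcp_keepalive_intvl": 75,     # seconds between probes
    "tcp_keepalive_probes": 9,     # unanswered probes before giving up
}


def read_keepalive(proc_dir="/proc/sys/net/ipv4"):
    """Read the kernel keepalive settings, falling back to the defaults."""
    values = {}
    for name, default in DEFAULTS.items():
        try:
            values[name] = int(Path(proc_dir, name).read_text().strip())
        except (OSError, ValueError):
            values[name] = default
    return values


def dead_peer_detection_seconds(idle, interval, probes):
    """Worst-case seconds until an idle, silently dropped peer is noticed."""
    return idle + interval * probes


if __name__ == "__main__":
    cfg = read_keepalive()
    total = dead_peer_detection_seconds(
        cfg["tcp_keepalive_time"],
        cfg["tcp_keepalive_intvl"],
        cfg["tcp_keepalive_probes"],
    )
    print(f"dead peer detected after at most {total} s ({total / 3600:.2f} h)")
```

With the defaults this comes to 7875 s, a little over two hours, which
matches the figure above. Lowering tcp_keepalive_time affects every
keepalive-enabled socket on the machine, not only Firebird's.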

Do you have any idea about my shutdown problem when I had to kill one of
the fbservers to complete the shutdown sequence? Maybe the cause is the
same...

> Finally, was Nessus added to the picture anywhere recently? Nessus
> is not friendly to Firebird 1.5.

No, Nessus was not used recently.

> ./heLen

So, what is left to do? I hope there is a way to get some more logging,
so I would have the chance to see if one of the clients caused the
trouble when it happens again. On the other hand, I hope it does *not*
happen again.
I could turn off the network monitoring tool Nagios. This would reduce
the number of frequent connections to the db. But then I would never
know whether Nagios really caused the trouble. That is hard to prove
for an error that occurs only from time to time.
I could also update to FirebirdSS-1.5.3.4870.i686.rpm if you would
recommend that. But again, I will never know if the error is really gone
in this version or if it may happen again every second... I don't like
this feeling. I'd prefer to find the cause.

Kind regards,
Mario





