Subject Re: [firebird-support] FB 1.5.4 engine unexpectedly terminating
Author Bud Millwood
On Thursday 22 November 2007 10:46, Alan McDonald wrote:
> > Hi,
> > in the last few weeks we have experienced several unexpected crashes
> > of a FB 1.5.4 engine on a Windows 2003 Enterprise SR1 cluster.
> >
> > The MS Service Manager logs:
> >
> > The Firebird Server - DefaultInstance service terminated
> > unexpectedly. It has done this 1 time(s).
> >
> > on termination, which in itself is not very helpful.
> > The firebird log at the time of the shutdown shows plenty of messages:
> >
> > AHA20 (Client) Wed Nov 21 21:19:09 2007
> > INET/inet_error: send errno = 10054
> >
> > AHA20 (Client) Wed Nov 21 21:19:09 2007
> > REMOTE INTERFACE/gds__detach: Unsuccesful detach from database.
> > Uncommitted work may have been lost
> >
> > AHA20 (Client) Wed Nov 21 21:19:09 2007
> > INET/inet_error: connect errno = 10061
> >
> > There are 20 entries for 10054 and 31 for 10061, while the
> > unsuccessful detach appears only once in the log. The counts
> > vary from crash to crash. Other network communication to and
> > from the cluster node running the FB server does not seem to be
> > interrupted at that time. At other times there is an
> > occasional 10054 error, but nothing that stands out.
> >
> > What could possibly be the source of this?
> >
> >
> > Additionally, I would like to get a Dr. Watson trace of this
> > crash, should it happen again and be something that generates a dump.
> >
> > What would be the performance penalty of using the PDB-enabled
> > debug version of FB to get that trace? Is there any
> > description of how to set up the debug version?
> >
> > Best regards
> > Bjoern
>
> As soon as I see 10054, I suspect the NIC on either the server or the
> client(s). If it's always one client, then it's probably that client's
> NIC; if it's everyone, then it's probably the server NIC.

Not FB-specific: I think ICMP error messages (e.g. "port unreachable")
coming back to a listening server can generate 10054 (WSAECONNRESET) errors
as well, so the cause is not necessarily a NIC; it could be any general
network problem, including transient errors. On connectionless sockets it
may be best to simply ignore 10054 in the name of robustness, since IIRC
your socket will still be open and can continue to receive datagrams.
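
To illustrate the "ignore 10054 on a connectionless socket" idea, here is a
minimal Python sketch. It is not Firebird code; the names recv_datagram,
is_transient_reset and max_resets are made up for this example, and the cap
on consecutive resets is an assumption added so that a persistent fault
still surfaces instead of spinning forever.

```python
import errno

# Winsock reports a reset as 10054 (WSAECONNRESET); POSIX uses errno.ECONNRESET.
WSAECONNRESET = 10054


def is_transient_reset(exc):
    """Return True if the OSError looks like a reset (10054 / ECONNRESET)."""
    return (
        isinstance(exc, ConnectionResetError)
        or getattr(exc, "winerror", None) == WSAECONNRESET
        or exc.errno in (errno.ECONNRESET, WSAECONNRESET)
    )


def recv_datagram(sock, bufsize=4096, max_resets=100):
    """Return the next (data, addr) pair, skipping transient resets.

    `sock` is any object with a recvfrom(bufsize) method, e.g. a UDP socket.
    Errors other than a reset, or more than `max_resets` resets in a row
    (hypothetical limit), are re-raised to the caller.
    """
    resets = 0
    while True:
        try:
            return sock.recvfrom(bufsize)
        except OSError as exc:
            if not is_transient_reset(exc) or resets >= max_resets:
                raise
            resets += 1  # socket is still usable; just try again
```

A server loop would call recv_datagram(sock) in place of sock.recvfrom(...),
so a stray reset does not end the loop.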

Basically, for connectionless sockets, you don't want a transient error to
take everything down. I'm not sure about connection-oriented sockets, but
generally speaking, transient errors should be designed for and handled
robustly.
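
For the connection-oriented case, one common way to design for transient
errors is to retry the connect with a growing delay. The sketch below is
only an illustration in Python under that assumption, not how Firebird or
its client library behaves; connect_with_retry, attempts and base_delay are
hypothetical names, and the connect/sleep parameters exist only so the
function can be exercised without a network.

```python
import socket
import time


def connect_with_retry(host, port, attempts=5, base_delay=0.5,
                       connect=socket.create_connection, sleep=time.sleep):
    """Try to open a TCP connection, retrying on refused/reset/timeout.

    Waits base_delay, 2*base_delay, 4*base_delay, ... between attempts and
    re-raises the last error once `attempts` tries have failed.
    """
    last_exc = None
    for attempt in range(attempts):
        try:
            return connect((host, port), timeout=5)
        except (ConnectionRefusedError,   # 10061 on Windows
                ConnectionResetError,     # 10054 on Windows
                socket.timeout) as exc:
            last_exc = exc
            if attempt < attempts - 1:
                sleep(base_delay * 2 ** attempt)  # exponential backoff
    raise last_exc
```

A client might call connect_with_retry("dbhost", 3050), where "dbhost" is a
placeholder and 3050 is the default Firebird port. Errors that are not in the
caught list still propagate immediately.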

- Bud

Bud Millwood
Weird Solutions, Inc.
http://www.weird-solutions.com
tel: +46 8 758 3700
fax: +46 8 758 3687