firebird-support - Interbase 7.x crashing randomly

Subject	Interbase 7.x crashing randomly
Author	Jason
Post date	2011-08-15T10:26:09Z

Hi All - long time since I have posted here, been lurking though :-).
After reading this you may prefer I crawl back under my rock.
I have a number of sites running on various versions of IB 7.x. We
get a fair array of performance problems and admins letting servers run
out of space, not backing up, etc., but we have a couple of sites that
have gnarly problems and one that I want to ask about. Server = MS
Windows Server 2003.
The server abends (-1) every so often, e.g. it will go weeks without
failing and then suddenly abend a couple of times. It seems to come in
waves and I can't seem to see a pattern.
We have another problem as there are 2 levels of admins, those onsite
who can do some things on the server and others who are offsite who seem
to pull levers as they see fit (e.g. AV / backup). So I am looking for
causes and diagnostic paths to fix. This is an end of life legacy
system, so no massive budgets to re-write software / replace with a
different version of IB/FB. I am fairly well versed in setting up /
maintaining and supporting IB/FB, but am scratching my head with it at
the moment.
Usage pattern:
- 20 - 50 concurrent users
- very few users at night, but it is a 24 x 7 x 29 operation (they back up and restore once a month)
- Some bots reading (automated users polling for changes / populating 3rd party DBs and kit)
- Some bots writing (need to confirm this though)
Failure pattern:
- not proportionate to usage, e.g. a lot of failures appear to happen in the middle of the night
- tend to cluster, e.g. 2 or 3 in a couple of hours
- some white outs (when the server has not crashed, but is so slow to respond that users get admins to restart the service)
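
One way to check the "comes in waves" and "middle of the night" impressions is to group the recorded crash times into clusters and count them by hour of day. The following is a minimal Python sketch; the timestamps are made up for illustration, and the two-hour window is an assumption taken from the "2 or 3 in a couple of hours" description.

```python
from collections import Counter
from datetime import datetime, timedelta


def cluster_crashes(times, window=timedelta(hours=2)):
    """Group crash times; a new cluster starts when the gap
    to the previous crash is larger than `window`."""
    clusters = []
    for t in sorted(times):
        if clusters and t - clusters[-1][-1] <= window:
            clusters[-1].append(t)
        else:
            clusters.append([t])
    return clusters


def crashes_by_hour(times):
    """Count crashes per hour of day (0-23)."""
    return Counter(t.hour for t in times)


# Hypothetical crash times, for illustration only.
crashes = [
    datetime(2011, 7, 1, 2, 10),
    datetime(2011, 7, 1, 3, 5),
    datetime(2011, 7, 1, 3, 50),
    datetime(2011, 7, 20, 14, 0),
]

clusters = cluster_crashes(crashes)
by_hour = crashes_by_hour(crashes)
for c in clusters:
    print(c[0], "->", c[-1], "crashes:", len(c))
```

If most clusters start at similar times of day, that would point towards a scheduled job (backup, AV scan, a bot) rather than user load.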
What we have tried- excluding AV from DB folder / IB exe folder / temp
folder- ensuring enough space on drives- exluding backup from dbfolder

Gut feel is that it is a non-ib process accessing a file IB cares about
whilst it is accessing it (gdb). The trouble is we don't control the
whole server. By looking at the event log for the login events we had
identified a 95% correlation between a system user (backupUserTest of
all users) logging in and the system failing. Consensus was that this
may be caused by a rogue process / virus, it seemed to log on the split
second the crash happened. This could also have been an effect and not
the cause (e.g. the process crashes and a Dr Watson type process is created
and authenticated to do some housekeeping).
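
The cause-or-effect question can be narrowed down a little by looking at the sign of the time difference between each crash and the nearest backupUserTest login. The sketch below assumes the crash and login times have already been exported from the event log into Python datetime values; the sample values are made up, and the 5-second tolerance is an arbitrary assumption.

```python
from bisect import bisect_left
from datetime import datetime


def nearest_login_offsets(crashes, logins):
    """For each crash, return (nearest login - crash) in seconds.
    Negative: the login came before the crash. Positive: after it."""
    logins = sorted(logins)
    offsets = []
    for c in crashes:
        i = bisect_left(logins, c)
        # Only the logins on either side of the crash can be nearest.
        candidates = logins[max(i - 1, 0):i + 1]
        if not candidates:
            offsets.append(None)
            continue
        nearest = min(candidates, key=lambda l: abs(l - c))
        offsets.append((nearest - c).total_seconds())
    return offsets


def fraction_within(offsets, seconds=5):
    """Share of crashes that have a login within `seconds` of them."""
    if not offsets:
        return 0.0
    hits = [o for o in offsets if o is not None and abs(o) <= seconds]
    return len(hits) / len(offsets)


# Hypothetical data, for illustration only.
crashes = [datetime(2011, 8, 1, 10, 0, 0), datetime(2011, 8, 1, 12, 0, 0)]
logins = [
    datetime(2011, 8, 1, 9, 59, 59),
    datetime(2011, 8, 1, 12, 0, 1),
    datetime(2011, 8, 1, 15, 0, 0),
]

offsets = nearest_login_offsets(crashes, logins)
print(offsets, fraction_within(offsets))
```

Logins that consistently land just after the crash would fit the "effect" theory; logins that consistently land just before it would fit the "cause" theory. The resolution of the logged timestamps limits how far this can be pushed.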
Next was HW, it was running on very old kit, so we VM'd it and put it on
a reliable box. Still it has crashes, not sure about the frequency.
I had wanted to use sysinternals to monitor the GDB file to see if
anyone else messed with it, but it was felt that this would be too much
of a performance hit, but maybe they will reconsider this now.
Now I am working on a sysinternals command to scrape stats about:
1) Who is logged in
2) monitor temp folder
3) Get snapshot about process / cpu / disk queue / RAM usage etc
4) filemonitor on the temp folder and the gdb file
The idea is to have these logged to files that are deleted every few
hours (I would imagine that they would grow pretty quickly). When the
server stops, a command could be run to rename all the log files, so
that the last period prior to the crash is kept.
Does this seem sensible? Any pointers on tools / techniques which may
enhance this?
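
The rolling-log idea above could be sketched roughly as follows in Python. This only shows the rotation and "freeze on crash" logic; collecting the actual snapshot text (e.g. captured output of whichever sysinternals tools end up being used) is left out. The function names, the file naming scheme, and the one-hour / four-file defaults are all assumptions made for illustration.

```python
import os
import time


def log_path(log_dir, now, period):
    """One log file per time bucket of `period` seconds."""
    bucket = int(now // period)
    return os.path.join(log_dir, "snap-%012d.log" % bucket)


def prune(log_dir, keep):
    """Delete all but the newest `keep` rolling log files."""
    files = sorted(f for f in os.listdir(log_dir) if f.startswith("snap-"))
    for old in files[:-keep]:
        os.remove(os.path.join(log_dir, old))


def append_snapshot(log_dir, text, now=None, period=3600, keep=4):
    """Append one timestamped snapshot line, then prune old buckets."""
    now = time.time() if now is None else now
    os.makedirs(log_dir, exist_ok=True)
    stamp = time.strftime("%Y-%m-%d %H:%M:%S", time.localtime(now))
    with open(log_path(log_dir, now, period), "a") as f:
        f.write("%s %s\n" % (stamp, text))
    prune(log_dir, keep)


def freeze(log_dir, label):
    """Rename the current rolling logs so pruning no longer touches them.
    Intended to be run right after a crash or forced restart."""
    frozen = []
    for name in sorted(os.listdir(log_dir)):
        if name.startswith("snap-"):
            new_name = "crash-%s-%s" % (label, name)
            os.rename(os.path.join(log_dir, name),
                      os.path.join(log_dir, new_name))
            frozen.append(new_name)
    return frozen
```

A scheduled task would call append_snapshot every minute or so, and freeze would be called (by hand or from the service restart script) with a label such as the date. Because only a fixed number of buckets is kept, disk usage stays bounded over the weeks between failures.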
Also, wanted to use a SQL monitor to scrape all TCP/IP interaction, but
some of the very old robot processes are DOS / java / delphi and may
have local connections and also run from this same machine. I look at
robots being a potential source of the problem as they are active all
the time. I didn't see IB's performance monitoring as an option, as the
chance of catching it at the point of failure would be very small.
Finally, to exclude all external influence I had thought of:
1) Setting a new OS user for IB
2) Give perms to this user to the IB progs / gdb path / backup path / external tables path / dll path / temp folder
3) remove all other users (including system) from the gdb path to begin with
Does this seem feasible / sensible?
As crashes are intermittent (e.g. up to 3 weeks between failures),
whatever I put in place will need to run for a long time and not require
a lot of daily management.
Sorry for the long waffle, but we have been struggling with this one
for a long time. Normally I can completely isolate the servers, even
put them behind a physical NAT router with specific ports open to remove
all third party variables, but this server sits in some data centre
somewhere.
