Subject Baffling gbak hangups
Author Steve Friedl
Hello all,

I support some line-of-business software that uses Firebird, and out of
dozens of Linux database servers running 2.0.4 Classic (64bit), *one*
of them has sporadic hangups during backup.

This software keeps data in multiple GDB files with identical schemas (one per customer),
and every night a Perl script runs a raft of commands like the following to back the
production data up to a backup area:

gbak -backup \
-transportable \
-limbo \
-garbage \
-ignore \
-user SYSDBA \
-pass blah \
localhost:/software/FILE_118.gdb \
/backup/FILE_118.gbk

Out of all the db servers I support, with tens of thousands of GDB files to back up in this
same manner every day, I've never seen this happen before even once. But on this server,
it's happening consistently - about 4 nights out of 5.

And it's a different GDB file every night!

Things that had no apparent effect:

* Rebooting
* Running a separate backup/restore on all the files
* Upgrading to Firebird 2.0.5 Classic (64bit)

All RPM installers at all my customers came directly from the Firebird site.

Machine in question:

* IBM System x3650
* dual quad core 3.2GHz
* 20G RAM
* 64-bit CentOS 5.4
* Stripped down install: no X11, only purpose is to serve DB.
* No users log in routinely (application itself runs elsewhere)
* Not a hint of anything in any log

This is a common configuration among my customers.

During the hang, neither the gbak nor the associated fb_inet_server processes are chewing
any CPU time at all - they just sit there.

Now for the odd part: If I send a SIGSTOP followed by SIGCONT to the gbak process,
it picks up and finishes the backup in short order.

During the just-sitting-there time, the output file looks like this:

-rw-r--r-- 1 root root 9699328 Jul 2 21:17 /backup/FILE_118.gbk

(send STOP + CONT)

A few seconds later the file is finished:

-rw-r--r-- 1 root root 14055424 Jul 3 23:54 /backup/FILE_118.gbk

If I restore the file, it appears to be intact, so the backup looks like
it's doing the right job.
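
For anyone wanting to reproduce the nudge, it amounts to something like the
sketch below. The function name `nudge` and the one-second pause are
illustrative assumptions, not part of the actual backup script; the PID has
to be looked up separately (for example with ps).

```shell
#!/bin/sh
# nudge PID
# Send SIGSTOP, wait briefly, then send SIGCONT to the given process.
# Hypothetical helper: the name and the 1-second pause are arbitrary.
nudge() {
    pid=$1
    kill -STOP "$pid" || return 1   # fails if the PID does not exist
    sleep 1
    kill -CONT "$pid"
}
```

Usage would be `nudge 12345`, where 12345 stands in for the PID of the
stuck gbak. This only describes the workaround; it says nothing about why
the process stalls in the first place.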

I only found this SIGSTOP/SIGCONT thing because I tried running "strace"
on the processes, to see what they were doing. It didn't show anything,
but I found the gbak was in a STOPPED state - this is a documented side
effect of strace sometimes - and I found that restarting it afterwards
let it finish.
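
One way to check the state of a process without attaching strace is to read
it from /proc. This is a minimal sketch, assuming a Linux /proc filesystem;
the helper name `proc_state` is made up for illustration.

```shell
#!/bin/sh
# proc_state PID
# Print the one-letter state the kernel reports for PID, taken from the
# "State:" line of /proc/PID/status (Linux only).
# S = sleeping, R = running, D = uninterruptible wait, T = stopped.
proc_state() {
    awk '/^State:/ { print $2 }' "/proc/$1/status"
}
```

A process that has been sent SIGSTOP shows up as T here, while one that is
merely idle, waiting on something, normally shows as S.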

I'm just at a loss to figure out how to debug this, and hope that somebody
else has some pointers. I'm comfortable with source-level development
and digging into Linux as deep as necessary.

Thanks!

Steve

---
Stephen J Friedl | Security Consultant | UNIX Wizard | 714 694-0494
steve@... | Orange County, CA | Microsoft MVP | unixwiz.net