Subject Re: [firebird-support] Intermittent Database Corruption
Author Helen Borrie
At 10:43 PM 4/05/2004 +0000, you wrote:
>Hello all;
>
>My company develops and supports an application that is in use by more
>than 300 clients around the US. All clients use InterBase 6.0.2 or
>FireBird 1.0.3 as the DB server. As we improve and maintain our
>system, we deploy various changes to both the applications and the
>database metadata with each release. A small number of our clients
>who happen to be all on the same (recent) version of our software and
>database metadata, but some of whom are running IB 6.0.2 and others of
>whom are running FB 1.0.3, are now having difficulty running gbak
>successfully. The backup for these sites tends to hang while backing
>up data for a particular table (EMAIL_ACCOUNTS). Gfix will sometimes
>report record and/or page errors, but mending, backing up and
>restoring do not fix the problem for long, if at all. Most often, no
>validation errors at all are reported by gfix.

One thing you could look at is whether the scripts you are using to deploy
metadata updates are a) being run without exclusive access to the database
and b) applying DML to metadata that have been changed but not yet
committed. Either or both -- the two problems often go hand in hand --
are common enough causes of "soft" corruption, i.e. invalid data being
written to tables because data were altered while structures were in an
unresolved state. So, while gfix won't necessarily find structural
corruption, gbak will bounce invalid data written during this transitional
state.
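As a rough sketch of the safe pattern (the database name, credentials and
column are invented here, not taken from your schema): take the database
offline, hard-commit each DDL change, and only then run DML against the
changed structure.

```sql
/* Hypothetical deployment script, run via isql AFTER taking the
   database offline, e.g. with:
     gfix -shut -force 0 mydb.gdb -user SYSDBA -password masterkey  */

CONNECT 'mydb.gdb' USER 'SYSDBA' PASSWORD 'masterkey';

/* 1. DDL first */
ALTER TABLE EMAIL_ACCOUNTS ADD ACCT_STATUS CHAR(1);

/* 2. Hard-commit the metadata change BEFORE any DML touches it */
COMMIT;

/* 3. Only now apply DML to the changed structure */
UPDATE EMAIL_ACCOUNTS SET ACCT_STATUS = 'A' WHERE ACCT_STATUS IS NULL;
COMMIT;
```

Afterwards, gfix -online brings the database back for normal use.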


>We have also been able to get the backup to complete successfully by
>deleting certain constraints temporarily. This change to metadata
>seems to allow the backup to complete within a reasonable, nearly
>normal time. Again, such repairs are short lived, as the database
>performance will again degrade and backups will hang, even when the
>database is restored into a new DB file.

If you are getting chronic performance degradation, it's a fairly good
indication that the regular background housekeeping is being inhibited,
usually by repeated DML hits on the same rows by multiple transactions,
without hard commits being performed. (Hard commit = COMMIT is explicitly
called, allowing cleanup of old record versions; soft commit = COMMIT
RETAINING called repeatedly with no periodic hard commit, which prevents
that housekeeping from being done.)
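To illustrate the difference (the table and column here are invented, and
whether you write COMMIT RETAIN in SQL or call something like
isc_commit_retaining depends on your interface):

```sql
/* Hard commit: ends the transaction, so the old record versions this
   update created become eligible for garbage collection. */
UPDATE EMAIL_ACCOUNTS SET ACCT_STATUS = 'A' WHERE ACCT_ID = 1;
COMMIT;

/* Soft commit: makes the change durable but keeps the transaction (and
   its snapshot) open.  Done repeatedly with no occasional hard COMMIT,
   it pins the "oldest active" transaction and blocks cleanup of old
   record versions. */
UPDATE EMAIL_ACCOUNTS SET ACCT_STATUS = 'B' WHERE ACCT_ID = 1;
COMMIT RETAIN;
```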


>When the backup operation has "hung" (gbak is running, but no
>progress is apparent) we often see that the backup file is not being
>written to, but that the GDB file is being updated. This seems
>peculiar and is something we have not previously been aware of; the
>backup file is always more recently updated than the database after a
>backup. This may go on for hours.

It's "normal". Backup isn't simply a matter of writing a backup
file. Gbak does housekeeping, i.e. explicit garbage collection, as part of
its task. If you have a huge accumulation of garbage that hasn't been
available for GC for the reasons outlined above, then gbak will try to deal
with it. You can switch off GC during backup by using the -g switch (for
"No garbage collection"). You'd typically do that when you are planning to
resume operations on a restored database. However, if you are going to
carry on with the same database file, you clearly need to have GC
happening, since your applications are inhibiting it.
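For example (file names and credentials are placeholders):

```
# Backup with garbage collection switched off (-g): gbak only reads
# and writes, leaving old record versions in place.  Appropriate when
# you plan to replace the database with the restore.
gbak -b -g -user SYSDBA -password masterkey mydb.gdb mydb.gbk

# Restore into a NEW file and switch operations to it; the restored
# database starts with no accumulated garbage at all.
gbak -c -user SYSDBA -password masterkey mydb.gbk mydb_new.gdb
```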

>Normal backups typically take
>minutes. The affected clients' databases range in DB file sizes
>from about 280 MB up through 2.9 GB.

If you can't improve the transaction handling being done by your
applications, then you are going to need to implement some strategy to
mitigate the effects of neglecting GC. This, of course, involves
participation by your site admins, which isn't always desirable. However,
having SYSDBA or the db owner run a manual sweep on a shut-down database,
before backups and when performance degrades to a critical level, will
ensure that gbak focuses mostly on backing up and doesn't have to consume
major resources doing huge volumes of GC.
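A sketch of that maintenance cycle (names are placeholders; the switches
are those of the IB 6 / Fb 1.0 command-line tools):

```
# 1. Shut the database down so no new attachments interfere
gfix -shut -force 0 mydb.gdb -user SYSDBA -password masterkey

# 2. Sweep: clears out the garbage record versions that ordinary
#    cooperative GC has been unable to reach
gfix -sweep mydb.gdb -user SYSDBA -password masterkey

# 3. Bring it back online and take the backup, which should now spend
#    its time backing up rather than collecting garbage
gfix -online mydb.gdb -user SYSDBA -password masterkey
gbak -b -user SYSDBA -password masterkey mydb.gdb mydb.gbk
```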


>Does anyone have any suggestions or experience with a similar problem?
>Is there any information about the sort of metadata updates that
>may cause DB corruption (other than those done with an in-use DB)?

Well, it's a "given" with the multi-generation architecture that performing
DML on uncommitted DDL is a recipe for Bad Stuff Happening.


>Can we somehow run the FB engine in a debug mode so we get more
>information in the server log about what the server is doing?

I don't think there are debug builds available for Fb 1.0.x, but there are
for Fb 1.5.

/heLen