Subject FB 1.5 - Continuing database corruption issues
Author Bob Murdoch
I am running FB 1.5 classic (version WI-V1.5.1.4417) on a Windows 2003
server with dual Xeon processors and hyperthreading enabled. The
database is Dialect 1, and is approximately 35GB. We run with Forced
Writes turned on. All access to the database is through an application
server with pooled connections, an ETL process with multiple threads
making separate connections, and pooled ODBC connections from a web
server. No direct user connections are made with the exception of two
administrators. The total number of connections is usually between 10
and 15, with a peak of not more than 30 at any one time.

We have been having periodic problems with corruption. In almost all
recent cases, the corruption is discovered by an error message
received by a client process. The nature of the message is "wrong
page type - expected 7, found 5". To give you an idea of how often
the error occurs, here are the last few dates - Nov 16, Oct 22, Oct
17, and Jul 17. Prior to that, we had an issue maybe once per year.
The database has been in use for five or six years now, across three
different machines. The current machine has been in use for two
years.

In the past, we have taken the database down, run gfix -mend, and
proceeded with a backup/restore cycle. This entire process takes at
least 12 hours, and of course means at least a full day of downtime
for all systems.
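
For anyone curious about the mechanics, the mend step is roughly the
following (the path and credentials are placeholders, and the exact
switches vary per incident):

  gfix -user SYSDBA -password ****** -mend -full -ignore C:\data\prod.fdb

followed by a full gbak backup of the mended file and a restore into a
fresh database, which is what stretches the whole cycle past 12 hours
on a 35GB file.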

We have recently gotten a little smarter about the process, and have
found that deactivating and reactivating all user-defined indexes on
the problem tables seems to put everything back in working order until
we can do the backup/restore/replace. Of course, we don't know which
tables have the problem until we run into it, but that is still better
than taking the database down for the two hours it takes gfix to run.
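
For what it's worth, the manual fix amounts to statements like these,
run from isql against the live database (the index name is just an
example):

  ALTER INDEX IDX_SALES_DIVISION INACTIVE;
  COMMIT;
  ALTER INDEX IDX_SALES_DIVISION ACTIVE;
  COMMIT;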

A little background on the db and its uses - we do a backup *and*
restore every night. The restored databases are not moved into
production, but are kept as a safety net for a week. There have been
occasions where the backup was successful but the restore was not, so
we have decided to play it safe and run both.
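
The nightly job boils down to something like this (names, paths, and
credentials are placeholders):

  gbak -b -v -user SYSDBA -password ****** C:\data\prod.fdb D:\backup\prod.fbk
  gbak -c -v -user SYSDBA -password ****** D:\backup\prod.fbk E:\safety\prod_check.fdb

with the restored copy kept for a week as the safety net mentioned
above.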

Most of the data in the system comes from ETL-type processes that run
each day. Our client has six different operating divisions, and the
source data is handled separately. In some cases, we delete the
previous data for a division prior to loading the current data (all in
one transaction), and in others we simply append the new data to the
table.
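
The "replace" case is essentially the following, all committed as one
transaction (table and column names here are invented for
illustration):

  DELETE FROM SALES_FACT WHERE DIVISION_ID = 3;
  /* the division's new rows are then bulk-inserted and the whole
     load is committed as a single transaction */
  INSERT INTO SALES_FACT (DIVISION_ID, AMOUNT) VALUES (3, 125.00);
  COMMIT;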

One of the tables that consistently has these index problems has
anywhere from 15k to 70k rows replaced each day across the six
divisions. Looking back through our logs, we find no correlation
between ETL failures and the subsequent corruption. This is probably
the largest table in which we replace data; all the others are at most
half its size.

A colleague has suggested adding the "alter index inactive/active"
process to our nightly scripts, thereby cleaning up possible issues
before the backup/restore runs. I certainly don't want to do this for
all tables, and I am not convinced it is good practice even for the
one table mentioned above.
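
If we did go that route, I imagine we would generate the statements
from the system tables, something like (the table name is a
placeholder):

  SELECT 'ALTER INDEX ' || RDB$INDEX_NAME || ' INACTIVE;'
    FROM RDB$INDICES
   WHERE RDB$RELATION_NAME = 'SALES_FACT'
     AND COALESCE(RDB$SYSTEM_FLAG, 0) = 0;

feed the output through isql, and run a matching ACTIVE pass
afterwards.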

As you can see, I'm at a bit of a loss to explain the problem,
especially the frequency. The customer's system admins have run
diagnostics on the machine and have turned up nothing.

Any thoughts on how to prevent this situation are welcome - from
changes to our administrative processes to changes in our handling of
ETL jobs.

Thank you,

Bob M..