Subject: Re: FB 1.5 - Continuing database corruption issues
Author: Greg Kay
Hi,

We're running Firebird Classic 1.5.3 and have a 25 GB database with
about a million transactions per day. Occasionally we hit the corrupt
index problem you mentioned. We pick it up in our nightly
backup/restore process when the backup step fails. Usually we can
identify the index by running a verbose backup and checking where the
backup fails. Once we know which index it is, we delete it, run our
backup/restore process, then put the index back.
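
For what it's worth, the commands we use for this look roughly like the
following (file, user, and index names are only placeholders):

  gbak -b -v -user SYSDBA -password ****** maindb.fdb maindb.fbk

  /* in isql, once the verbose output shows which index the backup
     chokes on: */
  DROP INDEX IDX_ORDERS_CUSTOMER;

  /* after the backup/restore has gone through cleanly: */
  CREATE INDEX IDX_ORDERS_CUSTOMER ON ORDERS (CUSTOMER_ID);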

Just out of interest, is this problem likely to occur in Firebird 2.0?

--- In firebird-support@yahoogroups.com, "Bob Murdoch" <mailgroups@...>
wrote:
>
> I am running FB 1.5 classic (version WI-V1.5.1.4417) on a Windows 2003
> server with dual Xeon processors and hyperthreading enabled. The
> database is Dialect 1, and is approximately 35GB. We run with Forced
> Writes turned on. All access to the database is through an application
> server with pooled connections, an ETL process with multiple threads
> making separate connections, and pooled ODBC connections from a web
> server. No direct user connections are made with the exception of two
> administrators. The total number of connections is usually between 10
> and 15, with a peak of not more than 30 at any one time.
>
> We have been having periodic problems with corruption. In almost all
> recent cases, the corruption is discovered by an error message
> received by a client process: "wrong page type - expected 7, found 5".
> To give you an idea of how often
> the error occurs, here are the last few dates - Nov 16, Oct 22, Oct
> 17, and Jul 17. Prior to that, we had an issue maybe once per year.
> The database has been in use for five or six years now, across three
> different machines. The current machine has been in use for two
> years.
>
> In the past, we have taken down the database, run gfix -mend, and
> proceeded with a backup/restore cycle. This entire process takes at
> least 12 hours, and of course means at least a full day of downtime
> for all systems.
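>
> For reference, the commands for that cycle are roughly along these
> lines (database and file names here are just placeholders):
>
>   gfix -mend -full -ignore -user SYSDBA -password ****** maindb.fdb
>   gbak -b -v -user SYSDBA -password ****** maindb.fdb maindb.fbk
>   gbak -c -v -user SYSDBA -password ****** maindb.fbk maindb_new.fdb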
>
> We have recently gotten a little smarter about the process, and have
> found that deactivating/reactivating all user-defined indexes on the
> problem tables seems to put everything back in working order until we
> can do the backup/restore/replace. Of course, the problem is that we
> don't know which tables have the problem until we run into it, but that
> is better than taking the db down for the two hours it takes for gfix
> to run.
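>
> For a problem table this is just a pair of statements per index, for
> example (table and index names are only illustrative):
>
>   ALTER INDEX IDX_SALES_DIVISION INACTIVE;
>   ALTER INDEX IDX_SALES_DIVISION ACTIVE;
>
> Setting the index back to active forces it to be rebuilt from the
> table data, which is what clears the damage.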
>
> A little background on the db and its uses - we do a backup *and*
> restore every night. The restored databases are not moved into
> production, but are kept as a safety net for a week. There have been
> occasions where the backup was successful but the restore was not, so
> we have decided to play it safe and run both.
>
> Most of the data in the system comes from ETL-type processes that run
> each day. Our client has six different operating divisions, and the
> source data is handled separately. In some cases, we delete the
> previous data for a division prior to loading the current data (all in
> one transaction), and in others we simply append the new data to the
> table.
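>
> In rough SQL terms, a replace-style load for one division is a single
> transaction along these lines (table and column names are made up for
> illustration):
>
>   DELETE FROM SALES_FACT WHERE DIVISION_ID = :div;
>   INSERT INTO SALES_FACT (DIVISION_ID, SALE_DATE, AMOUNT)
>     VALUES (:div, :sale_date, :amount);  /* repeated for each new row */
>   COMMIT;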
>
> One of the tables that consistently has these index problems has
> anywhere from 15k to 70k rows replaced each day across the six divisions.
> Looking back through our logs, we do not find any coincidence between
> failures of the ETL and resulting corruption. This table is probably
> the largest table that we replace data in. All others are quite a bit
> smaller, at least by half.
>
> A colleague has suggested adding the "alter index
> inactive/active" step to our nightly scripts, thereby cleaning up
> possible issues before doing the backup/restore. I certainly don't
> want to do this for all tables, and I am not convinced that this is a
> good practice to follow for even the one table mentioned above.
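>
> If we did automate it, the statements could probably be generated from
> the system tables rather than hard-coded - an untested sketch, with the
> table name just an example:
>
>   SELECT 'ALTER INDEX ' || RDB$INDEX_NAME || ' INACTIVE;'
>   FROM RDB$INDICES
>   WHERE RDB$RELATION_NAME = 'SALES_FACT'
>     AND (RDB$SYSTEM_FLAG = 0 OR RDB$SYSTEM_FLAG IS NULL);
>
> with a matching ACTIVE pass afterwards, and skipping any indexes that
> back primary or foreign key constraints, since those cannot be
> deactivated.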
>
> As you can see, I'm at a bit of a loss to explain the problem,
> especially the frequency. The customer's system admins have run
> diagnostics on the machine and have turned up nothing.
>
> Any thoughts on how to prevent this situation are welcome - from
> changes to our administrative processes to changes in our handling of
> ETL jobs.
>
> Thank you,
>
> Bob M.
>