Subject Re: [firebird-support] How to locate and delete records connected to a bad sectors ?
Author Ann Harrison
Stephane,

Starting with your subject, you can't delete records in bad sectors. You
can drop tables that have data pages in bad sectors because those pages
aren't touched when the table is dropped. When dropping a table, Firebird
reads the RDB$PAGES table to find the pointer pages and index headers for
the table. It then reads the upper levels of the index to find the page
numbers for the lowest level and resets the bits on the page inventory
pages (PIPs) to indicate that those pages are free, and recurses a level to
release the next level up. When the indexes are gone, it reads the pointer
pages, setting the bits on the various PIPs to reflect that the data pages
are free. So, if the bad pages are index leaf or data, dropping the table
works fine. However, to delete a record, Firebird has to read that record,
create a deleted record stub, store the stub on the target page and, if
there's not room for both the record and the stub on that page, move the
record elsewhere. So if a data page isn't readable, you can't delete
records on it.
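
If it helps to see what the engine is consulting when it drops a table, here is
a little sketch (Python with the fdb driver; the server path, credentials and
the page number are placeholders) that asks RDB$PAGES which table a given page
belongs to. Keep in mind that RDB$PAGES lists pointer pages and index root
pages, not ordinary data pages, so a bad data page won't show up there.

    # Sketch: map a page number to its table via RDB$PAGES, the same metadata
    # the engine reads when dropping a table.  The DSN, user, password and
    # page number are placeholders; only pointer pages and index root pages
    # are registered here, plain data pages are not.
    import fdb

    con = fdb.connect(dsn='server:/path/to/database.fdb',
                      user='SYSDBA', password='masterkey')
    cur = con.cursor()
    cur.execute("""
        SELECT p.RDB$PAGE_NUMBER, p.RDB$PAGE_TYPE, r.RDB$RELATION_NAME
        FROM RDB$PAGES p
        JOIN RDB$RELATIONS r ON r.RDB$RELATION_ID = p.RDB$RELATION_ID
        WHERE p.RDB$PAGE_NUMBER = ?
    """, (7456994,))
    for page_number, page_type, relation_name in cur.fetchall():
        print(page_number, page_type, relation_name.strip())
    con.close()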


> On our database (FB 2.5) of 150 GB we have 21 bad sectors of 512 bytes
> (all in the same block). Because of that, backup/restore does not work (even
> with the -ignore option). And gfix does not help us much, because it freezes
> and crashes the fb_inet_server (but that was also the case before the bad
> sector problem, so I guess it's a bug in gfix).
>
> Now I need to know which tables/records are involved in the 21 bad sectors
> and how to delete them. I already located one table involved (and
> successfully dropped it; it was not a very important table), but there is
> also another table involved that I need and can't drop. So I located the
> exact record IDs in error (by doing selects until the error appeared) and
> tried to delete them, but without success, because the engine answered:
> Statement failed, SQLSTATE = XX001
> database file appears corrupt ()
> -bad checksum
> -checksum error on database page 7456994
>
> What are my options now?
>

The easy thing is to spend a little bit of money and buy FBDataGuard.

>
> I am thinking of manually updating the checksum (to 12345) on database page
> 7456994 (and the page type too), but I hope there is a more convenient way
> than hex editing my database?
>

Making the checksum 12345 on a page that's corrupt is certainly going to
lead to bad things. The first error you'll probably see is "Bad page type.
Expected 5 encountered 497" or something like that. So you go back with
your hex editor and fix the page header so it looks more or less like a
data page. But the data page is structured. At the top is an array of
length/offset words, and on a garbage page those will be random values, with
lengths and offsets that exceed the page size.
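
For the curious, this is more or less what that hand inspection looks like
(Python, reading the file directly; the field offsets are my recollection of
the 2.5 on-disk structure and may not be exact, so treat them as assumptions):

    # Sketch only: peek at a raw page to see how little of it is sane.
    # PAGE_SIZE, PAGE_NUMBER and the file path are placeholders; the offsets
    # are my reading of the FB 2.5 ODS and may be off for other versions.
    import struct

    PAGE_SIZE = 8192
    PAGE_NUMBER = 7456994

    with open('/path/to/database.fdb', 'rb') as f:
        f.seek(PAGE_NUMBER * PAGE_SIZE)
        page = f.read(PAGE_SIZE)

    # Generic page header: type, flags, checksum, generation.
    pag_type, pag_flags, pag_checksum, pag_generation = \
        struct.unpack_from('<BBHI', page, 0)
    print('type', pag_type, 'checksum', pag_checksum, 'generation', pag_generation)

    # On a data page (type 5) the index of offset/length word pairs is assumed
    # to start right after a 24-byte data page header, with the slot count at
    # offset 22.  On a garbage page the count itself may be nonsense, so cap it.
    dpg_count = struct.unpack_from('<H', page, 22)[0]
    for slot in range(min(dpg_count, (PAGE_SIZE - 24) // 4)):
        offset, length = struct.unpack_from('<HH', page, 24 + 4 * slot)
        if offset + length > PAGE_SIZE:
            print('slot', slot, 'points past the end of the page - garbage')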

If something had just lightly screwed up the pages - say, a bad byte written
here and there - you might be able to unscramble it. But a bad sector is
garbage. You can't make it back into roast duck.


>
> ...So, they said, OK, 12345 is different from zero, so we'll still catch
> the same errors, but we won't waste time computing something that never
> fails."
>
> That means the engine reports an error only if the checksum is different
> from 12345 (and I imagine the checksum is written only in the page header,
> not in every 512-byte disk sector). As my page size is 8 KB and my bad
> sectors are only 512 bytes each, what happens if the page header is not on
> a bad sector (so the page will return a good checksum) BUT the page content
> contains a bad sector??
>

Right. The checksum is written only in the page header and it is possible
to have a bad sector that doesn't include the first 512 bytes of the page.
The errors you'd see if the latter part of a page were corrupt depend on
the page type.

If it's a data page, the most likely result will be a "decompression
exceeded buffer length" error - I may have forgotten the exact syntax of
the error. The first part of the page is an index to the record versions
on the page. We'll assume that's part of the first 512 bytes, so it's OK, but
the locations it points to are not. Every record version consists of a
record header that has the format version, the id of the transaction that
created it, some bits that say whether this is a fragmented record, and a
pointer to the next older version. That stuff is pretty dense, so another
likely error is "unknown format version" - if, say, the format version in
the garbage pretending to be a record header were 233 and the latest format
for the table was 12. The data is run-length compressed, and expanding
garbage with that algorithm often leads to a huge amount of expanded garbage.
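
If you want to see why, here's a sketch of that run-length scheme as I
remember it - a signed control byte: positive means copy that many literal
bytes, negative means repeat the next byte that many times. Treat the details
as approximate; the arithmetic is the point:

    # Sketch of the record run-length expansion, with the sanity check that
    # produces the "exceeded buffer length" style error.  The exact on-disk
    # details are from memory and may differ slightly.
    def expand(compressed: bytes, expected_length: int) -> bytes:
        out = bytearray()
        i = 0
        while i < len(compressed):
            n = int.from_bytes(compressed[i:i + 1], 'little', signed=True)
            i += 1
            if n > 0:                       # literal run: copy the next n bytes
                out += compressed[i:i + n]
                i += n
            elif n < 0:                     # repeat the next byte -n times
                out += compressed[i:i + 1] * (-n)
                i += 1
            if len(out) > expected_length:  # garbage routinely trips this
                raise ValueError('decompression exceeded buffer length')
        return bytes(out)

    # A single random control byte of -128 already expands to 128 copies of
    # whatever byte happens to follow it.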

If it's a pointer page that's got a garbage sector, you'll see errors like
"attempt to read beyond end of file" - meaning that the bytes which should
be a page number look like something huge - or "wrong page type", meaning
the value was in range but random. With extremely bad luck, you might get
a page that's actually a data page in the table you're reading, but not the
one it should be.
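
A toy illustration of why one random 32-bit value produces either of those
errors (the page count and the garbage bytes here are made up):

    # Illustration only: a 150 GB database with 8 KB pages has roughly
    # 19.6 million pages, so most random 32-bit values point far past the
    # end of the file; the rest land on pages of the wrong type.
    import struct

    pages_in_file = 150 * 1024 ** 3 // 8192     # about 19,660,800 pages

    garbage = b'\x12\xf3\x9a\xce'               # four random bytes posing as a page number
    page_no = struct.unpack('<I', garbage)[0]

    if page_no >= pages_in_file:
        print('attempt to read beyond end of file')
    else:
        print('page', page_no, 'exists, but is almost certainly the wrong page type')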

If it's a PIP that's bad... well, Firebird trusts the bits in the PIP to
say that a page is free or in use.... and random bits look just as good as
real bits.

But returning to your actual problem: if you don't want to buy FBDataGuard
and dump and reload your whole database, you might see if IBSurgeon can
remove page 7456994 from the pointer page. It's vaguely possible that the
people at IBSurgeon could find and reconstruct records from the good
sectors of the bad page - it's time-consuming and fiddly work, but since
the format describes what the records are supposed to look like, an expert
can find the sections that decompress into plausible record data. It depends
on the value of the data to you.

Good luck,

Ann

