Subject: Re: [firebird-support] Database corruption...
Author: Ann W. Harrison
At 11:25 AM 3/31/2004, Jonathan Neve wrote:

>Could anyone please give me a detailed explanation (or perhaps a paper)
>about the way FireBird stores data?

I've written quite a lot about the on-disk structure
and the mechanisms used to manage it.

> This may sound a bit vast, but what
>I'm interested in specifically, is understanding how, when an update is
>performed, it is written to the database, and how exactly the
>multigenerational system works.

Unh, let's see. Updates are written to disk through the disk controller and
disk head. The multi-generational system works well... Ok, more detail.

The database engine reads pages, fixed length blocks from the database file
and stores page images in the database cache. When a transaction looks at
a record, the engine locates the record on a page image and rehydrates it
into a record buffer. Records are compressed on disk. The lower parts of
the database worry about pages. The higher parts worry about record
buffers. The transaction makes whatever changes it wants in the record
buffer. Then the higher parts of the engine perform required trigger
actions and constraint checking on the record buffer. When all the
"before action" triggers and constraints have been evaluated, the engine's
"forebrain" turns the result over to the engine's reptilian brain
to store. This is the part you're interested in.
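
Since records are compressed on disk and rehydrated into record buffers,
here's a toy sketch of that round trip. The run-length scheme below is
purely illustrative; it is not Firebird's actual compression format.

```python
def compress(data: bytes) -> bytes:
    """Toy run-length compression: (count, byte) pairs for each run.
    Illustrative only - not Firebird's real signed-count RLE."""
    out = bytearray()
    i = 0
    while i < len(data):
        j = i
        # extend the run while the byte repeats (count fits in one byte)
        while j < len(data) and data[j] == data[i] and j - i < 255:
            j += 1
        out += bytes([j - i, data[i]])
        i = j
    return bytes(out)

def rehydrate(packed: bytes) -> bytes:
    """Expand the packed image back into a full record buffer."""
    out = bytearray()
    for k in range(0, len(packed), 2):
        out += bytes([packed[k + 1]]) * packed[k]
    return bytes(out)
```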

Every record version has a record header which contains the id of the
transaction that created the version, some flags, and the format number.
(Damn it, yet another new concept: when you change the number or type of
fields in a table, the table gets a new format number. The database
maintains definitions of all the formats a table has had. When you read a
record, the engine checks the format and, if necessary, converts it to the
newest format.) Returning to our moutons: the final element in a normal
record header is the pointer to the next version back.
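
A sketch of that header as a Python dataclass. The field names, flag
values, and layout here are made up for illustration; they are not the
real on-disk format.

```python
from dataclasses import dataclass
from typing import Optional, Tuple

FLAG_DELTA = 0x01       # the back pointer targets a delta, not a full copy
FLAG_FRAGMENTED = 0x02  # the version continues on another page

@dataclass
class RecordHeader:
    """Illustrative record version header."""
    transaction_id: int                             # transaction that wrote this version
    flags: int = 0
    format_number: int = 1                          # table format in effect at write time
    back_pointer: Optional[Tuple[int, int]] = None  # (page, line) of next older version
```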

The engine compares the new record version with the old, to determine
whether the old version should be stored in its entirety, or can be stored
as a delta. A delta is the information necessary to transmute the new
record version into the next older version. For example, skip the first 40
bytes, change the next three to "ABC", skip the next 150 bytes. Much
shorter than writing the whole 193 bytes.
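
Here's a minimal sketch of that delta idea, assuming for simplicity that
the old and new versions have the same length (real records can also grow
or shrink). The op encoding is invented for illustration.

```python
def make_delta(new: bytes, old: bytes) -> list:
    """Encode the old version as edits against the new one.
    Ops are ('skip', n) or ('replace', bytes).
    Assumes equal lengths, which real records need not have."""
    ops, i = [], 0
    while i < len(new):
        j = i
        while j < len(new) and new[j] == old[j]:   # matching run: just skip it
            j += 1
        if j > i:
            ops.append(('skip', j - i))
        i = j
        j = i
        while j < len(new) and new[j] != old[j]:   # differing run: keep old bytes
            j += 1
        if j > i:
            ops.append(('replace', old[i:j]))
        i = j
    return ops

def apply_delta(new: bytes, ops) -> bytes:
    """Transmute the new version back into the older one."""
    out, i = bytearray(), 0
    for op, arg in ops:
        if op == 'skip':
            out += new[i:i + arg]
            i += arg
        else:
            out += arg
            i += len(arg)
    return bytes(out)
```

Run it on the example above: 40 unchanged bytes, three changed to "ABC",
150 unchanged - three small ops instead of 193 bytes.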

The new record version MUST go on the same page that held the previous
version. If there's enough space on that page for both versions and the
old version is a delta, the system writes the delta first and makes a
page-line index entry for it, then writes the new record on the page and
makes the original page-line entry point to the new record.

What, you say, is a page-line index? A data page consists of a page header
that says what type of page it is and other generic stuff, plus how many
records it holds, what table it belongs to, and some other data-page
specific stuff. After the header a data page contains an array of
offset/length pairs (aka the page-line index) that correspond to the records
on the page. The actual data starts at the bottom and grows upward. When the
data section bumps into the page line index, the page is full.
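
A toy model of that layout: the line index grows down from the header,
record data fills up from the bottom, and the page is full when they
meet. Sizes and structure are illustrative, not the real ODS.

```python
class DataPage:
    """Illustrative data page: slots at the top, data at the bottom."""
    HEADER = 24   # made-up header size
    SLOT = 4      # one offset/length pair in the page-line index

    def __init__(self, size=4096):
        self.size = size
        self.slots = []          # page-line index: (offset, length) pairs
        self.data_low = size     # lowest offset currently used by record data

    def free_space(self):
        # room left between the index (plus one prospective slot) and the data
        index_end = self.HEADER + self.SLOT * (len(self.slots) + 1)
        return self.data_low - index_end

    def store(self, record: bytes):
        """Return a line number, or None when index and data would collide."""
        if len(record) > self.free_space():
            return None
        self.data_low -= len(record)
        self.slots.append((self.data_low, len(record)))
        return len(self.slots) - 1
```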

Page line indexes separate the physical location of data on page from the
information necessary to locate it. A record's identifier, the thing stored
in indexes, consists of the page number, expressed as the sequence
number of the corresponding pointer page, the offset on the pointer page
that holds the data page number, and the offset in the page-line index that
points to the record.
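
A sketch of packing those three parts into one identifier. The field
widths here are invented for illustration and are not Firebird's actual
db_key encoding.

```python
# Illustrative field widths, not the real packing.
LINE_BITS = 8     # slot in the page-line index
OFFSET_BITS = 12  # offset on the pointer page holding the data page number

def encode_rid(pp_sequence: int, pp_offset: int, line: int) -> int:
    """Pack (pointer page sequence, offset on it, line index slot)."""
    return (pp_sequence << (OFFSET_BITS + LINE_BITS)) | (pp_offset << LINE_BITS) | line

def decode_rid(rid: int):
    """Unpack a record identifier back into its three parts."""
    line = rid & ((1 << LINE_BITS) - 1)
    pp_offset = (rid >> LINE_BITS) & ((1 << OFFSET_BITS) - 1)
    pp_sequence = rid >> (OFFSET_BITS + LINE_BITS)
    return pp_sequence, pp_offset, line
```

Note that nothing in the id says where on the data page the record bytes
live - that indirection is exactly what the page-line index buys you.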

Returning to our moutons, switching the page line numbers fixes up the page
so that the new version is first in line to be found from the index and all
is well.

Unless, of course, the back record doesn't fit. Then space must be found
on another page for the back version. The back version is written
there. That page is clean and can be written. Then the engine stuffs the
pointer to the back version into the record header for the new primary
record version, writes that record version to the page image, fixes up the
page-line entry and off you go.
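
The ordering matters more than the bookkeeping, so here is just its
skeleton, with a recording `write_page` callback standing in for the real
page writer (names illustrative):

```python
def move_back_version(write_page, back_page_no, primary_page_no):
    """Careful-write ordering when the old version moves off-page.
    The page receiving the back version is flushed before the page whose
    record header will point at it, so the on-disk image never contains
    a pointer to data that has not been written yet."""
    write_page(back_page_no)     # 1. back version stored; that page is clean
    write_page(primary_page_no)  # 2. new primary version, back pointer set

order = []
move_back_version(order.append, back_page_no=412, primary_page_no=97)
```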

Unless of course, even without the old record version, the page is too full
to hold the whole new version. Then the flags in the record header are set
to indicate that the record is fragmented and the record header is extended
to include a fragment pointer.

At no time during this whole process are there ever pointers to off-page
data that is not complete. At no time during the whole operation of the
database is there ever a case where a transaction must hold two dirty
pages. Never, ever, is a page written with incorrect pointers.

>If I perform a select, how does
>Interbase know which version of the record to show me, etc?

It keeps an array of the transaction ids your transaction can see and
picks the first record version with a transaction id in that range.
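
In sketch form - illustrative only, since a real engine also consults
transaction commit state, not just membership in a snapshot set:

```python
def visible_version(versions, visible_txns):
    """Walk the version chain, newest first, and return the first version
    whose creating transaction this transaction can see.
    `versions` is a list of (transaction_id, data) pairs."""
    for txn_id, data in versions:
        if txn_id in visible_txns:
            return data
    return None
```

So a transaction that started before transaction 90 committed simply
walks past the version 90 wrote and reads the one behind it.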

>The reason I would like to understand this more fully, is because I'm
>wondering how it would be possible to make FireBird 100%
>corruption-proof. In other words, how can we make every single disk
>operation atomic?

It is. That's what careful write is about. And we have it.

> Naturally, there is already a degree of atomicity due
>to the transaction system, but what would be good, would be to make
>every single write atomic; that is, if it gets cut off in the middle,
>have some flag somewhere, or something that can tell us that the write
>wasn't completed, and that it can therefore be treated as though nothing
>had happened at all.

That's the way it works, with the single exception that a partial page
write would be a problem. But those don't occur.

>For instance, if I perform several inserts into a table of my database,
>within the scope of my transaction, these inserts are atomically grouped
>together, and I either commit them all, or roll them all back.

Right. Database 101. ACID and all that.

>Therefore, if the client disconnects abruptly after the first insert,
>everything will get neatly rolled back. However, if, during one of the
>inserts, there's a power failure, the record will only partially be
>written to disk, resulting in an incorrect structure, worse than if
>nothing had been written at all. We need to have a way of telling that a
>certain operation was not completed successfully, and can therefore be

The worst that happens is that you get some orphaned back versions, not in
use. Or if the problem happens in some structural area, you may get an
orphaned page. Anything else is a bug. Plain, simple, ugly bug.

>Of course, I also know that there is an option called ForcedWrites,
>which, if it is used (which of course, it is on all my databases), makes
>the database perform synchronous writes instead of asynchronous writes.

Unless the database can determine the order in which pages are written,
careful write is impossible.

>This helps a lot of course, because it greatly reduces the likelihood of
>there being unwritten data pending in the event of an abrupt shutdown.
>However, it's not good enough in many cases. It's still theoretically
>possible to have corruption due to a power failure.

Oh yeah? Under whose theory? (Ann, dear, calm yourself.) Where have you
found errors in the theory or practice of careful write?

> Also, a badly coded
>SP or trigger can make the database loop indefinitely. In such an
>event (I've had this happen to me), the only solution is to kill the
>server process (killing the client isn't enough).

Sure. The server dies, the server is reborn, long live the server. And
the database. (Ann, dear, you're really not doing much of a job of calming
yourself.)

> If the SP was
>performing database writes, and not mere reads, it's very likely to
>corrupt the database, for obvious reasons (this is what happened in my
>case). And of course, there are other ways of corrupting a DB as well.

Obvious to whom?

>Do you think it would be possible to design the system in such a way
>that every write is entirely atomic?

Yes. I even have an existence proof.

>And if so, would this entirely
>solve the problem, or am I missing something. It seems to me that in
>such a case, we could even afford to turn ForcedWrites off, without it
>affecting the stability of the database at all.

No. Inter-page dependencies are a fact of life. By avoiding loops in the
dependency graph, you can establish an order of page writes that will never
create a dangling dependency. But if you let the operating system choose
the order of page writes, all bets are off.
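
Finding that order is exactly a topological sort of the dependency graph.
A sketch, where each pair means "this page points at that page, so the
target must be written first"; page names are made up:

```python
from collections import defaultdict, deque

def write_order(dependencies):
    """dependencies: (pointing_page, target_page) pairs; pointing_page
    must be written AFTER target_page. Returns a safe write order, or
    None if the graph has a cycle - the situation careful write must
    prevent from ever arising."""
    after = defaultdict(set)    # page -> pages that must precede it
    before = defaultdict(set)   # page -> pages waiting on it
    pages = set()
    for src, dst in dependencies:
        after[src].add(dst)
        before[dst].add(src)
        pages |= {src, dst}
    ready = deque(p for p in pages if not after[p])   # no prerequisites
    order = []
    while ready:
        p = ready.popleft()
        order.append(p)
        for q in before[p]:     # p is durable; release pages waiting on it
            after[q].discard(p)
            if not after[q]:
                ready.append(q)
    return order if len(order) == len(pages) else None
```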

> Unfortunately however,
>my current understanding of the internal structure of the database is
>too limited to enable me to seriously speculate in this area...
>Any ideas?