Subject | Re: [firebird-support] Database corruption...
---|---
Author | Ann W. Harrison
Post date | 2004-04-01T17:33:29Z
At 02:25 AM 4/1/2004, Jonathan Neve wrote:
> >It is. That's what careful write is about. And we have it.
> >
>Sorry, but what exactly is a careful write? Is that what you were
>describing above?

"Careful write" is an architecture that follows a few reasonably simple
rules for on disk layout and some much less simple rules that govern the
order in which pages are written. The major disk layout rule is that no
structure that can be written on two pages can depend on two-way
pointers. For example, a record version has a pointer to the previous
version. The previous version does not have a pointer forward to its
parent. The page that contains the old version can be written first. If
the page that contains the parent is never written, the database is still
consistent, though a small amount of storage will be lost until reclaimed
with gfix or a backup/restore.
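As an illustration, that one-way-pointer rule can be modeled in a few lines. The page and disk structures below are invented for the sketch; they are not Firebird's actual on-disk layout.

```python
# Illustrative model of the one-way-pointer rule: a record version
# points back to its predecessor, and the predecessor never points
# forward, so the page holding the old version can safely reach disk
# first. These structures are invented, not Firebird's real ones.

class Disk:
    def __init__(self):
        self.pages = {}                     # page number -> page contents

    def write(self, number, contents):
        self.pages[number] = dict(contents)

def store_new_version(disk, parent_page, back_page):
    """Write the back-version page, then the page that points at it.

    A crash between the two writes leaves the back version as a mere
    orphan: nothing on disk references it, and the space is reclaimed
    by gfix or a backup/restore. The database stays consistent.
    """
    disk.write(back_page["number"], back_page["records"])
    # -- a crash here loses only unreferenced storage --
    disk.write(parent_page["number"], parent_page["records"])

back = {"number": 11, "records": {"rec_a_v1": "old data"}}
parent = {"number": 10,
          "records": {"rec_a_v2": ("new data", "back version -> page 11")}}
```

The point of the ordering is visible in the crash window: after the first write the disk holds only an unreferenced old version, never a pointer to a page that was not written.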
A counterexample is index nodes, which have pointers both left and right -
pathetic attempt at a diagram:

first page on level -> <- second page on level -> <- third page on level ->

The right hand pointers are reliable. If you follow the right hand pointer
from the first node on a level to the second, the page returned will be the
next page in that level of the index. The left hand pointers are not
reliable. If you follow a left hand pointer from the third page on a level
back to the second, you may get the second page, or you may get some other
page. There is code to handle the case where a left hand pointer is not valid.
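A sketch of that recovery logic, assuming invented page structures (Firebird's index code is considerably more involved): trust the right-hand chain, use it to validate a left pointer, and fall back to walking the chain when validation fails.

```python
# Illustrative handling of unreliable left-hand sibling pointers on one
# B-tree level (invented structures, not Firebird's index code).

def left_sibling(pages, leftmost, current):
    """Find the page to the left of `current` on a level.

    Right pointers are reliable; a left pointer is trusted only after
    checking that the candidate's right pointer leads back to `current`.
    """
    candidate = pages[current]["left"]
    if candidate is not None and pages[candidate]["right"] == current:
        return candidate                    # left pointer was valid
    # Stale left pointer: recover via the reliable right-hand chain.
    page = leftmost
    while page is not None and pages[page]["right"] != current:
        page = pages[page]["right"]
    return page

# Three pages on a level; page 3's left pointer is stale (points at 9).
level = {
    1: {"left": None, "right": 2},
    2: {"left": 1, "right": 3},
    3: {"left": 9, "right": None},
    9: {"left": None, "right": 7},          # unrelated pages
    7: {"left": None, "right": None},
}
```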
Operationally, "careful write" is a question of maintaining a dependency
graph among pages in cache. When changes to a page could cause a loop in
the graph, writing some number of pages will break the loop. Those pages
must be written before the original change can be made. When a page must
be written, all the pages it depends on must be written first.
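That bookkeeping can be sketched as a tiny recursive flush. The structures here are invented; the real cache manager tracks much more state.

```python
# Minimal sketch of write precedence: each dirty page lists the pages
# that must reach disk before it, and flushing a page first flushes
# everything it depends on. Invented structures, not Firebird's.

def flush(page, depends_on, disk_order, flushed):
    """Append `page` to disk_order after all of its precedents."""
    if page in flushed:
        return
    for precedent in depends_on.get(page, ()):
        flush(precedent, depends_on, disk_order, flushed)
    disk_order.append(page)
    flushed.add(page)

# C depends on B, and B depends on A: flushing C writes A, then B, then C.
deps = {"C": ["B"], "B": ["A"]}
order = []
flush("C", deps, order, set())
```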
> Besides, FireBird isn't 100% corruption-proof, so...

That's the design intention. Bugs and asynchronous writes are the exception.

> >The worst that happens is that you get some orphaned back versions, not in
> >use. Or if the problem happens in some structural area, you may get an
> >orphaned page. Anything else is a bug. Plain, simple, ugly bug.
> >
>Well, in that case, what is corruption due to? Why can the database
>sometimes get corrupted upon power failure or system crash? It may not
>happen often, and it may not be likely to happen, but it does happen...

The primary causes of corruption in InterBase databases were two bugs in
5.6. One caused the database to write over the front of its file when the
file size exceeded 4GB. Messy. The other allowed two connections to
access the same database file without locking against each other.
Every corrupt Firebird database I've seen was running with forced writes off.
Yes, there are ways to get "logical" corruption, like nulls in indexes that
disallow nulls and records that violate constraints. That's a different
issue, and one which we should take up seriously. Some of those problems
occur because the system doesn't validate existing data when adding a
constraint. Others are probably bugs. But at the physical level,
maintaining consistency is just a question of following the rules about on
disk pointers and maintaining the dependency graph correctly.
The reason I got upset last night - aside from a bunch of stuff that had
nothing to do with your message - is that I have written a lot about this
topic. It's available on IBPhoenix and in the archives of this
list. Consistency on disk is a major feature of Firebird/InterBase and has
been for going on 20 years. Having someone posit that it might be possible
and ask if anyone has considered it ruffles my fur.
>I'm not talking about the theory of "careful write", I'm talking about
>the practice of database corruption. I don't know exactly what it's due
>to, and I was just imagining what might be the cause. If the cause I
>suggested is not actually the real cause, that still doesn't mean that
>the problem doesn't exist.

Have you actually corrupted a Firebird database that had forced writes
on? Has anyone on the list done so? If so, you may have critical
information that could lead to finding and correcting a bug.
> >>...the only solution is to kill the
> >>server process (killing the client isn't enough).
> >
> >Sure. The server dies, the server is reborn, long live the server. And
> >the database.

Sorry. Firebird is designed to allow you to kill the server as brutally as
you can and still have a consistent database when you reconnect the disk to
something running Firebird. I've got a significant number of logs that
show the server dying horrible deaths, over and over. The guardian
restarts the server and the database comes up undamaged.
> >>If the SP was
> >>performing database writes, and not mere reads, it's very likely to
> >>corrupt the database, for obvious reasons (this is what happened in my
> >>case).
> >
>To me of course. If it's not obvious to you, then my assumptions must
>have been wrong. But that doesn't make the problem disappear. It just
>means that we don't know why it happened.

If you've got the database log from the time of the corruption, I'd like to
see it.
> >>... we could even afford to turn ForcedWrites off, without it
> >>affecting the stability of the database at all.
> >
> >No. Inter-page dependencies are a fact of life.

By which I mean, for example, that if the back version of a record must be
stored on a different page from the primary version, the page with the
primary record version depends on the page with the back version. There's
a pointer in the primary record to the back version. If the page with the
primary record version is written before the page with the back version,
someone could read that record, follow the pointer, and end up with garbage.
Let me give you another example.
When the engine needs to add a record to a table, it first checks whether
the current page has room. If not, it reads a pointer page (PPG) for that
table, looking for a page with space. If there's no page in the last PPG
with enough space for the record, it must create a new data page.
First the engine creates a data page (DPG) image in cache and puts the
record(s) it needs to store on that page.
It then reads a page inventory page (PIP) into cache, identifies a free
page number, and changes the PIP to show that the page is no longer free.
It then writes that page number to the pointer page (PPG) for the table
that it has in cache.
At this point, the engine has three related pages in cache, a PIP, a PPG,
and a DPG. The PIP must be written before the data page, or some other
thread will allocate the same page and overwrite the new data page. The
data page must be written before the pointer page, or some thread will read
the pointer page and try to follow its pointer to the not yet existing data
page. That's what I meant by interdependencies.
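To see why PIP, then DPG, then PPG is the safe order, here is a small simulation with invented structures: it "crashes" after each write and checks that the on-disk state never contains a pointer to a page that is missing or not marked as allocated.

```python
# Crash simulation for the allocation example (invented structures).
# The careful order PIP -> DPG -> PPG is consistent at every crash
# point; writing the PPG first leaves a dangling pointer on disk.

def consistent(disk):
    # Every data page the PPG references must exist on disk and be
    # marked as used in the PIP.
    for pg in disk["PPG"]:
        if pg not in disk["data_pages"]:
            return False
        if disk["PIP"].get(pg) != "used":
            return False
    return True

def run(write_order, crash_after):
    disk = {"PIP": {}, "PPG": [], "data_pages": {}}
    writes = {
        "PIP": lambda d: d["PIP"].update({25: "used"}),
        "DPG": lambda d: d["data_pages"].update({25: ["new record"]}),
        "PPG": lambda d: d["PPG"].append(25),
    }
    for i, page in enumerate(write_order):
        writes[page](disk)
        if i == crash_after:
            break                 # simulated power failure
    return consistent(disk)
```

This only models the dangling-pointer hazard; the double-allocation hazard (another thread reusing a page the PIP still shows as free) is the other reason the PIP must go first.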
Generally it doesn't matter what order those pages hit the oxide because
the file system cache has the current versions of the pages. If, however,
the system stops suddenly and must restart with only the information that
is stored on disk, the order of page writes becomes very important.
> >By avoiding loops in the
> >dependency graph, you can establish an order of page writes that will never
> >create a dangling dependency.

The dependency graph is the relationship between pages. A loop in the
graph occurs when two pages become dependent on each other. An example
might be two records with back versions. Let's assume that the primary
record version for record A is on page 10 and the primary record version
for record B is on page 11. When A is modified, the system finds that
there is not enough room for both A and its back version on page 10, but
there is room on page 11. The back version of A goes on page 11, making
page 10 dependent on page 11. In the absence of a commit or conflict,
there's no reason to write either page immediately. Later some
transaction tries to modify record B. Now there's no room on page 11, but
page 10 now has space. The back version of record B could go on page 10,
but that would make page 11 depend on page 10. That's a loop in the
dependency graph.
The solution is to write page 11 without the change to record B, so that
page 10 now points to a legitimate, on-disk version of page 11. Then the
cached versions of pages 10 and 11 can be modified and when convenient,
page 10 can be written, followed by page 11.
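Under the same invented bookkeeping, the loop-breaking step looks roughly like this: before recording an edge that would close a cycle, write the target page as it stands and drop the now-satisfied constraint.

```python
# Loop breaking in the dependency graph (invented structures).
# deps maps page -> set of pages that must be written before it.

def reaches(deps, src, dst):
    """True if dst is reachable from src along dependency edges."""
    stack, seen = [src], set()
    while stack:
        node = stack.pop()
        if node == dst:
            return True
        if node not in seen:
            seen.add(node)
            stack.extend(deps.get(node, ()))
    return False

def add_dependency(deps, page, on, disk_writes):
    """Record that `on` must be written before `page`.

    If `on` already depends on `page`, adding the edge would close a
    loop: first write `page` to disk as it stands, which satisfies the
    existing constraint so it can be dropped.
    """
    if reaches(deps, on, page):
        disk_writes.append(page)            # page 11, before B's change
        for d in deps.values():
            d.discard(page)                 # 10 -> 11 is now satisfied
    deps.setdefault(page, set()).add(on)

deps = {10: {11}}            # A's back version: page 10 depends on page 11
writes = []
add_dependency(deps, 11, 10, writes)   # B's back version would invert that
```

After the call, page 11 has been written once in its pre-change state, and the cached pages can then be modified and written in the order page 10, page 11, exactly as described above.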
> >But if you let the operating system choose
> >the order of page writes all bets are off.
> >
>Well, I guess my understanding of the structure is lacking, because I
>can't quite follow the above reasoning. I'll look up the documents you
>referred to, and perhaps I'll see things more clearly.

I hope I've made the problem with random write order more clear.

>Anyway, if the structure is such that it makes this sort of thing
>impossible, couldn't the structure be altered? :-)

I don't see any way to avoid inter-page dependencies...
Best,
Ann