Subject: Re: [firebird-support] careful-write documentation
Author: Ann W. Harrison
unordained wrote:
> Does anyone have precise documentation on the order and
> algorithm of careful-write?

Yes - it's the code, except in areas where the code is buggy.

> I've found stuff by Ann and Paul (Beach), but they're kind of vague on
> some points ...

I specialize in that.

> the write-ahead-log is disabled, in favor of another mechanism I -almost-
> understand (and even then, only barely for data pages -- what about
> writing to an index?).
>

There never was a write ahead log. At one point, there was an optional
after image journal which was slow and required somebody to baby-sit
the journal device. Only applications that required two-swing of the ax
reliability used the AIJ. During the Borland ages, a guru from Sybase
decided that the careful write plus non-deadlocking data structures was
theoretically unsound and wrote a write-ahead-log package. It had only
two problems. First, it was still one-swing of the ax, and second, he
could never make it work.

There is no bits-and-bytes level to careful write; it's all page-level
stuff. Reduced to its essence, the algorithm says that the database on
disk will always be internally consistent. More practically, that means
that when you write a page that references another page, that other page
must have been written previously in a state that supports your pointer.
Before you write a page that has a pointer from a record to its back
version on another page, that other page has to be on disk. Before you
write out a new data page, you must write out a version of a page
information page (PIP) that shows the page is in use. The new data page
has to be on disk, formatted and happy, before the table's pointer page
that references the new page can be written.

Inter-page relationships are handled in the code through a dependency
graph maintained by the cache manager. Before a page is written, the
cache manager checks the graph and writes out all pages that page
depends on. If a change will create a loop in the graph, as many pages
are written immediately as are necessary to avoid the loop.
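
To make that concrete, here is a rough sketch in Python of the idea - it
is not Firebird's actual code, and the names (Page, CacheManager,
add_dependency, flush) are invented for the illustration. A dependency
is recorded before the change that creates it is made to the page.

class Page:
    def __init__(self, number):
        self.number = number
        self.dirty = False
        self.depends_on = set()    # pages that must reach disk before this one

class CacheManager:
    def __init__(self, disk):
        self.disk = disk           # anything with a write(page) method

    def add_dependency(self, page, precedes):
        # Record that 'precedes' must be on disk before 'page' is written.
        # If the new edge would close a loop in the graph, write 'page'
        # (and whatever it already depends on) immediately, so the loop
        # never forms; only then record the new dependency.
        if self._depends_transitively(precedes, page):
            self.flush(page)
        page.depends_on.add(precedes)

    def flush(self, page):
        # Careful write: everything this page depends on goes out first.
        for other in list(page.depends_on):
            self.flush(other)
        page.depends_on.clear()
        if page.dirty:
            self.disk.write(page)
            page.dirty = False

    def _depends_transitively(self, start, target):
        # Is 'target' already a direct or indirect dependency of 'start'?
        stack, seen = [start], set()
        while stack:
            p = stack.pop()
            if p is target:
                return True
            if p not in seen:
                seen.add(p)
                stack.extend(p.depends_on)
        return False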

The tricky bits are identifying dependencies and avoiding the impossible
situation - well, those and keeping the system fast. Identifying
dependencies just requires meticulous coding. If you have to put a
record back version on a different page from the main version, the page
with the pointer has a dependency on the page with the back version. If
you allocate a new data page, that data page has a dependency on the PIP
that records whether the page is in use, and the pointer page that ties
the data page into the table has a dependency on the data page.
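
In terms of the hypothetical sketch above, those three dependencies
would be recorded something like this (the variable names are invented):

# A record's back version lands on a different page: the page holding
# the pointer must not be written before the back-version page.
cache.add_dependency(page_with_pointer, precedes=back_version_page)

# A new data page is allocated: it must not be written before the PIP
# that marks it in use, and the table's pointer page must not be written
# before the new data page.
cache.add_dependency(new_data_page, precedes=pip_page)
cache.add_dependency(pointer_page, precedes=new_data_page)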

The impossible situation is one where pages point to each other in a way
that can't be separated. Two pages can point to each other - you can
have a primary record version on page 124 with its back version on page
125 and a different record with its primary version on page 125 and a
back version on 124. The chances that the cache manager will find a
cycle in the dependency graph are high, and one page may need to be
written twice in the process of flushing the pages out, but it works.
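
In the sketch above, that mutual reference is exactly what the loop
check catches; p124 and p125 here stand for the two data pages:

# A record on page 124 keeps its back version on page 125 ...
cache.add_dependency(p124, precedes=p125)
# ... and a record on page 125 keeps its back version on page 124.
# Adding this edge would close a loop, so the sketch writes page 125
# immediately, in its pre-change state, and only then records the new
# dependency - which is why one page can end up written twice.
cache.add_dependency(p125, precedes=p124)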

If, on the other hand, you need a double-linked chain of pages - index
pages come to mind - then each page depends on the other and neither can
be written first. In fact, our index pages are double-linked, but the
reverse link (high to low in a descending index) is handled as
unreliable. It's used in recombining index pages from which values have
been removed, but not for backward data scans. The architecture group
is currently discussing ways to make the reverse link reliable enough
for retrieving data, but we haven't agreed on a solution.

For those who haven't spent an embarrassing part of their adult lives
worrying about on disk consistency and double-linked lists, let me try
to explain. Let's suppose that the leaf level of an index consists of
pages 124, 125, and 126 in that order. Each has a pointer to its left
and right neighbor. Now you want to add a new entry to the index. It
has to be put on page 125, but page 125 is full, so you need to add a
new page between 125 and 126.

Let's expand the example slightly. Assume that each index page can hold
only four values - instead of the hundreds that it actually holds. So,
page 124 holds A, B, D; page 125 holds F, H, J, L and page 126 holds N,
P, R. In my notation below, the page number is followed by the index
entries in parentheses. "<=>" is a double link; "<-" and "->" are single
backward and forward links. Links that skip over pages are drawn above
the line of index pages: "/----" is a backward link, "----\" is a
forward link, and "/----\" is a double link. This structure can be
scanned in either direction.

124 (A,B,D) <=> 125 (F,H,J,L) <=> 126 (N,P,R)

You want to store K.

The way the index code handles this sort of problem is the following
(a rough code sketch of the whole sequence appears after the final
diagram below):

1) Read the current PIP (page information page) to find a free
page - let's say it's 246
2) Change the PIP to reflect that page 246 is not available
3) Set up a dependency so that PIP will be written before the new
index page
4) Format the page as an index page
5) Copy half the index entries from the page that overflowed onto
the new page
6) Copy the pointer to the next index page from the page that
overflowed onto the new page
7) Make the new page point backward to the page that overflowed
8) Write the new page

At this point, page 125 still points forward to 126, which points
backward to 125. There are two copies of the index entries for J & L,
but that doesn't matter because there's no way to get to page 246 - it's
not in the upper index level yet and will be skipped by a scan,
regardless of direction.

                /-------------------------------\
124 (A,B,D) <=> 125 (F,H,J,L) <- 246 (J,K,L) -> 126 (N,P,R)

9) Fix the upper levels so they include the first value of the new
page. That change may cause an upper level page to overflow,
resulting in the same series of steps at that level, and so on
up to the top. If the very top page splits, the index gets a
new level.


Now, the upper level entry for J points to 246 rather than 125. Scans
still work because anything that starts lower than J will skip node 246
and anything higher than J will skip 125.

10) Remove the copied index entries from the page that overflowed
and change its forward pointer to point to the new page.

11) Write that page

At this point we have a structure that looks like this:

                /-----------------------------
124 (A,B,D) <=> 125 (F,H) <=> 246 (J,K,L) -> 126 (N,P,R)

A forward scan still works, but a backward scan that starts with N or
higher will never see the values J, K, and L. The code that handles
recombinations does a lot of sanity checking and quits when it sees a
problem. That strategy doesn't work for record retrievals.

12) Fix the back pointer on the page after the new page
to point to the new page.

Now the structure works again.

124 (A,B,D) <=> 125 (F,H) <=> 246 (J,K,L) <=> 126 (N,P,R)
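
For what it's worth, here is how those twelve steps might look as code
against the same hypothetical cache. allocate_page(),
format_as_index_page() and update_upper_level() are placeholders for the
PIP search in steps 1-2, the formatting in step 4 and the upper-level
insertion in step 9; the page fields (entries, next, prev) are likewise
invented for the sketch.

def split_index_page(cache, pip, full_page, new_value):
    new_page = allocate_page(pip)                 # steps 1-2: claim a free page on the PIP
    cache.add_dependency(new_page, precedes=pip)  # step 3: the PIP reaches disk first
    new_page.format_as_index_page()               # step 4

    half = len(full_page.entries) // 2            # steps 5-7: copy the upper half,
    new_page.entries = sorted(full_page.entries[half:] + [new_value])
    new_page.next = full_page.next                #   inherit the forward pointer,
    new_page.prev = full_page                     #   point back at the page that overflowed
    cache.flush(new_page)                         # step 8: the new page goes to disk

    update_upper_level(new_page)                  # step 9: may split the level above, and so on up

    del full_page.entries[half:]                  # step 10: drop the copied entries and
    full_page.next = new_page                     #   repoint the forward link
    cache.flush(full_page)                        # step 11

    new_page.next.prev = new_page                 # step 12: fix the right-hand neighbour's
    cache.flush(new_page.next)                    #   back pointer last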



There are a couple of unavoidable awkward situations that occur during
page allocation and release, and result in orphan pages and orphan back
versions. Orphans are wasted space but do not affect the integrity of
the database.

At the page level, gfix will sometimes report orphan pages after a crash.

If the system crashes after a page has been allocated and the PIP
written, but before the pointer page that makes the data page part of
the table has been written, that data page becomes an orphan. Note that
the data on that page is uncommitted because the change that commits a
transaction - writing the transaction inventory page with the
transaction marked committed - does not happen until all pages that were
created by the transaction have been written.

If the system crashes after a pointer page has been written, removing an
empty data page from a table, but before the PIP has been written to
reflect that the page is free, that data page also becomes an orphan.

If the system crashes in the middle of dropping an index or table, gfix
may find lots of orphan pages - a single write releases all the pages
that were part of the table or index, and that write must happen before
any of the PIPs can be changed.

Back versions must be written before the record that points to them and
cannot be removed until after the pointer to them has been cleared. A
crash between those steps makes the back version an orphan - it occupies
space but is not connected to a record.
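
The creation side is the back-version dependency shown in the first
snippet above; the removal side, sketched the same way, would be
(clear_back_pointer() and release_back_version() are made up for the
illustration):

record_page.clear_back_pointer()          # clear the record's pointer to its back version
cache.flush(record_page)                  # get that page to disk ...
release_back_version(back_version_page)   # ... and only then give the space back; a crash
                                          # in between leaves an orphan back version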

Hope this helps some.


Regards,


Ann