Subject: Re: Super-transactions and Incremental Backup
Author: Jason Wharton
> Jason has proposed a scheme to support incremental backups. His
> proposal touches upon a fair number of important architectural
> mechanisms within the engine and toolset, and is worth examining
> in some detail. During the discussion I suspect several components
> of the proposal will drift, but let's stick with the various pieces
> separately and try to bring them back together at the end.

I'll answer these questions as best I can. Not knowing the sources and
internal architecture of InterBase, I am going on pure speculation and
assumption. Hopefully I'll be close enough to make this at least interesting
discussion material.

NOTE: Due to my time constraints this is going to be pretty rough reading.
I'll edit what I can for readability, but you may need to go through it a
few times before it clicks (if it does at all).

HINT: It doesn't follow a rock-solid consistent model either.

> In the interest of creativity, I'll try to limit my remarks
> to questions and elaborations on the current architecture and tricky
> problems. If we get mired, I may suggest a different direction.
> To improve the tractability of disaster recovery for large,
> essentially stable databases, Jason proposed a system involving:
> 1. A mechanism to create a stable freeze point within a database
> from which a backup may be done and post freeze point updates
> can be located.
> 2. An incremental backup utility to backup record versions created
> post freeze point.
> 3. An incremental restore utility to apply incremental backups
> to a database restored to a freeze point.
> The backup strategy would be as follows:
> 1. DBA creates freeze point.
> 2. DBA performs backup to freeze point.
> 3. DBA periodically takes incremental backups from freeze point to
> current state.
> 4. To recreate the database, DBA restores database from #2 to
> freeze point, then applies incremental backups.

Very well stated. It does appear that my intentions are understood.

> Question 1: Freeze Point Implementation
> Could you describe what information is kept internally to implement
> freeze points? Do keep in mind that update transactions may be in
> progress at the freeze point that may either commit, rollback, or
> enter limbo subsequent to the freeze point declaration.

I think of a freeze point as a special transaction that is flagged as a
milestone marker in the chain of transactions that the engine produces. If
other transactions are still unresolved at the time the freeze-point
transaction commits, then they will be considered post-freeze-point
material if they eventually commit.

The information that would need to be maintained is an additional
transaction state, plus some metadata stored elsewhere describing the
freeze point: its name, and whatever else is administratively useful, such
as the username, a timestamp, and a super-transaction ID of some sort, so
that it can be worked with administratively.
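As a rough illustration, a sketch under my own assumptions (these names and structures are my invention, not actual InterBase internals), the bookkeeping per freeze point might amount to little more than:

```python
from dataclasses import dataclass
from datetime import datetime

@dataclass
class FreezePoint:
    """Hypothetical administrative record for a named freeze point."""
    name: str             # DBA-assigned name for the freeze point
    super_txn_id: int     # ID of the special milestone transaction
    commit_seq: int       # position of that transaction in commit order
    created_by: str       # username that declared the freeze point
    created_at: datetime  # timestamp of the declaration

def is_post_freeze(commit_seq: int, fp: FreezePoint) -> bool:
    # A transaction still unresolved when the freeze point committed counts
    # as post-freeze-point material if it later commits, so the test is on
    # commit order, not on transaction-ID order.
    return commit_seq > fp.commit_seq
```

Everything else about the freeze point is ordinary transaction machinery; only the extra state flag and this small metadata record are new.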

As far as other information goes, it depends on what you want that
freeze point to be able to facilitate. If you want to support a
super-rollback, so that the database can roll back everything since a
freeze point was declared, or return to that point and run a query as if
you had gone back in time, then it becomes necessary to keep an extra
record version in the record version chain to preserve that snapshot of
the data.

(This doesn't seem necessary, since a restore would accomplish the same
thing, but let's entertain it for a little while anyway.)

There is no need to make an extra copy of pre-freeze-point records. When a
record is changed, a version needs to be left behind for reproducing the
database at that snapshot. Each record version would carry a link to its
transaction, if that isn't already the case, and the version should also be
flagged as a super-transaction placeholder so that it can be left alone by
the garbage collector.

The garbage collector would need to be made smart enough to leave intact
this one special record version (or one per named freeze point in the
chain), but it could consolidate records changed within one super
transaction and keep things nice and tidy wherever possible.

Where this gets ugly is:
I'm not sure how the data would be queried if the database is being read
at that snapshot level again. Will indexes also contain back records to
maintain optimized searches at the frozen point? Handling the indexes will
get rather ugly if an extra record version must be maintained for every
change to indexed data. I'm not sure how indexes are formed or how they
could avoid problems with duplicates and things like referential
integrity. As far as I can tell it would be too ugly to accomplish this.

I fully agree that it is very advantageous to keep the record versions as
clean as possible. This is a potential critical flaw in the whole model I am
proposing. Perhaps what I am proposing may be a vehicle, and in the end we
can flush the record versions out into a journal of sorts by a process
similar to garbage collection. We'll see...

I think it is only worth exploring the option where the freeze point is
there simply to maintain a baseline such that incremental backups can be
performed from that point forward. That's the mode I'll be in for the rest
of this discussion.

> Question 2: Garbage Collection
> I gather that the idea is to retain the newest pre-freeze point
> version of a record during garbage collection. How is that
> record identified? Keep in mind that that version will require
> special handling beyond just identification. Old record versions
> are usually stored as deltas from the succeeding version. In
> the case of the pre-freeze point version, the succeeding version
> will be volatile, so pre-freeze point version will need to be
> restored as a full record.

If these record versions have a transaction ID of some sort attached to
them, then it would be altered to reflect the most appropriate frozen
transaction. It may be necessary to enhance the record version to denote
that it is a frozen version, so that it will remain a full record and not
get changed into a BDR (back difference record).

Since the record would most likely be in its non-diff'd state, it could
remain that way and allow the tip of the record chain to proceed as normal.
The tip too would be in the non-diff'd state, and as subsequent updates take
place it would accumulate BDR versions subject to garbage collection as
usual.
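A toy sketch of what the garbage collector's extra duty might look like, simplified to treat transaction IDs as standing in for commit order (the names and dict layout are mine, not the engine's):

```python
def mark_frozen_survivor(chain, freeze_txn_id):
    """chain is one record's version chain, newest first; each entry notes
    the transaction that wrote it.  The newest version written at or before
    the freeze point is flagged as the survivor that must remain a full
    record (never be turned into a back difference record).  Returns its
    index, or None if every version is post-freeze."""
    for i, version in enumerate(chain):
        if version["txn"] <= freeze_txn_id:
            version["frozen"] = True
            return i
    return None
```

Versions newer than the survivor stay ordinary garbage-collection candidates; versions older than it can still be consolidated away.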

> To preserve database integrity for classic, all update algorithms
> must be careful write (meaning an object must be written before
> a pointer to it), and to avoid deadlocks, all on disk data structures
> must be traversed in a prescribed order with "handoffs" (object
> is fetched then pointer is released). In specific, record version
> chains must be traversed version by version, releasing page
> lock on the prior version. Because you can never back up directly,
> if you need the prior record, you must go back to pointer page,
> the data page, record version to record version, while dealing with
> the fact that somebody else may have intervened, changing almost
> anything, even doing the garbage collection for you. Needless to
> say, this makes garbage collection very, very tricky. The trick
> in this case will be to fetch the garbage collect chain, find the
> survivor (pre-freeze point version), replace the back pointer
> of the head record with the survivor, then finish index and blob
> garbage collection from the version chain. Poor Ann spent about
> three months getting the last of the race condition bugs out of
> VIO_backout, so prepare for a challenge.

That certainly does shed light on issues that will be challenging to deal
with.

Traversing this structure and reconstructing the differences to be applied
in their correct order when applying an incremental backup will very likely
be quite a challenge. I don't have a clear vision of how all of this could
be done. I do have some ideas that are lightly touched on below.

In short, it will be a mechanism to write record changes out in their
committed order. Thus, incremental backups will be a stream of deltas in
committed order at the record level, ignoring things like indexes: just
data and metadata.

> Question 3: Garbage Collection
> The engine currently does a garbage collection test on every record
> visit, but the test is dirt cheap: If a record is older than the
> oldest interesting transaction and has a back pointer, it's time
> to garbage collect. In your scheme, many non-garbage-collectable
> records will appear to be garbage collect candidates by the above
> test. The only way to find out whether a garbage collection cycle
> is required is to follow the back pointer to fetch the old version
> and get its transaction id. Whether or not it is collectable, the
> engine will need to re-fetch the head record by re-fetching the
> pointer page then the data page. These are virtually guaranteed
> to be in cache, but the overhead is significant. Do you have any
> ideas on how to avoid this retrieval performance hit?

I don't, other than perhaps some additional information being held in the
record version. For example, the garbage collection process could be
responsible for discovering and flagging records as "frozen". Then, future
attempts at garbage collection would recognize this and leave the record
alone. Thus, it would take a one-time hit but not be a total burden for
future passes.
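Sketched in the same spirit (field names invented for illustration), the cheap per-visit test would gain one short-circuit branch:

```python
def is_gc_candidate(record, oldest_interesting):
    """The engine's cheap per-visit test -- older than the oldest
    interesting transaction and carrying a back pointer -- extended with a
    one-time 'frozen' flag so a preserved pre-freeze version discovered on
    an earlier pass is never re-examined."""
    if record.get("frozen"):
        return False  # flagged on a previous pass; leave it alone
    return (record["txn"] < oldest_interesting
            and record["back_ptr"] is not None)
```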

If the frozen transaction is to be able to be rolled back, then this too
would need to be cleaned up, so you are back to the case where more
information in the transaction information area would need to be referenced
during garbage collection. This is another reason against trying to
preserve a freeze point for the purpose of querying a snapshot of the
database (and thus being able to roll back to it too).

> Question 4: Architecture (ok, almost a hint)
> Why do you need to reconstruct a freeze point? Would an alternative
> mechanism that identified post-freeze point transactions without
> impacting garbage collection be sufficient? After all, the incremental
> backup doesn't care about the world before the freeze point, just
> after.

It depends on the problem being solved. If a backup is done and the DBA
doesn't ever want, need, or care to have that snapshot view of the database
available within the production database file, then this is unnecessary. If
it is needed, the backup is simply restored and there it would be. As I
stated above, I'm in favor of the simplified approach of not requiring the
database to store the actual snapshot of the data.

> Question 5: Backup Utility
> Is this a totally new utility or a tweak to gbak? If gbak, which
> uses the published API, how does it tell the engine that it wants
> only incremental records? Do we need a new API, new request options,
> new BLR, new SQL, or what?

Ideally it should continue to be a tool that operates through the published
API. This means that I think the necessary features to accomplish this
should be extensions to the API.

For example, GBAK could be told to establish a freeze-point transaction when
doing its full backup. This could simply be an addition to the TPB
(transaction parameter block) carrying the pertinent information, such as
the name of the freeze point. Thus GBAK's request would map onto a TPB
extension; no change to an existing API call, and no new call, would be
necessary there.

Producing the incremental backup is an entirely different matter. You are
no longer simply looking at the entire database. Instead, the tool needs
special access to pull information out of the database and build up a
stream of table/record change commands that act as a replay of the
record-level alterations that have occurred since the specified freeze
point.

This will most likely affect things heavily. I believe that a new API call,
with supporting parameter options and some structs, would be necessary to
initiate a request to have the delta information streamed out of the
database. A standard format for the record deltas would need to be designed.
The call to open a delta-fetch request would need a parameter telling it
which frozen point to start from, and optionally which one to stop at (or
whether to go to the current tip). Then a fetch call would be made in a loop
until all of the deltas are streamed out.
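The shape of such a call, sketched as a generator over a toy in-memory structure (every name here is hypothetical, not part of any real InterBase API):

```python
def open_delta_stream(db, start_freeze_seq, end_freeze_seq=None):
    """Yield record-level deltas in committed order, starting after the
    freeze point at commit position start_freeze_seq and stopping at
    end_freeze_seq, or at the current tip when no end point is given."""
    for txn in db["committed"]:           # already in commit order
        if txn["seq"] <= start_freeze_seq:
            continue                      # pre-freeze: not our material
        if end_freeze_seq is not None and txn["seq"] > end_freeze_seq:
            break                         # past the requested stop point
        yield from txn["deltas"]

# Fetch-in-a-loop usage, as described above:
#   for delta in open_delta_stream(db, start_freeze_seq=100):
#       write_to_backup(delta)
```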

The reverse of this would be to take the deltas and apply them to a
restored frozen database. The request would be opened and then the deltas
would be streamed in and applied on the server to the database. The entire
super-transaction should be considered a single stream of data rather than
trying to isolate each individual transaction that was committed by users'
efforts. I suppose one could go after that level of granularity, but my
requirements wouldn't demand it. Perhaps both modes could be supported:
batch or granular.

This would mean enhancing things so that as each transaction with changes
to the data takes place, a chain of these transactions is stored up. It
would be necessary to store along with each transaction the "pages
affected", so that one could walk the transaction history and obtain the
data page pointers for all records that the transaction touched. This way,
a full database scan would not be necessary to produce an incremental
backup. It would just be necessary to go one transaction at a time and hit
the pages that were affected by it.

The format for keeping track of the data pages hit by a transaction would
need to be optimized to handle cases where a large number of inserts were
performed. For example, instead of storing an array of pointers
1,2,3,4,5,6,7 you could simplify this to 1-7. This way, while a transaction
is in progress it keeps a bitmap of which data pages are involved; upon
commit, it is added to the chain of committed transactions with edits, and
the bitmap of affected data pages is associated with it.
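The 1-7 idea is ordinary run-length compression of the page set. A minimal sketch:

```python
def compress_pages(pages):
    """Collapse a set of data-page numbers into inclusive (start, end)
    runs, e.g. {1,2,3,4,5,6,7} becomes [(1, 7)] -- far cheaper to store
    for a transaction that performed a large batch of inserts."""
    runs = []
    for p in sorted(pages):
        if runs and p == runs[-1][1] + 1:
            runs[-1] = (runs[-1][0], p)   # extend the current run
        else:
            runs.append((p, p))           # start a new run
    return runs
```

A transaction touching scattered pages degrades gracefully to one run per page, so the representation never costs much more than the plain list.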

This transaction history, with pointers to the data pages affected, could
be extracted and wadded up into a journal by a worker thread similar to the
garbage collection process. So, some of the concepts I am diving into here
could be looked upon as a way to give InterBase two phases of garbage
collection. One phase would work on the history of committed transactions,
follow the data pages affected, and roll those changes into a journal, so
that the subsequent real garbage collection can then come through and sweep
everything out of the system. Thus, a replayable log is generated without
much direct overhead for the user committing a transaction, finished up by
a system process (similar to garbage collection) bundling it all up in
committed order.

Redundantly stated:
The benefit I see to this is that the only additional overhead at commit
time is adding the transaction into the history chain (for transactions
with changes) and writing the bitmap of the data pages affected. It seems
(to a total novice) this should be easy to represent without causing too
much I/O. Then, in a background process, the work of getting everything
into a log can take place.

(hopefully I've been using the term 'bitmap' correctly)

Then, instead of the incremental backup having to really rattle the disk or
walk the entire database there will already be a stream of record-level
deltas in committed order to be applied to a base backup at the freeze point
specified. This stream of deltas would eventually end up containing the
named points and be where the actual administrative control would be
executed from.

By having this delta information all consolidated into a separate section
of the database (or perhaps in an external file), the database could keep
garbage collection in the "data work area" as clean as it presently is.

> Question 6: Backup/Restore Phantoms
> I think I understand how the backup utility gets records created
> or modified (though not how to tell which is which) post freeze-point,
> but I don't understand how it gets records deleted post freeze-point?

The delete would come through as a record-level delta specifying that the
record should be deleted.

This leads to the question of how it knows which record in the restored
base version to delete (the same goes for edits).

If the base backup is kicked off with a named freeze point, that is like
saying it is intended to have subsequent deltas applied to it. So, in this
case the full backup should be designed to populate records such that a
column is maintained for each record's previous DB_KEY value. This way, the
record's former DB_KEY will be available to match up with the incremental
backup's deltas as they are applied. Thus, the restore would be allowed to
restore a record to a new DB_KEY (physical location) and still maintain the
ability to synchronize with the source database via its DB_KEYs.

This way, if a restore was being performed to serve as a base for
incremental backups, the ability to marry together record chains would be
preserved, while at the same time allowing the flexibility to let the
records assume a different physical location in the new database.

Add to Q#1 above the requirement that a system column in the record
structure may be needed when restoring a base version that is to have
incremental backups applied to it.

> Question 7: Backup/Restore Limbo Transactions
> How does backup (and restore) handle transactions in limbo?

I presume you mean the case where a two-phase commit operation failed
partway through and there are limbo transaction records in the database? I
don't know. I think they should be preserved and restored as-is.

> Question 8: Restore Strategy
> How does restore work? Does it use the published API? How does
> restore know when to insert a record, modify a record, or delete
> a record?

Ideally it should use a published API.

As I mentioned above, it may be necessary to carry forward some additional
information about a record's prior DB_KEY while allowing it to be mapped
into a new DB_KEY location. This way the incremental backups applied would
be able to keep the association with the former physical layout of the
records and apply themselves to the base database.

An incremental restore would only apply record-level changes; all impacts
to indexes, etc. would be derived. There should be an option to tell
whether indexes should be deactivated and reactivated or just left live
during the incremental restore. It is just applying the changes as they
happened, so there will be some disk thrashing for sure.

The incremental backup would simply be a stream of record-level deltas,
each carrying the DB_KEY of the record from the source database. The delta
would indicate whether it was an add, delete, or edit. To make it efficient
to locate the previous DB_KEY, it may be necessary to maintain an index on
each table's prior-DB_KEY column so that applying the incremental backup
wouldn't take too long.

An insert would not have a prior DB_KEY value to look for but it would carry
the new DB_KEY from the source database and store it so that if an edit
comes in later it will still be able to be married up to it.

The delete and edit deltas would also come with the DB_KEY to marry up to
the former record and be applied. If there was a new DB_KEY for the record
in the source, then this would come along too, as in the case of an insert,
and be put in place so that future deltas would be able to marry up to it.
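Putting the three delta kinds together, applying one delta on the restore side might look roughly like this (a sketch only; "keymap" is the source-DB_KEY-to-local-DB_KEY mapping discussed above, and every name is illustrative):

```python
def apply_delta(table, keymap, delta):
    """Apply one record-level delta.  'table' maps local DB_KEYs to row
    data; 'keymap' maps source-database DB_KEYs to local ones so that
    later deltas can marry up with records placed by earlier ones."""
    op = delta["op"]
    if op == "ins":
        local_key = max(table, default=0) + 1   # new physical location
        table[local_key] = delta["row"]
        keymap[delta["dbkey"]] = local_key      # future edits marry up here
    elif op == "upd":
        table[keymap[delta["dbkey"]]] = delta["row"]
    elif op == "del":
        del table[keymap.pop(delta["dbkey"])]
    else:
        raise ValueError(f"unknown delta op: {op!r}")
```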

> Question 9: Restore Referential Integrity
> How does restore handle referential integrity checks when records
> have been backed up in physical order?

I'm not sure. My thinking has been to assume that the database has guarded
against any violations, and that a restore will be doing a
transaction-by-transaction, record-level replay of the changes, oldest
transaction to most recent. So, because I use committed order rather than
physical order, this shouldn't be an issue, or at least not a major one.

I sure hope this makes sense. Please pardon the rambling nature of this. I
have other applications I need to meet some deadlines on and this has taken
hours I don't really have...

Jason Wharton
CPS - Mesa AZ

----- Original Message -----
From: "Jim Starkey" <jas@...>
To: "Jason Wharton" <jwharton@...>
Sent: Monday, June 26, 2000 9:37 AM
Subject: Super-transactions and Incremental Backup

> Jim Starkey