firebird-architect - Group Commits

Subject	Group Commits
Author	Jim Starkey
Post date	2004-11-06T16:34:30Z

The gating factor in readonly Vulcan transaction performance is the
cyclical writing of the header and transaction inventory (tip) pages.
The superserver solution is sufficiently unpalatable that some
rethinking is in order, which is pushing me in the direction of group
commits. I'm sure other people have been thinking about this as well,
so I thought this would be a good time to exchange thoughts.

To recap the transaction order of battle:

1. A transaction is started by taking out a transaction id from the
database header page.
2. A transaction may make some updates. Let's call the pages
affected primary update pages.
3. A transaction may perform some garbage collection. Let's call the
pages affected secondary update pages.
4. A transaction is committed by first flushing all its dirty pages
(both primary and secondary) from the page cache, then updating
and writing the tip to reflect the new transaction state. Only
after the tip has been written (quote safe on oxide unquote) is the
user notified that the transaction is complete.

Dirty cache pages are currently associated with transactions through a
bit mask: each transaction that has dirtied the page since it was read
or last written sets the bit given by its transaction id modulo 32.
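
As a rough illustration of that mask, here is a minimal Python sketch.
The CachePage class and its method names are hypothetical stand-ins, not
the engine's actual structures; only the "transaction id modulo 32" rule
comes from the description above.

```python
MASK_BITS = 32  # transaction id modulo 32, as described above


def transaction_bit(transaction_id: int) -> int:
    """Return the single-bit mask for a transaction id (id modulo 32)."""
    return 1 << (transaction_id % MASK_BITS)


class CachePage:
    """Hypothetical stand-in for a page-cache buffer."""

    def __init__(self, page_number: int) -> None:
        self.page_number = page_number
        self.dirty_mask = 0  # empty when the page has just been read or written

    def mark_dirty(self, transaction_id: int) -> None:
        """Record that this transaction has dirtied the page."""
        self.dirty_mask |= transaction_bit(transaction_id)

    def dirtied_by(self, mask: int) -> bool:
        """True if any transaction in `mask` may have dirtied the page."""
        return (self.dirty_mask & mask) != 0

    def written(self) -> None:
        """Clear the mask once the page has been written."""
        self.dirty_mask = 0


page = CachePage(100)
page.mark_dirty(5)
page.mark_dirty(37)  # 37 % 32 == 5, so it shares a bit with transaction 5
page.mark_dirty(6)
```

Because ids that differ by a multiple of 32 share a bit, a flush keyed on
one transaction's bit can also pick up pages dirtied by another. In this
sketch that only costs an extra write; it never skips a page.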

For the purpose of analysis, we have to consider the state of the
database on disk if the engine is suddenly stopped at any point. We
assume that a page has been written by the operating system when it
returns from the write call. We know this isn't always true, but
there's little we can do about it.

For readonly transactions (transactions that produce no primary update
pages), there is no reason to force either the header or transaction
inventory pages to disk. They'll get written sooner or later (or maybe
never), but this doesn't make any difference. A readonly transaction
can safely piggyback on a later readwrite transaction. It would be
nice, however, to eventually flush any secondary update pages.

Readwrite transactions must write the header page before writing any
primary update pages, and must write all primary update pages before
writing the tip.
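
The ordering rule for a readwrite transaction can be expressed as a small
checker. This is only a sketch: the labels "header", "primary", and "tip"
and the function name are invented here, secondary update pages are not
modeled, and the write log is assumed to start after the transaction has
taken its id.

```python
from typing import Iterable


def order_is_safe(writes: Iterable[str]) -> bool:
    """Return True if a write sequence honors header -> primary -> tip.

    Each entry is "header", "primary", or "tip" (labels invented for
    this sketch).
    """
    header_written = False
    tip_written = False
    for kind in writes:
        if kind == "header":
            header_written = True
        elif kind == "primary":
            # A primary update page needs the header on disk first and
            # must never follow the tip write that commits the transaction.
            if not header_written or tip_written:
                return False
        elif kind == "tip":
            tip_written = True
        else:
            raise ValueError(f"unknown write kind: {kind!r}")
    return True


good = ["header", "primary", "primary", "tip"]
# Stopping the engine after any number of writes leaves a prefix on disk;
# every prefix of a safe sequence should itself be safe.
prefixes_safe = all(order_is_safe(good[:n]) for n in range(len(good) + 1))
```

A sequence with no primary writes at all passes trivially, which matches
the readonly case above.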

The obvious solution to group commits is a commit thread that
periodically wakes up, checks for transactions pending completion,
updates the header page if dirty, flushes dirty pages belonging to
transactions pending commit, then updates the tip. All pending
transactions are then committed. A plausible commit cycle would be four
or five times a second, but would, of course, be settable by
configuration file. Let's call it five times a second for discussion.

The scheme means that the header and tip pages would be written at most
5 times a second without regard to the number of transactions that
transpire. The scheme reduces both the header and tip hot spots as well
as any operational hot pages (index, generators, reads used as locks,
etc) by reducing page changes by many transactions to a single physical
write per page.
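
To put rough numbers on that, the sketch below compares forced header
and tip writes with and without grouping. It assumes, purely for
illustration, that an ungrouped transaction forces one header write and
one tip write; the function name and that cost model are not taken from
the engine.

```python
def forced_page_writes(transactions_per_second: int,
                       commit_cycles_per_second: int = 5) -> tuple:
    """Return (ungrouped, grouped) header-plus-tip writes per second."""
    # Assumption: without grouping, every transaction forces one header
    # write and one tip write.
    ungrouped = 2 * transactions_per_second
    # With grouping, each cycle writes the header and the tip at most
    # once, and a cycle with nothing pending writes nothing. This is an
    # upper bound, since the header is written only if dirty.
    cycles_with_work = min(transactions_per_second, commit_cycles_per_second)
    grouped = 2 * cycles_with_work
    return ungrouped, grouped


ungrouped, grouped = forced_page_writes(1000)
```

Under these assumptions, 1000 transactions a second drop from 2000 forced
header and tip writes to at most 10.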

The scheme seems to work equally well in superserver, classic, and mixed
environments.

The implementation would be a class CommitManager with a single instance
hanging off the Database object. TRA_commit, et al, would call the
commit manager to perform cache flushing and tip management. With group
commit turned on, a commit would add the transaction to a list, "or" the
transaction mask to a group flush mask, and wait for a wake up. When
the group commit thread wakes up, it writes the database header page if
necessary, uses the group flush mask to flush the cache, updates the
tip, then wakes up the snoozing transaction threads.
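
A minimal Python sketch of that flow follows. It is not Vulcan code: the
three callbacks stand in for the real header, cache, and tip operations,
and the method names (enqueue, run_cycle, run) are hypothetical. Only
the overall sequence is taken from the description above.

```python
import threading
from typing import Callable, List, Tuple


class CommitManager:
    """Sketch of a group-commit manager; one instance per database."""

    def __init__(self,
                 write_header_if_dirty: Callable[[], None],
                 flush_pages: Callable[[int], None],
                 write_tip: Callable[[List[int]], None],
                 interval: float = 0.2) -> None:  # five cycles a second
        self._write_header_if_dirty = write_header_if_dirty
        self._flush_pages = flush_pages
        self._write_tip = write_tip
        self._interval = interval
        self._lock = threading.Lock()
        self._pending: List[Tuple[int, threading.Event]] = []
        self._flush_mask = 0

    def enqueue(self, transaction_id: int) -> threading.Event:
        """Add a transaction to the pending list and "or" in its mask bit."""
        done = threading.Event()
        with self._lock:
            self._pending.append((transaction_id, done))
            self._flush_mask |= 1 << (transaction_id % 32)
        return done

    def commit(self, transaction_id: int, timeout: float = None) -> bool:
        """Enqueue, then snooze until the commit thread wakes us up."""
        return self.enqueue(transaction_id).wait(timeout)

    def run_cycle(self) -> int:
        """Commit everything pending; return the number of transactions."""
        with self._lock:
            batch, mask = self._pending, self._flush_mask
            self._pending, self._flush_mask = [], 0
        if not batch:
            return 0
        self._write_header_if_dirty()                # header first
        self._flush_pages(mask)                      # then the dirty pages
        self._write_tip([tid for tid, _ in batch])   # tip last
        for _, done in batch:
            done.set()                               # wake the waiters
        return len(batch)

    def run(self, stop: threading.Event) -> None:
        """Body of the commit thread: one cycle per interval until stopped."""
        while not stop.wait(self._interval):
            self.run_cycle()


log = []
manager = CommitManager(
    write_header_if_dirty=lambda: log.append("header"),
    flush_pages=lambda mask: log.append(("flush", mask)),
    write_tip=lambda ids: log.append(("tip", ids)),
)
waiters = [manager.enqueue(tid) for tid in (1, 2, 3)]
manager.run_cycle()
```

The pending list and mask are swapped out under the lock and the writes
happen outside it, so transactions arriving during a cycle simply join
the next group. That is a choice made for this sketch, not something the
description prescribes.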

It seems so simple and obvious I can't imagine why it hasn't been done.
Have I missed something?
