firebird-architect - Disaster Recovery Strategy Requirements (1st draft)

Subject	Disaster Recovery Strategy Requirements (1st draft)
Author	Jim Starkey
Post date	2000-06-19T15:02:05Z

1. Problem Statement

A fundamental rationale for database systems is to maintain the
integrity of data in a hostile and inherently unreliable world.
Correspondingly, database management systems provide protection
from transactional anomalies with a commit/rollback mechanism and
from hardware and system failures (including correct but erroneous
commands) through backup and restore systems, shadow volumes, and
redundant disk systems.

The original InterBase backup and restore utility, gbak, uses the
published database API to create a self-describing file containing
both database meta-data and table images. Gbak is robust but expensive:
Backup requires a complete pass over a database; restore creates a
new database, populates it, and recreates all indexes.

Database shadowing was added in version 3 to permit maintenance of
a hot backup copy of a database. Shadowing is a good solution for
protection against a disk failure, but offers no protection against
either operational error (correct but erroneous commands) or progressive
internal corruption.

Large and very large databases are poorly served by either gbak or
the shadowing mechanism. Very large databases are simply too big
to gbak frequently and take too long to restore following a disaster.
Shadowing is at best a partial solution, leaving a database vulnerable
to progressive corruption and catastrophic operational error.

Very large database systems are generally archival in nature and
therefore relatively stable. Portions of the database may be
volatile but the bulk of the database changes slowly and then
generally by extension rather than modification. These
characteristics provide opportunity for alternative disaster
recovery strategies to gbak and shadowing.

2. Requirements for a Disaster Recovery Strategy

A disaster recovery strategy for InterBase must:

a. Protect against hardware failure (disk, controller, CPU, memory),
operational error (correct but erroneous commands), and
progressive internal corruption induced by either bugs
or transient hardware failures.

b. Allow a DBA to intelligently trade off post-disaster recovery
time against backup maintenance resource utilization.

c. Support recovery without loss of committed data following
a complete hardware failure.

d. Support recovery without loss of committed data to the
point of onset of an operational error, software error, or
corrupting transient hardware error.

e. Support recovery without loss of committed data from any
single detected point of failure.

f. Support incremental backup (or equivalent) at a resource
cost in proportion with the volatility of the database.

g. Allow transfer of database images to suitable archival
medium.

h. Support 24x7 availability of databases.

3. Technology Discussion

Clearly, some form of incremental backup facility is required. There
are, happily, many alternatives.

Traditional operating system incremental backups operate at file
level granularity to streaming devices to create discrete incremental
backup files. Recovery is achieved by first restoring a complete
backup file then applying, in order, incremental backups. The cost
of complete recovery is relatively high (the systems do allow
recovery of individual files). At the time these systems were
designed, removable streaming media (tapes and the like) were
vastly cheaper on a per byte basis than random access rotating media.
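
The recovery order described above can be sketched as follows. This is
a minimal illustration only: the function name, the dict-of-files model,
and the timestamp field are assumptions made for the example, not part
of any existing backup tool.

```python
def restore(full_backup, incrementals):
    """Rebuild a file set from one full backup plus incremental backups.

    full_backup:  dict mapping file name -> contents (hypothetical model)
    incrementals: list of (taken_at, changed_files) pairs, in any order
    """
    files = dict(full_backup)  # start from the complete backup
    # Incrementals must be applied oldest first so later changes win.
    for _taken_at, changed_files in sorted(incrementals, key=lambda inc: inc[0]):
        files.update(changed_files)
    return files
```

Every incremental since the last full backup has to be read, which is
why complete recovery is expensive in this scheme.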

Today's very large, extremely cheap disks offer alternatives unavailable
a decade ago. One of these is an incrementally updated backup image.
In this scheme, a backup database image is periodically
incrementally resynchronized with the active version. The cost of
synchronization can be proportional to the volatility of the database,
so relatively frequent synchronizations are feasible. Following
synchronization, the backup image can be copied to streaming media
for archival purposes.
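
One way to picture resynchronization is a page-level copy driven by a
change counter. The sketch below is an assumption-laden illustration:
Page, BackupImage, changed_at, and synced_to are hypothetical names, and
nothing here reflects the actual InterBase on-disk structure.

```python
from dataclasses import dataclass, field


@dataclass
class Page:
    data: bytes
    changed_at: int  # hypothetical: global change counter value at last write


@dataclass
class BackupImage:
    pages: dict = field(default_factory=dict)  # page number -> bytes
    synced_to: int = 0  # change counter value covered by the last sync


def synchronize(active, change_counter, backup):
    """Copy pages changed since the backup's last sync; return pages copied."""
    copied = 0
    for number, page in active.items():
        if page.changed_at > backup.synced_to:
            backup.pages[number] = page.data
            copied += 1
    backup.synced_to = change_counter
    return copied
```

Only changed pages are copied, so the data moved tracks the volatility
of the database. This sketch still scans every page header; a real
design would presumably keep a change list to avoid that pass.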

A problem inherent in an incrementally updated backup image is that
during synchronization the backup image is internally inconsistent
and that a disk crash during synchronization could result in loss
of both the primary database and the backup. To guarantee data
integrity, either the backup image must itself be copied before
synchronization or (better) two independent backups are used
alternately. The latter option has the effect of increasing
disk requirements to three times the active database. A side
effect of supporting multiple incrementally updated backup images
is that "last incremental" information cannot be stored solely in
the active database. There are obviously other solutions to this
problem that need to be explored and discussed.
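
As one possible arrangement (a sketch, not a proposal for the actual
design), each backup image can carry its own "last incremental" marker,
and each synchronization can target whichever image is older. The
Backup class, the synced_to field, and the page model below are all
hypothetical.

```python
class Backup:
    def __init__(self, name):
        self.name = name
        self.pages = {}     # page number -> data
        self.synced_to = 0  # kept with the backup, not in the active database


def pick_target(backups):
    """Choose the backup with the oldest sync mark; the other stays intact."""
    return min(backups, key=lambda b: b.synced_to)


def synchronize(active, change_counter, backups):
    """Bring the older backup up to date and return it.

    active maps page number -> (data, changed_at), where changed_at is a
    hypothetical global change counter value recorded at the last write.
    """
    target = pick_target(backups)
    for number, (data, changed_at) in active.items():
        if changed_at > target.synced_to:
            target.pages[number] = data
    target.synced_to = change_counter
    return target
```

Because the image not being written is never touched, a crash in the
middle of a synchronization leaves one consistent backup available.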

Any disaster recovery system capable of recovery from operational
errors must support either an undo or a redo log (some systems
support both). Although undo logs are probably better for recovery
from operational error, they must be applied to a current database,
making them all but useless for recovery from disk failure. Also,
unlike the original InterBase journalling scheme, the redo log
must contain index update records to avoid the need to recreate
indexes following a roll-forward operation.
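
A roll-forward over a redo log that carries index records might look
like the sketch below. The record layout (sequence, transaction, kind,
key, value) and the stop_before parameter are assumptions made for
illustration; they do not describe the InterBase journal format.

```python
def roll_forward(table, index, log, stop_before=None):
    """Replay committed work from a redo log onto a restored backup image.

    Each log record is a tuple (sequence, txn, kind, key, value), where
    kind is "row", "index", or "commit". Replay stops before the record
    whose sequence number equals stop_before, for example the onset of
    an operational error.
    """
    if stop_before is not None:
        log = [rec for rec in log if rec[0] < stop_before]
    committed = {rec[1] for rec in log if rec[2] == "commit"}
    for _seq, txn, kind, key, value in log:
        if txn not in committed:
            continue  # never committed before the stop point
        if kind == "row":
            table[key] = value
        elif kind == "index":
            # Index entries are replayed directly, so no rebuild is needed.
            index.setdefault(key, set()).add(value)
    return table, index
```

Since the index records are replayed alongside the row records, the
restored database is usable as soon as the replay finishes.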

Jim Starkey