Subject Re: [IB-Architect] Backups of large database & super transactions
Author Charlie Caro
Sometimes it's useful to examine the comparative architecture of other vendors.
While I might provoke the ire of the InterBase nation, just hear me out before
announcing to the world, "Pox to Charlie!"

An InterBase database is a repository of data consisting of a single-level
schema, and it exists as an autonomous administrative unit. The physical
structure of an InterBase database is the following:

Database
  File 1
    Table 1
    Index 1
    Index 2
    Table 2
  File 2
    Table 1 (cont'd)
    Table 3
  File 3
    Index 1 (cont'd)
  ...
  File N

An InterBase database is a multi-file database consisting of a contiguous array
of database pages. All database objects may be randomly mapped over the
multi-file space.
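
For illustration only, here is a little Python sketch of that page-to-file
mapping. It is not the real page allocator and the file sizes are invented; it
just shows the sense in which the files hold one contiguous array of pages:

    # A minimal sketch, not the real InterBase page allocator: it only shows
    # how a multi-file database forms one contiguous array of page numbers,
    # with each file holding the next slice of that array.  File sizes invented.
    FILE_PAGES = [50000, 50000, 25000]   # pages in File 1, File 2, File 3

    def locate_page(page_no, file_pages=FILE_PAGES):
        """Map a database-wide page number to (file index, offset within file)."""
        for file_index, pages in enumerate(file_pages):
            if page_no < pages:
                return file_index, page_no
            page_no -= pages
        raise ValueError("page number beyond the end of the database")

    print(locate_page(0))       # (0, 0)     -- first page of File 1
    print(locate_page(60000))   # (1, 10000) -- lands in File 2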

Other database vendors have defined containment relationships between multiple
partitions (Oracle, Informix) or areas (Rdb/VMS). If all physical (page
pointers) and logical (referential integrity, triggers) relationships are
restricted to database objects within the same area/partition, then it becomes
possible to consider each area/partition as a separate administrative unit for
the purposes of backup and recovery. (To tell the truth, I know those vendors
have partitions, but do they define the type of containment relationship for
backup/restore that I am suggesting?)
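
To make the containment rule concrete, here is a rough Python sketch (nothing
like the engine's internals; the object names, their areas and the reference
list are made up). The check just says: a reference that crosses areas breaks
the ability to treat each area as its own backup/restore unit:

    # Which area each database object lives in (hypothetical catalog).
    object_area = {
        "TABLE_10": "Finance",
        "INDEX_10A": "Finance",
        "TABLE_100": "Sales",
    }

    # (from_object, to_object) pairs: page pointers, foreign keys, triggers, ...
    references = [
        ("INDEX_10A", "TABLE_10"),    # fine: both objects live in "Finance"
        ("TABLE_100", "TABLE_10"),    # crosses "Sales" -> "Finance"
    ]

    def cross_area_violations(object_area, references):
        """References that would prevent treating each area as its own unit."""
        return [(a, b) for a, b in references
                if object_area[a] != object_area[b]]

    print(cross_area_violations(object_area, references))
    # [('TABLE_100', 'TABLE_10')]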

The organization of such a database might be:

Database
  Area "Root"              Area "Finance"    Area "Sales"      Area "Manufacturing"
    File 1                   File 1            File 1            File 1
      Table 1                  Table 10          Table 100         Table 200
      Index 1
      Index 2                  *                 *                 *
      Table 2
    File 2
      Table 1 (cont'd)         *                 *                 *
      Table 3
    File 3
      Index 1 (cont'd)         *                 *                 *
    ...
    File N


Each area/partition might be considered a separate InterBase database. However,
this organization has "denormalized" multiple databases into the same database.
For a single-site DBMS instance, this would allow cross-area (database) joins
using the existing InterBase programming environment. Each area is extended
separately to avoid hot spots in the database.

Or different areas could be labeled "production", "semi-archival", "archival".
The "production" area would be a smaller amount of data that could be backed up
nightly. The "semi-archival" might be backed up weekly. And the "archival" would
be backed up fortnightly or monthly.
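
As a sketch of that tiering (the area names come from this note; the
frequencies and the calendar rules below are only my own assumptions):

    # Only an illustration of the tiering above, not a recommended schedule.
    import datetime

    BACKUP_SCHEDULE = {
        "Production": "daily",
        "Semi-archival": "weekly",
        "Archival": "monthly",
    }

    def areas_due(today):
        """Return the areas whose backup falls on the given date."""
        due = []
        for area, frequency in BACKUP_SCHEDULE.items():
            if frequency == "daily":
                due.append(area)
            elif frequency == "weekly" and today.weekday() == 6:   # Sundays
                due.append(area)
            elif frequency == "monthly" and today.day == 1:        # 1st of month
                due.append(area)
        return due

    print(areas_due(datetime.date(2000, 7, 2)))   # a Sunday: daily + weekly areas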

Gbak would be modified to backup/restore the entire database or individual
areas. Gbak would also merge a database into another database as a new area or
extract an area as a separate database. Of course, this organization would be
transparent to most users who are happy today with a single, root area for all
their database objects.

It also obviates the need to make gbak perform parallel backups or restores. You
would just code a script with multiple gbak lines -- one for each area. (But it
might help if we improve SMP support in the engine.)
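
For example, something like the following Python script could drive one gbak
per area in parallel. The -b, -user and -password switches are today's gbak;
the -area switch does not exist -- it stands in for the per-area option
proposed above -- and the paths and credentials are placeholders:

    # One gbak process per area, launched in parallel.
    import subprocess

    AREAS = ["Root", "Finance", "Sales", "Manufacturing"]

    procs = []
    for area in AREAS:
        cmd = ["gbak", "-b",                      # -b = backup (real switch)
               "-area", area,                     # hypothetical per-area switch
               "-user", "SYSDBA", "-password", "masterkey",
               "bigdb.gdb", "bigdb_%s.gbk" % area.lower()]
        procs.append(subprocess.Popen(cmd))

    # Wait for every area backup to finish and report failures.
    for area, proc in zip(AREAS, procs):
        if proc.wait() != 0:
            print("backup of area %s failed (exit code %d)" % (area, proc.returncode))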

Why do I use the term "areas" instead of the more popular "partition" term? I've
thought about this database organization with respect to an InterBase
shared-disk cluster architecture. The idea was to create a local affinity
between users, servers and data that would minimize page pinging and
distributed lock manager chatter over an interconnect. Thus "areas" seemed more
appropriate than "partitions".

Since the idea has a tangential relationship to organizing and managing
backup/recovery for VLDBs, I thought I would bring it up.

Regards,
Charlie

Jason Wharton wrote:
>
> > First the problem, then the solution.
>
> I thought I had made the problem sufficiently clear. I'll recap and extend
> it some to clarify.
>
> Requirements:
> Huge database. (many GB worth of data)
> Database needs to stay on-line 24X7.
> 98% static & generally inserts only.
> In case of failure, downtime needs to be minimized.
> Immediate failover isn't critical.
> Cannot lose over one day's work but up to that is tolerable.
> All input sources are available for two days to be re-entered if necessary.
>
> Problem.
> Obtaining an efficient backup on a regular basis.
> Efficient meaning the total I/O and CPU toasted in the process, among other
> things.
>
> Problem extension #1:
> Same huge database.
> Needing to be sync'd with many remote copies on a daily basis.
> Real-time is not necessary.
> Remote copies are for inquiry only.
> Potentially low and/or expensive bandwidth constraints.
> New base versions are express-mailed via DVD media or DAT.
>
> Problem extension #2:
> Same huge database.
> DBA needs to cancel out an entire day's work due to certain problems
> encountered.
> Database needs to stay on-line.
>
> Have at it. That should cover most of it.
>