Subject: SOME SOLUTIONS TO OLD PROBLEMS
Author: dcalford
OK, here is some of what I am putting together. It is far from complete
because I forgot that it is Father's Day and (as I am a new father) I
have had a few other obligations to take care of.

I am posting the first bit and will add to it over the next few nights.

I am going to post some SQL sample code when I get into the final
details; right now I am concentrating on some of the basic concepts and
why we went with the design we did.

I hope this helps some of you out there in some way.

best regards

Dalton
----------------------------------------------------------------

Introduction
24x7x365 Overview - Method of Approach
Hardware Configuration
Server Configuration
Different Methods for Backing up your Data
Replication
Reversing User Actions
Advanced Techniques


------------------------------------------------------------------------

Introduction

OK,

First, let me say that this is a mental core dump; there are bound to be
mistakes, typos, forgotten gotchas, and all sorts of things to make even
the most forgiving person out there groan with frustration.

This is not meant to be a final draft or anything near it. If you do
not like this, don't read it (or better yet, point out areas where I am
making mistakes or not being clear in what I am presenting).

This is not meant to be looked upon as the only method of approach for
the problems put forward; in fact, after reading this, a few of you
will no doubt come up with even better methods to accomplish the same
thing. THIS IS MY SUBTLE HINT TO SAY: SHARE THE IDEAS. We all
benefit, and your overall karma will be improved.....
:)

------------------------------------------------------------------------

24x7x365 Overview - Method of Approach

Many people have seen the term 24x7 and think that the solution is based
upon the database software or the hardware platform. The problem is that
no software or hardware vendor can even begin to create a package that
addresses the particular needs of any one developer's situation. A
vendor must make tradeoffs in order to satisfy the largest target
market. This means that a developer has to really analyse their needs
and design the best solution that applies to them. The first question a
developer needs to ask is whether the tools at hand are versatile enough
to allow the creation of a design that handles all the application's
criteria.

InterBase, due to a radical (at the time) design approach, is extremely
well suited to a 24x7 configuration, as long as the developer
understands the server and the tools available.

With a proper database design, anyone can create a 24x7 system,
regardless of the hardware, the operating system (including NT), or
(heaven forbid) different server software.

The other issue is hardware design; even with the most expensive systems
money can buy, failures will occur.

To begin with, I will state the author's basic design approach that led
to the decisions the code was developed under.

1) Hardware is cheap.

Hardware is getting cheaper by the day. In fact, with the
current growth in the speed and storage of computers, we will
soon be able to fit everything into almost nothing with
infinite speed (joke).

2) Time is expensive.

Even if you or your staff are not bringing down a few million an
hour, time is a precious thing. This means that you want to
be able to fix a problem in as little time as possible, so that
you can spend time with your family or tackle new and more
challenging problems.

3) Pay now or Pay Later.

This is an underlying philosophy of mine. If you spend the
time up front, you will reduce the time it takes later. If
you do not do the work now, you will end up doing three times
the work when you are faced with the problems of a poorly
thought out design.

4) Everything Fails

No matter how well you plan, you will have problems. This is
self-explanatory, and if you do not understand this, do a MAN
or INFO lookup on Murphy and/or Entropy.

Now that you understand my thoughts on things, I will proceed with the
problems that need to be tackled.


------------------------------------------------------------------------

Hardware Configuration


OK, let's look at the different items, one section at a time.

CPU(s)

If you are on a SuperServer configuration, then with the current
InterBase design you will not really get any benefit out of using
multiple CPUs.
This is not based upon the operating system, as was previously reported;
it is due to the fact that when the multi-threading version of IB was
designed, the developers optimized for single-CPU systems and as such
made decisions that do not scale well to multiple CPUs. They also
wrote the new API to only really support the threading model. So if you
are going to use SuperServer InterBase, do not waste your money on an
SMP system.

On the other hand, if you need a lot of processing power, Classic is the
way to go. I personally have had no need for the new services provided
by the SuperServer model, and I definitely prefer the Classic version.
I use multi-processor systems extensively and will not upgrade to any
SuperServer version that does not fully support multiple CPUs
properly.

MEMORY

This depends upon whether you are using the Classic or SuperServer
configuration. SuperServer has a shared memory cache, which is very
good for saving memory, and it keeps the metadata loaded after the first
connection to the database. With Classic, every connection starts a new
process on the server, so the program needs to load and then the
metadata needs to load - this makes Classic slower on the initial
connection than SuperServer. If you have enough memory and a good
configuration, you will not notice the speed difference. I allocate
50 MB for a RAM disk and configure isc_config to use it as the first
temp area (I allocate the physical temp drive after the RAM drive is
full). I then allocate 150 MB for the OS and other miscellaneous items
(less if Linux, more if NT). I then allocate 15 MB per process for
SuperServer and 30 MB per process for Classic. That means I would use
500 MB for a system optimized for speed to service 20 connections (150
for the OS, 50 for the RAM disk, 300 for SuperServer) or 800 MB for
Classic. You can use less than this, but I am giving the numbers I use
for the best performance. I have seen connections use as little as
10 MB and as much as 50 MB each under the Classic configuration, but
with 30 MB set aside for each connection, the server can handle
short-term high loads without a noticeable effect. The whole reason for
this design is to allow for extra capacity when another member of the
system fails.

DRIVES

One thing a lot of people do not take into consideration is the physical
properties of the drives they use. They buy an extremely fast drive and
put everything (OS, SWAP, TEMP and DATA) on that one drive. The
drive thrashing alone will slow down almost everything you do. It is
better to buy multiple cheaper drives and separate them into specific
uses. The next thing to think about is how those drives connect to
your computer.
You can get simple IDE drives. They can transfer a lot of data and they
can be extremely fast, but they can also take up to 20% of your CPU's
processing power, because the IDE interface is totally controlled by the
CPU.
SCSI drives are extremely good, but remember, the SCSI bus operates at
the speed of its slowest member. That means that if you go out and
spend 80% of your budget on high-end controllers and extremely fast
drives, then put a tape drive on the same SCSI channel, you have now
made the entire unit operate as if it was designed in the 1980s.
The best configuration I have found is based on a mixed system.

I put in 2 IDE drives on the first IDE channel. I fit them into hot
swap bays (around 16 bucks at your local corner computer store). I use
these drives as my boot/OS drives. The concept is that if I
misconfigure the OS in any way, I can quickly shut down, swap the
drives, and restart in as little time as possible. With NT it is
difficult to copy configurations to a second drive, but there is
software (I personally use a package called Ghost) that will allow you
to copy an NT boot drive onto a second drive. With Linux you can simply
copy the files as needed.
This saves you when a faulty service pack or kernel update goes wrong.

I put the CDROM onto the second IDE channel and leave it at that.

I then approach the SCSI configuration.

You can get a motherboard with onboard SCSI controllers, usually split
into two channels. One channel will handle the older 50-pin
connectors (such as CD-ROM or tape units usually have), while the other
channel is an UltraWide fast connector.
Even if your motherboard does not have an onboard controller, pick up a
controller, with or without onboard cache RAM. The maximum RAM I would
have on a controller like this is 32 MB. This controller would host
your SWAP and TEMP areas (preferably on separate channels, but not
absolutely necessary).

Now it comes to your data. I personally prefer a RAID configuration. I
refuse to use software-based RAID. The reason for this is that many of
the problems you run into are due to bugs in the OS, and it does not
make sense to have your data's safety net based upon the very item that
may be crashing around your ears. I like to use an intelligent SCSI
controller with multiple channels. I limit my configuration to three
drives per channel, and I try to use controllers that have at least 3
channels and a hardware cache that is expandable to at least 256 MB.
With this, you can create 3 RAIDs, each RAID consisting of 3 drives,
with each drive on a separate channel.
You can even go so far as to have multiple controllers, each with the
same configuration. If you have your database on one drive controller,
and its shadow on another drive controller, you effectively have a
situation where you could lose 60% of your drives, including one drive
controller, and still not have any downtime for your clients.

I will post the CPU cases and hot swap bays we use in another posting.

A tape backup device can be valuable, but it does not match having a
second system that can be replicated to.

If you are doing direct replication, I suggest having two network cards
in each machine. If you are only doing replication between two
machines, you can use a null (crossover) Ethernet cable to connect the
two servers while you connect to your internal network via the other
network card.
If you are using multiple servers, use a dedicated etherswitch for the
servers and a second concentrator for the client connections.
You need to be good at setting up your routing, but this will increase
your operational/replication speed, and if one concentrator or switch
dies, the servers can still communicate with each other and with the
clients.

The entire concept here is that you can lose one entire system or its
support equipment without your clients realizing what has happened.

This is a very high-level view of the hardware configuration, but it
should give you ideas on how to set up this sort of system.

------------------------------------------------------------------------

Server Configuration


OK, people ask me what changes I have made to the cache buffers, etc.,
on the IB server. The truth is, other than using an 8192 page size as
the default, I make absolutely no changes to the basic configuration of
IB.
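
The page size has to be chosen when the database is created or restored,
so it is not something you tune afterwards. As a minimal sketch in
InterBase SQL (the file path and credentials below are placeholders, not
from the original post):

    /* Sketch only: path, user and password are placeholders.        */
    /* The page size is fixed at creation time.                      */
    CREATE DATABASE '/data/mydb.gdb'
      USER 'SYSDBA' PASSWORD 'masterkey'
      PAGE_SIZE 8192;

The same page size can be applied to an existing database by restoring
it with gbak's page size switch.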

Here is what I do.

I put the operating system on one drive.
I place the swap space on another drive (not just a different partition
on the same drive).
I place the temp space on another drive.
I place the data onto a drive or series of drives that have no other
purpose.
I create a RAM disk and set up my isc_config file to use the RAM disk
first as my primary temp space.
If I am using Linux, I create another RAM disk, large enough to hold the
complete InterBase program. I use a bash script to copy the InterBase
software onto this RAM disk and link it back so that inetd will spawn
the version from the RAM disk rather than the version from the hard
drive.
This is still under test, so I do not have hard numbers on the
performance gains this actually gives.

With the database, I fully populate it. What I do is, after creating
the database (either from scratch or from a backup), create a temporary
table and fill it with random data until all the files that comprise the
gdb are full. I then drop the temporary table and sweep the database.
This gets rid of file fragmentation, and the server no longer needs to
ask the OS for disk space when it is inserting new records. This gives
a very significant speed improvement.
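
A rough sketch of that populate-and-drop step, run from isql (the table
name, procedure name, and filler text are invented for illustration; in
practice the filler should be random, varied text so that IB's record
compression does not shrink it away):

    /* Sketch only: all names here are placeholders.                 */
    CREATE TABLE TMP_FILLER (ID INTEGER, PAD VARCHAR(10000));

    SET TERM ^ ;
    CREATE PROCEDURE FILL_DATABASE (ROWS_TO_ADD INTEGER)
    AS
    DECLARE VARIABLE I INTEGER;
    BEGIN
      I = 0;
      WHILE (I < ROWS_TO_ADD) DO
      BEGIN
        /* Replace the literal below with generated random text.     */
        INSERT INTO TMP_FILLER (ID, PAD)
          VALUES (:I, 'random filler text goes here');
        I = I + 1;
      END
    END ^
    SET TERM ; ^

    /* Call repeatedly until the gdb files are full, e.g.:           */
    EXECUTE PROCEDURE FILL_DATABASE (100000);

    /* Once the files are full, drop the filler table...             */
    DROP TABLE TMP_FILLER;
    /* ...then run gfix -sweep against the database from the OS.     */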

------------------------------------------------------------------------

Different Methods for Backing up your Data

Problem One, GBAK

I have a database that is extremely large, but I only want to back up
certain tables because the majority of my database is static.
OR
I have a database that only has small amounts of changes, and I want to
be able to perform incremental backups with fixed time points.
OR
My backup takes so long that from the time I start the backup to the
time my restore is finished, a few (hours/weeks/months) have passed, and
so the backup is totally useless because of the differences between it
and the original database......

OR
[INSERT GBAK COMPLAINT OF CHOICE]

Gbak is one of the biggest complaints that I have seen from large
database users. I remember that one of my first discussions with Markus
was about a database that took days to back up and restore.

Let's look at GBAK: what is it, why is it the way it is, and why was it
designed that way?

Gbak is an end-user program. It is no different from the program you are
writing in [insert language of choice] to connect to InterBase. It is
very simple in concept.
Gbak looks at the on-disk version of the database. Based upon what
version it finds, it reads the system tables (the RDB$XXX tables) and
writes that information out to a file stream.
It then outputs the data that is in the database to the same file
stream. You have all sorts of choices (including a few that use the
undocumented API of IB, such as not performing garbage collection), but
overall you can take Jason's IB Objects and write your own gbak with
any options you like. I say this because for a previous client I did
exactly that. The biggest trick is to understand the order of
extraction/insertion of objects and data.
Gbak was written this way so that it can run while other users are
connected to the database. If you want a quick backup, shut down the IB
server and perform an OS-level backup. You can also create a shadow
file, shut down the IB server, rename the first file of the shadow,
restart IB, and connect to the server (it will complain in the log about
the shadow not being there, but as far as IB is concerned, it will just
assume the shadow disk failed and will not stop the login). You can then
drop the shadow definition or create a new one. You can then connect to
the newly renamed shadow file with gfix and make it a standalone GDB
file. This gives you the least amount of downtime (you can shut down IB,
rename the file, and restart IB, all in under a minute, and even do it
from a batch file on NT). It also gives you a snapshot of the database
that can be backed up to tape or copied to another disk using good old
OS-based tools, bypassing gbak entirely.
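
A minimal sketch of the SQL side of that shadow trick (the file paths
are placeholders, and the rename and gfix steps happen at the OS prompt
while the server is down):

    /* Sketch only: file paths are placeholders.                     */
    /* 1. While the database is live, create a shadow on another     */
    /*    drive or controller.                                       */
    CREATE SHADOW 1 '/backup/mydb.shd';

    /* 2. Shut down the IB server, rename /backup/mydb.shd to        */
    /*    /backup/snapshot.gdb at the OS level, then restart IB.     */

    /* 3. Reconnect to the live database and remove the now-missing  */
    /*    shadow definition (or create a fresh one).                 */
    DROP SHADOW 1;

    /* 4. gfix -activate /backup/snapshot.gdb turns the renamed      */
    /*    shadow into a standalone database that OS-level tools can  */
    /*    back up or copy while the original stays in service.       */
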
For those of us for whom even a minute of downtime is too long, there is
another method.

Roll Forward Logs.

I remember a whole discussion on the mers lists about 6 months or so
ago, with people complaining about the lack of roll forward logs, and
others stating that roll forward logs are of no use.
I will state that roll forward logs are extremely useful and have saved
me many a time over the past few years.

What is a roll forward log, you ask?
Well, a roll forward log is a complete log of all changes
(insert/update/delete) made within the database.
You can have it work in many different ways. I will describe the thought
process we went through and the mistakes we made, so that you can
understand the decisions we made and why.

Our first attempt was to have a series of triggers on each table of the
database (in position 0, with no other triggers at the same position)
that inserted the previous values into a shadow table that existed as an
external table. (Wait for it, I know all the red flags are up, but like
I said, I am explaining the process, not the final product yet.) Well,
to begin with, external tables hold text values extremely well, but when
you are holding date and numeric values, sometimes the casting was no
good (don't even begin to ask about blobs or arrays). We could have
worked around that problem with a little work, but that still does not
overcome the most important hurdle of that approach - EXTERNAL TABLES
ARE OUTSIDE OF TRANSACTION CONTROL.

For those of you who do not know, the little-understood entity called
the transaction is your best friend in the whole world. Every operation
in IB is done within a transaction; it is the nature of the beast. That
means, for those of you who are using the BDE with autocommit and are
inserting one record at a time, that each record is a single
transaction. This has multiple effects. When you insert into the
database and a trigger fires, inserting records into another table, then
if you roll back your work, the records in all the other affected tables
are rolled back as well.
If your roll forward log is not within transaction context, you will
have a list of changes in your log that were reversed out and so do not
exist in your main database.
This is the problem with using UDFs to record your log. After the UDF
has written your changes to the external file, there is no way for that
UDF to be notified that a rollback has occurred.
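
To make the transaction point concrete, here is a tiny sketch in isql
terms (the table name is invented; assume it has a position-0 trigger
that writes to a log table inside the database):

    /* Sketch only: SOME_TABLE and its trigger are hypothetical.     */
    INSERT INTO SOME_TABLE (ID, NAME) VALUES (1, 'test');
    /* The trigger has now written a row to the in-database log.     */
    ROLLBACK;
    /* Both the SOME_TABLE row and its log row are gone, so the log  */
    /* stays consistent. A UDF writing to an external file at the    */
    /* same point would have left an orphan entry behind.            */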

Nope, the only way to safely create a log is to have it live within your
database. So we are back to the format of what that log looks like.
After creating a whole series of shadow tables for each table in our
database, we found the complexity far too hard to maintain.
So how do we create the log?

We looked at different methods.
The first method was to have a table holding a very large CHAR field,
which we populated with an SQL statement that reproduced the insert,
update, or delete action that took place.
Again, this handled most datatypes, but blob and array data were out of
the question.

The second method was to have two tables, a master/detail pair that
looked something like this:

MASTER table
Fields
SURROGATE_KEY = Unique key value obtained from a generator.
TABLE_NAME = Name of the table the action took place on.
ACTION_TYPE = (I)nsert; (U)pdate; (D)elete.
KEY_LINK = A text field holding the text of the WHERE clause, e.g.
'WHERE PK_FIELD1=VAL1 AND PK_FIELD2=VAL2'. We took this approach so
that if the table in question had multiple fields making up its primary
key, the record could still be properly found. This value was generated
by the trigger inserting these records.
TIME_STAMP = Lets us recover the order of operations.

CHILD table
Fields
LINK_KEY = Link to the MASTER table record via SURROGATE_KEY.
FIELD_NAME = Name of the field in question.
VAL_VARCHAR
VAL_NUMBER
VAL_DATE
VAL_BLOB
etc.

So for every change to the database, one record was inserted into the
MASTER table, and a record was inserted into the CHILD table for every
field changed.
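
As a rough illustration only (every name below is invented, and the real
system used one such trigger per logged table), the pair of tables and a
logging trigger might look something like this:

    /* Sketch only: table, field, generator and trigger names are    */
    /* placeholders, not the original schema.                        */
    CREATE GENERATOR GEN_LOG_KEY;

    CREATE TABLE LOG_MASTER (
      SURROGATE_KEY  INTEGER NOT NULL PRIMARY KEY,
      TABLE_NAME     VARCHAR(31),
      ACTION_TYPE    CHAR(1),       /* (I)nsert, (U)pdate, (D)elete  */
      KEY_LINK       VARCHAR(512),  /* text of the WHERE clause      */
      TIME_STAMP     DATE           /* order of operations           */
    );

    CREATE TABLE LOG_DETAIL (
      LINK_KEY       INTEGER NOT NULL, /* LOG_MASTER.SURROGATE_KEY   */
      FIELD_NAME     VARCHAR(31),
      VAL_VARCHAR    VARCHAR(2000),
      VAL_NUMBER     NUMERIC(15,4),
      VAL_DATE       DATE
    );

    /* One position-0 trigger per logged table; this one logs only a */
    /* single column of a hypothetical CUSTOMER table on update.     */
    SET TERM ^ ;
    CREATE TRIGGER LOG_CUSTOMER_UPD FOR CUSTOMER
      ACTIVE AFTER UPDATE POSITION 0
    AS
    DECLARE VARIABLE NEW_KEY INTEGER;
    BEGIN
      NEW_KEY = GEN_ID(GEN_LOG_KEY, 1);
      INSERT INTO LOG_MASTER (SURROGATE_KEY, TABLE_NAME, ACTION_TYPE,
                              KEY_LINK, TIME_STAMP)
        VALUES (:NEW_KEY, 'CUSTOMER', 'U',
                'WHERE CUST_ID=' || CAST(OLD.CUST_ID AS CHAR(12)),
                'NOW');
      IF (OLD.CUST_NAME <> NEW.CUST_NAME) THEN
        INSERT INTO LOG_DETAIL (LINK_KEY, FIELD_NAME, VAL_VARCHAR)
          VALUES (:NEW_KEY, 'CUST_NAME', OLD.CUST_NAME);
    END ^
    SET TERM ; ^

The single CHILD table with one VAL_ column per type is what the next
paragraph refines into one detail table per primary domain/datatype.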

We were doing quite well with the second method, but we found that we
needed to use some custom values, domains, etc., and this caused us to
be constantly adding to the field types to be saved. This meant that we
needed to be constantly changing the metadata of a table that might
currently be in use. In addition, we would need to change all sorts of
stored procedures that ended up reading these tables.
We ended up with a separate table for each primary domain/datatype -
this also made it very easy to perform the upgrade to version 6, where
there are new datatypes (we are not currently running on 6, but the
upgrade testing has been performed).

Now that all the changes were logged, we needed some way to extract them
from the database. I fell back on my favorite tool, the IB_BOT.

--------------------------------------------------

To be continued.......................................