Subject RE: [firebird-support] Slow performance with Index
Author Edwin A. Epstein, III
>>First, no professional programmer would ever send 7 million records
>>across a network to a client application, so I'm going to assume that
>>whatever you're doing is happening on the server.

Welllll. Not 7 million records across the network but between 100K and 2
million records across the network. I don't really see that I have a
choice. I have to export those records one way or another for processing. I
am using a 100 mbit network and a machine right next to the server to obtain
these records. I am up for suggestions on the fastest way.

There are huge number of duplicates. I will not ever have to query the
database for records with a flag value = 0, only records with a flag value =
1.

According to your definition I have an extremely volatile database. An
average of 1 million records per week being inserted into the datbase. In
addition to the insert, updates are happening on the records in about the
same amount. The DNC_QUED field I am using to indicate that the record
needs further processing outside of the server. This does not happen only
one time. This may happen multiple times on the same record in a monthly
period.

-----Original Message-----
From: Ann W. Harrison [mailto:aharrison@...]
Sent: Thursday, February 03, 2005 9:06 AM
To: firebird-support@yahoogroups.com
Subject: Re: [firebird-support] Slow performance with Index



Edwin A. Epstein, III wrote:
> I have a table with 14 million records in it. As part of certain
operations
> I need to do against the records I flag them individually as part of a
que.
> I am using a field called DNC_QUED which is a VarChar(1) and ASCII
character
> set. I have non-unique index on just the field alone. There are no null
> values and the only values are '1' or '0'.

First, no professional programmer would ever send 7 million records
across a network to a client application, so I'm going to assume that
whatever you're doing is happening on the server.

The elementary rules about indexes include not indexing fields that
aren't very selective - the example given is normally gender. There are
counter examples - a gender index is useful when searching for females
in the set of military officers prior to 1920. Oracle has a bitmap
index designed to help in indexing values with huge numbers of
duplicates. But you're using Firebird, so let me explain why this index
can be a good idea in a few cases and why it's a bad idea in other cases.

If you have very few (a few tens of thousand out of 14 million) records
with DNC_QUED = 1, and want to find those records, the index will help.
That's good , but doesn't generalize all that well, since from time to
time you will need to get the other records and the index will hurt
performance. You can tell Firebird not to use index in one of two ways:
invert the equality (i.e. DNC_QUED NOT = 1), or turn the value into an
expression (e.g. DNC_QUED + 0 = 0).

If you know that the records are clustered by value, the index will
improve performance by restricting the number of pages you need to read

What does that mean and how could you know that? When you insert a row,
Firebird puts it in the most convenient empty space. Space is
convenient if it is on a page that is currently in the cache, if it's on
a page allocated to the table and not already full, if it's on a page
allocated to the database but not currently in use. Otherwise, Firebird
extends the database file and allocates a new page to a table. If
you've got a volatile database - lots of records stored, modified, and
erased - records will tend to be all over the place. If your database
is more archival, mostly storing, very little modification or erasing,
then newer records will be at the end and older records at the beginning.

So, for a volatile database, records won't tend to be clustered by age,
but for an archival database, they will. If you know that older records
will have one value for DNC_QUED and newer records will have the other
value, then there's hope that the index will do you some good because
all the records with one value will be on one set of pages and all the
other pages will contain only records of the other value.

The cost of a retrieval is basically the number of pages you have to
read. If every page of the table has some data you need, a simple table
scan is faster than reading the index, constructing a bit map of the
records that match the criteria, then reading every page.

But in general, don't index two-valued fields.

Regards,


Ann






Yahoo! Groups Links