firebird-support - RE: [ib-support] Re: Geographic data

Subject	RE: [ib-support] Re: Geographic data
Author	Ann W. Harrison
Post date	2002-10-08T16:50:10Z

At 08:24 AM 10/8/2002 +1000, Alan McDonald wrote:

>Ann,
>can you explain what SET STATISTICS and this computed value does? How does
>it speed up selects?

In theory, Firebird uses a cost based optimizer. The major function
of the optimizer is to pick the optimal join order and the appropriate
indexes to support that order. There are a number of types of
join-order optimizers. The simplest is the "syntactic" optimizer,
which uses the order in which the tables were listed in the query.
The most intuitive is a "rule based" optimizer which understands
things like primary key and foreign key navigation paths and looks
for patterns that it likes. The best (in my opinion, but who's
opinion would you expect from me?) is a cost-based optimizer which
evaluates different join orders and calculates which is likely to
have the smallest intermediate product*.

Firebird has the remains of a cost-based optimizer with some rule
based stuff grafted on. It uses two types of information to compute
the cost of a join: the number of records in the table and the
selectivity of the index. It gets the approximate number of records
by counting the number of data pages in the table. That's only
an approximation, of course, but the difference (more small records
or fewer large records) is offset by the fact that the cost is
not in the number of records (generally) but in the number of page
reads.

When an index is created, recreated, or activated, Firebird computes
the selectivity (number of entries versus the number of distinct
values) and stores it in the system table that describes the index.
SET SELECTIVITY does the same thing, without rebuilding the index -
a considerable saving. The selectivity is NOT reset when entries
are added to the index or deleted from it.

How does all this work? Assume that all the referenced fields
are indexed in this query.

SELECT C.NAME, O.ORDER_NUMBER
FROM CUSTOMERS C
JOIN ORDERS 0 ON O.CUST_ID = C.CUST_ID
JOIN ORDER_ITEMS OI ON OI.ORDER_ID = O.ORDER_ID
WHERE C.STATE = 'MA' AND OI.STATUS = 'BACK ORDER';

The six possible join orders are C->O->OI, C->OI->O, O->C->OI
O->OI->C, OI->C->O, and OI->0->C. Within those are several
possible index paths...

Since you and I are humans
and work reasonably well with rules, we'd probably eliminate
everything but C->O->OI and OI->O->C. A cost based system
would evaluate the options like this:

C->O->OI cost is size of customers times selectivity of the
state index on customers (250,000 records * 1/50 = 50,0000)
times the selectivity of the customer_id index on orders
times the size of orders (500,000 * 1/250,000 = 2) times
the selectivity of the order_id index on order_items times
the size of order_items (1,500,000 * 1 / 500,000 = 3).
That gives a cost of 150,000.

... and so on.

Random thoughts

A more sophisticated system might compute separate selectivities
for the pieces of a compound index (e.g. for an index on A, B, C,
it would have a selectivity for A and AB as well as ABC).

It's also possible to get a selectivity histogram that represents
the number of values that fall within particular ranges. That's
less useful for Firebird because it is data sensitive. Firebird's
optimizer runs during request compilation, not during execution,
so it doesn't have the actual data values for the request. A
histogram might, however, be used to recognize certain anomalous
data distributions like a table that has a "date due" field that
has the value "PAID" for 90% of the entries. The index is valuable
if you're looking for records with a data due less than today, but
not if you're looking for records of payments that have already
been made.

* The intermediate product is the number of pages that must be read
to produce the desired result.

Regards,

Ann
www.ibphoenix.com
We have answers.