Subject: RFC: Data statistics
Author: Dmitry Yemanov
All,

As you understand, effective query optimization can be achieved only with an
adequately calculated cost. The two major criteria for the cost estimation
are cardinality and selectivity. The engine must support all required stored
statistics which can be useful to properly estimate these values. Currently
the only statistic calculated is the index selectivity, extended in FB2 to
support per-segment selectivities. This proposal describes which other types
of statistics can be used, how they should be stored and possible update
policies.

* Statistics types *

First of all, we should distinguish between data and index statistics. Data
statistics in turn consist of table level statistics (number of rows,
average row length, average page fill factor) and column level statistics
(value distribution histograms). Index statistics consist of b-tree
statistics (tree depth, average node length) and segment level statistics
(number of distinct values, number of nulls).
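
Just to make the grouping concrete, here is a rough sketch of how those
pieces might be laid out (purely illustrative; the names and field choices
are mine, not a proposed ODS structure):

  #include <cstdint>

  // Illustrative grouping of the statistics types listed above.
  // All names and types are hypothetical.
  struct TableStats
  {
      std::uint64_t rowCount;        // number of rows
      std::uint32_t avgRowLength;    // average row length, bytes
      std::uint16_t avgPageFill;     // average page fill factor, percent
  };

  struct ColumnStats
  {
      std::uint32_t bucketCount;     // value distribution histogram size
      // bucket boundaries and counts would follow
  };

  struct BtreeStats
  {
      std::uint16_t treeDepth;       // b-tree depth
      std::uint32_t avgNodeLength;   // average node length, bytes
  };

  struct SegmentStats
  {
      std::uint64_t distinctValues;  // number of distinct values
      std::uint64_t nullCount;       // number of nulls
  };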

* Storage *

Currently index selectivities are stored on an index root page and have a
read-only duplicate in RDB$STATISTICS. Some time ago I wondered why it's
done this way and I concluded that yet another chicken-and-egg problem
exists here. To read the RDB$STATISTICS value for a given index the engine
must prepare a system query, which requires cost-based optimization if
indices are available. Perhaps the mentioned duplication is not very nice,
but at least it works. There is a problem here, though, because the maximum
number of indices per relation depends on the available space on an index
root page. It means that in v2.0 you can create fewer indices per relation
than in v1.5. It's not a big issue yet, as the possible maximum is still
large enough, but if we keep extending the statistics stored on the index
root page, we may end up with a real limitation.

Generally, I see two ways to store statistics: low level pages or table
columns.

The former requires an extensible storage format to be designed, to avoid
limitations like the one described above. A possible solution would be a
special page type which may optionally exist for every index or relation.
The obvious drawback is an extra page fetch per statistics-aware schema
object during prepare. Also, since end users should be able to see the
stored statistical values as well, either a mirror blob in both
RDB$RELATIONS and RDB$INDICES is required or some new API call must be
provided which returns the stored statistical data to the client side.
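
To illustrate the page-based variant (a sketch only; the page type and the
stp_* field names are invented here), such a page could follow the usual
pattern of a typed page header followed by a payload area:

  #include <cstdint>

  // Hypothetical layout of a per-relation/per-index statistics page.
  struct stats_page
  {
      std::uint8_t  pag_type;        // new page type, e.g. pag_statistics
      std::uint8_t  pag_flags;
      std::uint16_t pag_checksum;
      std::uint32_t pag_generation;

      std::uint16_t stp_relation_id; // owning relation (or index)
      std::uint16_t stp_length;      // length of the statistics stream
      std::uint8_t  stp_data[1];     // versioned tag-based stream (see below)
  };

The extra fetch mentioned above would then be the read of this page for
every statistics-aware relation or index touched during prepare.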

The latter has the same chicken-and-egg problem. The only solution I can
think of is rule-based optimization for system queries. Given that the
schema size is much smaller than the data size and that indices on system
tables are quite selective, it should work fine, although it breaks the
unified optimizer behaviour.

I'd expect either a new page type or a table blob (or both) to have a
versioned tag-based binary structure, similar to DPB/EPB/etc. I'm not sure
yet whether value distribution histograms fit this structure effectively;
more thinking is needed here.
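
As a sketch of what such a DPB-like stream could look like on the writer
side (the tag names and numbers are invented for illustration, not proposed
constants):

  #include <cstdint>
  #include <vector>

  // Hypothetical tags for a versioned, clumplet-style statistics stream,
  // following the <tag, length, value> convention used by the DPB.
  const std::uint8_t stat_version1    = 1;
  const std::uint8_t stat_row_count   = 2;
  const std::uint8_t stat_avg_row_len = 3;
  const std::uint8_t stat_timestamp   = 4;

  void putClump(std::vector<std::uint8_t>& stream,
                std::uint8_t tag,
                const std::uint8_t* value,
                std::uint8_t length)
  {
      stream.push_back(tag);
      stream.push_back(length);
      stream.insert(stream.end(), value, value + length);
  }

Reading it back would just walk the same <tag, length, value> clumplets.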

I didn't even consider one column per statistics value, as that would
require an ODS change for every new statistics type added to the engine. I
also think that separate storage of e.g. table-level (in RDB$RELATIONS) and
column-level (in RDB$RELATION_FIELDS) statistics doesn't make much sense
either, as both can live inside RDB$RELATIONS (updated on every format
change).

Summarizing, we have three options for storing statistics:

1) Stored in blobs + rule-based optimization of system relations
2) Stored on a separate page + mirror read-only blobs
3) Stored on a separate page + new API call to retrieve statistical data

If anyone can think of something else, please tell.

Also, the statistics must contain the datetime value of the last update.

* Calculation *

When an index is created or re-activated, its statistics are updated. You
may also perform this operation manually via the SET STATISTICS statement.

Obviously, the SET STATISTICS statement requires extension. I propose the
following:

{ SET | DROP } STATISTICS [FOR] INDEX <index name>
{ SET | DROP } STATISTICS [FOR] TABLE <table name> [<scope>]
<scope> ::= { GLOBAL [<histogram size>] |
              COLUMN <column name> [<histogram size>] }

GLOBAL means computing the complete statistics, including all indices and
all columns. COLUMN allows distribution histograms to be created for
specific columns only. The default scope is GLOBAL. Other scope patterns,
such as ALL INDICES or ALL INDEXED COLUMNS, may also be considered.

<histogram size> is a threshold value which determines when a value-based
histogram is converted to a height-based one and vice versa.
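
My reading of that threshold (an assumption on my part): with no more
distinct values than <histogram size>, an exact value-based histogram can
be kept; otherwise the collector falls back to a height-based (equi-height)
histogram with that many buckets. Something like:

  #include <cstddef>

  // Hypothetical helper expressing the <histogram size> threshold.
  enum class HistogramKind { VALUE_BASED, HEIGHT_BASED };

  HistogramKind chooseHistogramKind(std::size_t distinctValues,
                                    std::size_t threshold)
  {
      // Few enough distinct values: store exact per-value counts.
      // Otherwise: equi-height buckets, 'threshold' of them.
      return (distinctValues <= threshold)
          ? HistogramKind::VALUE_BASED
          : HistogramKind::HEIGHT_BASED;
  }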

* GSTAT utility *

GSTAT must be modified to collect and show all supported statistics data.
It's also possible to add a switch which causes GSTAT to write the
collected statistics to the database (a command-line equivalent of the SQL
operations).

* Usage *

If some statistics are not found, estimations are used instead. An example
is the currently used estimation of a table's cardinality:
(page_count * page_size / avg_record_len).
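
Spelled out as code (the identifiers are illustrative), with a small worked
example:

  // Fallback cardinality estimation described above, used when no stored
  // statistics are available. E.g. 1000 pages * 4096 bytes / 100 bytes
  // per record gives an estimate of roughly 40960 rows.
  double estimateCardinality(double pageCount,
                             double pageSize,
                             double avgRecordLen)
  {
      return pageCount * pageSize / avgRecordLen;
  }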

Alternatively, estimations can also be used when the stored statistics
become badly outdated. I see two possible conditions to be applied here.
First, we may look at the statistics update timestamp and decide that
exceeding some [configurable] time interval means the data is outdated.
Second, we may check the estimated data size and decide that some
difference (e.g. 10K pages in the statistics vs 50K pages existing in the
PIP) also means the statistics are outdated.
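
Put together, the staleness test could look roughly like this (thresholds,
units and names are all hypothetical):

  #include <cmath>
  #include <cstdint>

  // Statistics are considered outdated if they are older than a
  // configurable interval, or if the page count recorded in them differs
  // too much from the page count currently visible in the PIP
  // (e.g. 10K pages stored vs 50K pages actually allocated).
  bool statisticsOutdated(std::int64_t statsTimestamp,   // last update, epoch seconds
                          std::int64_t now,              // current time, epoch seconds
                          std::int64_t maxAgeSeconds,    // configurable interval
                          double statsPageCount,         // pages recorded in statistics
                          double actualPageCount,        // pages counted from the PIP
                          double maxRelativeDiff)        // allowed relative difference
  {
      if (now - statsTimestamp > maxAgeSeconds)
          return true;

      return std::fabs(actualPageCount - statsPageCount)
          > maxRelativeDiff * statsPageCount;
  }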

I'd like to see the proposed infrastructure in the next major FB version,
with the statistics data extended when required.

Comments, please.


Dmitry