| Subject | RFC: Data statistics |
|---|---|
| Author | Dmitry Yemanov |
| Post date | 2005-01-23T09:39:48Z |

All,

As you understand, effective query optimization can be achieved only with an adequately calculated cost. The two major criteria for cost estimation are cardinality and selectivity. The engine must maintain whatever stored statistics are needed to estimate these values properly. Currently the only statistic calculated is index selectivity, extended in FB2 to support per-segment selectivities. This proposal describes which other types of statistics can be used, how they should be stored, and possible update policies.
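For context, here is the role these two quantities play in a cost model. This is a toy illustration of the principle, not the engine's actual cost calculation:

```python
def estimated_rows(cardinality, selectivity):
    # cardinality: total number of rows in the table
    # selectivity: fraction of rows a predicate is expected to match (0..1)
    # the optimizer's row estimate for a predicate is roughly their product
    return max(1, round(cardinality * selectivity))

# an index over 1,000,000 rows with selectivity 0.001 -> ~1,000 matching rows
rows = estimated_rows(1_000_000, 0.001)
```

The better the stored cardinality and selectivity values, the closer this estimate is to reality, and the better the chosen plan.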

*Statistics types*

First of all, we should distinguish between data and index statistics. Data statistics in turn consist of table-level statistics (number of rows, average row length, average page fill factor) and column-level statistics (value distribution histograms). Index statistics consist of b-tree statistics (tree depth, average node length) and segment-level statistics (number of distinct values, number of nulls).
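The taxonomy above could be modeled roughly as follows. This is a hypothetical sketch for illustration only; none of these names come from the engine:

```python
from dataclasses import dataclass, field

@dataclass
class TableStats:
    # table-level data statistics
    row_count: int
    avg_row_length: float
    avg_page_fill: float          # fraction of each data page in use

@dataclass
class ColumnStats:
    # column-level data statistics: a value-distribution histogram
    histogram: dict               # e.g. value -> frequency

@dataclass
class IndexStats:
    # b-tree statistics
    tree_depth: int
    avg_node_length: float
    # segment-level statistics, one entry per index segment
    distinct_values: list = field(default_factory=list)
    null_count: list = field(default_factory=list)
```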

*Storage*

Currently, index selectivities are stored on an index root page and have a read-only duplicate in RDB$STATISTICS. Some time ago I wondered why it's done this way, and I concluded that yet another chicken-and-egg problem exists here. To read the RDB$STATISTICS value for a given index, the engine must prepare a system query, which requires cost-based optimization if indices are available. Perhaps the mentioned duplication is not very nice, but at least it works. There is a problem here, however: the maximum number of indices per relation depends on the available space on an index root page. This means that in v2.0 you can create fewer indices per relation than in v1.5. It's not a big issue yet, as the possible maximum is still big enough, but if we keep extending the statistics stored on the index root page, we may end up with a real limitation.

Generally, I see two ways to store statistics: low-level pages or table columns.

The former requires an extendable storage format to be designed, to avoid limitations like the one described above. A possible solution would be a special page type which may optionally exist for every index or relation. The obvious drawback is an extra page fetch per statistics-aware schema object during prepare. Also, since end users should be able to see the stored statistical values as well, either a mirror blob in both RDB$RELATIONS and RDB$INDICES is required, or some new API call must be provided which returns the stored statistical data to the client side.

The latter has the same chicken-and-egg problem. The only solution I can think of is rule-based optimization for system queries. Given that the schema size is much smaller than the data size, and that the indices on system tables are quite selective, it should work fine, although it breaks the unified optimizer behaviour.

I'd expect either a new page type or a table blob (or both) to have a versioned tag-based binary structure, similar to the DPB/EPB/etc. I'm not sure yet whether value distribution histograms fit this structure effectively; more thinking is needed here.
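As an illustration of what such a structure might look like, here is a sketch modeled loosely on the DPB-style clumplet layout: a version byte followed by repeated tag/length/value triples. The tag numbers and helper names are made up for the example:

```python
import struct

STATS_VERSION = 1
TAG_ROW_COUNT = 1        # hypothetical tag numbers, not real ODS constants
TAG_AVG_ROW_LENGTH = 2

def encode_stats(items):
    """Encode (tag, bytes) pairs as: version byte, then tag/length/value."""
    buf = bytearray([STATS_VERSION])
    for tag, value in items:
        buf.append(tag)
        buf.append(len(value))   # one-byte length, as in DPB clumplets
        buf.extend(value)
    return bytes(buf)

def decode_stats(data):
    """Decode back into (version, [(tag, bytes), ...])."""
    version, pos, items = data[0], 1, []
    while pos < len(data):
        tag, length = data[pos], data[pos + 1]
        items.append((tag, data[pos + 2 : pos + 2 + length]))
        pos += 2 + length
    return version, items

# round-trip: a row count stored as a little-endian 64-bit integer
blob = encode_stats([(TAG_ROW_COUNT, struct.pack('<q', 1_000_000))])
version, items = decode_stats(blob)
```

The appeal of this layout is exactly the one claimed in the text: unknown tags can be skipped by length, so new statistics types can be added without an ODS change.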

I didn't even consider one column per statistics value, as that tends to require an ODS change for every new statistics type added to the engine. I also think that separate storage of, e.g., table-level (in RDB$RELATIONS) and column-level (in RDB$RELATION_FIELDS) statistics doesn't make much sense either, as both can live inside RDB$RELATIONS (updated on every format change).

Summarizing, we have three options for statistics:

1) Stored in blobs + rule-based optimization of system relations

2) Stored on a separate page + mirror read-only blobs

3) Stored on a separate page + new API call to retrieve statistical data

If anyone can think of something else, please tell.

Also, statistics must contain the datetime value of the last update.

*Calculation*

When an index is created or re-activated, its statistics are updated. You may also perform this operation manually via the SET STATISTICS statement. Obviously, the SET STATISTICS statement requires extension. I propose the following:

{ SET | DROP } STATISTICS [FOR] INDEX <index name>

{ SET | DROP } STATISTICS [FOR] TABLE <table name> [<scope>]

<scope> ::= { GLOBAL [<histogram size>] | COLUMN <column name> [<histogram size>] }

GLOBAL means computing the entire statistics, including all indices and all columns. COLUMN allows creating distribution histograms for certain columns only. The default scope is GLOBAL. Other scope patterns, like ALL INDICES or ALL INDEXED COLUMNS, may also be considered.

<histogram size> is a threshold value which determines when a value-based histogram is converted to a height-based one and vice versa.
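To illustrate the distinction, here is a simplified sketch; the threshold semantics are my reading of the proposal, not a specification. A value-based histogram keeps one bucket per distinct value, which is only practical while the number of distinct values stays at or below the threshold; beyond it, an equi-height histogram with a fixed number of buckets is used instead:

```python
def build_histogram(values, threshold):
    """Return ('value', {v: count}) while distinct values <= threshold,
    otherwise ('height', bucket_boundaries) for an equi-height histogram."""
    distinct = sorted(set(values))
    if len(distinct) <= threshold:
        # value-based: exact frequency per distinct value
        freq = {}
        for v in values:
            freq[v] = freq.get(v, 0) + 1
        return ('value', freq)
    # height-based: boundaries chosen so each bucket covers roughly
    # the same number of rows
    ordered = sorted(values)
    step = len(ordered) / threshold
    boundaries = [ordered[min(int(i * step), len(ordered) - 1)]
                  for i in range(1, threshold + 1)]
    return ('height', boundaries)
```

A value-based histogram gives exact selectivity for low-cardinality columns; the height-based form bounds storage for high-cardinality ones at the cost of precision.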

*GSTAT utility*

GSTAT must be modified to collect and show all supported statistics data. It's also possible to add a switch which causes GSTAT to write the collected statistics to the database (a command-line equivalent of the SQL operations).

*Usage*

If some statistics are not found, estimations are used. An example is the currently used estimate of a table's cardinality: page_count * page_size / avg_record_len.
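A worked instance of that estimate, with illustrative numbers only:

```python
def estimate_cardinality(page_count, page_size, avg_record_len):
    # total bytes occupied by the table's data pages, divided by the
    # average record length, approximates the number of rows
    return page_count * page_size // avg_record_len

# e.g. 10,000 pages of 4 KB each with ~80-byte records -> ~512,000 rows
rows = estimate_cardinality(10_000, 4096, 80)
```

The estimate ignores page fill factor and per-page overhead, which is exactly why stored table-level statistics (average fill factor, exact row count) would improve on it.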

Alternatively, estimations can also be used when the stored statistics become badly outdated. I see two possible conditions to apply here. First, we may look at the statistics update timestamp and decide that some [configurable] time interval means the data is outdated. Second, we may check the estimated data size and decide that some difference (e.g. 10K pages in the statistics vs 50K pages existing in the PIP) also means outdated statistics.
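Those two conditions could be sketched as follows; the default thresholds are hypothetical and would correspond to the [configurable] settings mentioned above:

```python
from datetime import datetime, timedelta

def stats_outdated(updated_at, stats_pages, actual_pages,
                   max_age=timedelta(days=7), max_ratio=2.0):
    """True if stored statistics should be ignored in favour of estimations."""
    # condition 1: statistics older than a configurable time interval
    if datetime.utcnow() - updated_at > max_age:
        return True
    # condition 2: the real table size (pages allocated per the PIP) has
    # drifted too far from the size recorded with the statistics
    if stats_pages and actual_pages / stats_pages > max_ratio:
        return True
    return False
```

With the example from the text, 10K pages recorded vs 50K pages in the PIP is a 5x drift, which the second condition would flag regardless of the timestamp.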

I'd like to support the proposed infrastructure in the next major FB

version, extending statistics data when required.

Comments, please.

Dmitry
