Subject: Re: [Firebird-Architect] RFC: RDB$COUNT
Author: Ann W. Harrison
Post date: 2006-10-28T18:13:28Z
m. Th wrote:
> We know how costly is Count(*) and this isn't (mainly) a problem when
> someone does a Select count(*) from table1;

The reason count(*) is expensive is that it can't be resolved from
the index, as it can in many non-MVCC systems. The index doesn't
contain enough information to determine whether a particular record
is appropriate for a particular transaction. This applies to counts
of full tables and counts restricted by indexed fields.
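The point above can be illustrated with a minimal sketch, assuming a toy version-chain layout (this is not Firebird's actual on-disk format): each record is a chain of versions tagged with the writing transaction, so a count must walk the chains and apply visibility rules per transaction, which an index entry alone cannot answer.

```python
# Toy MVCC visibility sketch (not Firebird internals): each record is
# a chain of versions, newest first, each tagged with the transaction
# that wrote it and whether it is a delete stub.
records = {
    1: [{"txn": 30, "deleted": False}, {"txn": 10, "deleted": False}],
    2: [{"txn": 40, "deleted": True},  {"txn": 10, "deleted": False}],
    3: [{"txn": 50, "deleted": False}],
}

def visible_count(snapshot_txn, committed):
    """Count records whose newest version committed before the
    snapshot is not a delete stub."""
    count = 0
    for versions in records.values():
        for v in versions:
            if v["txn"] <= snapshot_txn and v["txn"] in committed:
                if not v["deleted"]:
                    count += 1
                break  # the newest visible version decides
    return count

committed = {10, 30, 40}
print(visible_count(20, committed))  # 2: only the txn-10 versions are visible
print(visible_count(45, committed))  # 1: record 2's delete is visible, record 3's insert isn't
```

Two snapshots over the same table get different counts, which is exactly why no single stored number can serve them both.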
> The main problem AFAIK is the optimizer which must know the cardinality
> of each relation/table which must be joined and, also, the indexes must
> have their selectivity updated (this is the ideal case).

Actually, the optimizer gets a good estimate of the size of a table
by reading the pointer pages to get the total number of data pages.
The result is not precise - a data page may be mostly empty or full
of back versions of records, and records are compressed for storage.
However, the inaccuracies tend to cancel out, and the estimate reflects
the cost that the optimizer needs, since it's trying to minimize the
number of page reads - a page read costs the same whether there's one
record on the page or one hundred.
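As a rough sketch of that kind of estimate (the page size, fill factor, and average record size here are invented for illustration, not Firebird's values), cardinality can be inferred from the data-page count alone:

```python
# Hypothetical page-based cardinality estimate. All constants are
# illustrative assumptions, not Firebird's actual numbers.
PAGE_SIZE = 4096          # bytes per data page
AVG_RECORD_BYTES = 60     # guessed average compressed record size
FILL_FACTOR = 0.7         # pages are rarely completely full

def estimate_cardinality(data_pages):
    """Rough row-count estimate from a pointer-page scan."""
    return int(data_pages * PAGE_SIZE * FILL_FACTOR / AVG_RECORD_BYTES)

# The optimizer costs plans in page reads, so per-row error washes
# out: reading a page costs the same with 1 row on it or 100.
print(estimate_cardinality(250))
```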
> Perhaps is better to add in RDB$RELATIONS a new column RDB$COUNT (type
> BIGINT) which holds the cardinality of a table (ie. the number of the
> *commited* records).

Now there's a really bad idea, but not a new one. When I started
in the business, lots of databases kept counts. A bit of probing
uncovered the fact that the count fields were serious hot-spots,
since every committing transaction had to update the same field.
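The hot-spot effect can be sketched with a lock standing in for the row lock on the shared count field (a simplification, not any engine's actual locking): every committer, whatever table rows it touched, must queue on this one lock.

```python
# Sketch of the hot-spot: every committing transaction must update
# one shared field, so commits serialize on that single row lock.
import threading

count_lock = threading.Lock()   # stands in for the lock on the count field
table_count = 0
commit_order = []

def commit(txn_id, delta):
    global table_count
    with count_lock:            # only one committer at a time
        table_count += delta
        commit_order.append(txn_id)

threads = [threading.Thread(target=commit, args=(i, 1)) for i in range(8)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(table_count)  # correct total, but every commit queued on one lock
```

The count comes out right; the cost is that eight otherwise-independent commits ran one at a time through the shared field.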
> This will update at the end of each transaction
> with something like New.RDB$COUNT=Old.RDB$Count+_ROWSAFFECTED(from
> inserts)-_ROWSAFFECTED(from deletes).

That's interesting. Does it update in the system context or the
transaction context?

If it updates in the system context, the result can't be used to
support a count(*) from a snapshot transaction, because the value
may reflect changes made after the transaction started - and can
change while the transaction is running.

Of course, the plus side of updating in the system context is that
it works, which updating in the transaction context doesn't, since
a snapshot transaction won't see changes made while it is running
and will get an update conflict if a concurrent transaction has
updated the value. Basically, that serializes all inserts and deletes.
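The transaction-context failure mode can be sketched as follows (a simplified model of snapshot-isolation write rules, not Firebird's implementation): a committer may only overwrite a version of the count row written by a transaction its snapshot can see, so the second concurrent writer always conflicts.

```python
# Sketch of why transaction-context maintenance fails under snapshot
# isolation: the second committer finds RDB$COUNT changed by a
# transaction it cannot see, and gets an update conflict.

class UpdateConflict(Exception):
    pass

count_row = {"value": 100, "writer_txn": 5}  # last committed writer

def try_update_count(my_snapshot, my_txn, delta):
    # Simplified rule: you may only overwrite a version written by a
    # transaction visible to your snapshot.
    if count_row["writer_txn"] > my_snapshot:
        raise UpdateConflict(f"txn {my_txn}: row changed since snapshot")
    count_row["value"] += delta
    count_row["writer_txn"] = my_txn

try_update_count(my_snapshot=10, my_txn=11, delta=1)      # ok: writer 5 is visible
try:
    try_update_count(my_snapshot=10, my_txn=12, delta=1)  # conflict: writer 11 isn't
except UpdateConflict as e:
    print(e)  # every concurrent insert/delete collides on this same row
```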
> Result:
> 1. The optimizer will know in any moment the cardinality of relations
> from its transaction context, thus choosing (in a faster way) a better plan.

The optimizer already has a good estimate of cardinality, which is
automatically maintained.
> 2. SET STATISTICS <index> will be obsolete, reducing once more the
> administration costs.

Statistics refers to the selectivity of indexes, not the cardinality
of tables. Nothing you've mentioned touches on that issue.
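The distinction can be shown with toy data (the uniform-selectivity formula below is a common textbook simplification, not necessarily how SET STATISTICS computes it): cardinality is how many rows there are; selectivity is how many rows share each key value.

```python
# Cardinality vs. index selectivity on a toy indexed column.
rows = ["red", "red", "blue", "green", "red", "blue"]

cardinality = len(rows)           # what a stored RDB$COUNT would hold
selectivity = 1 / len(set(rows))  # fraction of rows per distinct key
print(cardinality)                # 6
print(round(selectivity, 3))      # 0.333
```

A maintained row count would refresh the first number but tell the optimizer nothing about the second, which is what SET STATISTICS recomputes.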
> 3. If you really want to optimize select count(*) from table1...

This isn't a problem worth solving. Counting whole tables
is a trick used by lazy people creating benchmarks. It really
doesn't have a lot of real-world applicability.
Regards,
Ann