Subject Re: RFC: RDB$COUNT
Author Adam
--- In Firebird-Architect@yahoogroups.com, "= m. Th =" <th@...> wrote:
>
> Hi,
>
> We know how costly Count(*) is, and the problem is not (mainly) that
> someone does a Select count(*) from table1;
> The main problem, AFAIK, is that the optimizer must know the
> cardinality of each relation/table to be joined, and also that the
> indexes must have their selectivity kept up to date (in the ideal case).
>
> Perhaps it would be better to add to RDB$RELATIONS a new column
> RDB$COUNT (type BIGINT) which holds the cardinality of a table (i.e.
> the number of *committed* records). This would be updated at the end
> of each transaction with something like
> New.RDB$COUNT = Old.RDB$COUNT + _ROWSAFFECTED(from inserts)
> - _ROWSAFFECTED(from deletes). The rows affected would be stored in an
> array, say aRowsAffected, with two columns (insert and delete) and
> [Select count(*) from rdb$relations] rows. One such array would be
> allocated for each active transaction. Or another solution (perhaps
> better): a single (global) array (let's call it RDB$DELTAS) with four
> columns: TID, RDB$Relation_ID, InsertCount, DeleteCount.
>
> So the real number of rows of table T1 as seen by transaction TID1
> would be, expressed in SQL:
> RDB$RELATIONS.RDB$COUNT
> + (Select InsertCount from RDB$DELTAS where TID=TID1 and
> RDB$Relation_ID=<relation id for the table T1>)
> - (Select DeleteCount from RDB$DELTAS where TID=TID1 and
> RDB$Relation_ID=<relation id for the table T1>);
>
> On commit, RDB$COUNT would be updated and the InsertCount/DeleteCount
> values reset. On rollback, only the values would be reset.
>
> Results:
> 1. The optimizer would know, at any moment, the cardinality of the
> relations in its transaction context, and could thus choose a better
> plan, faster.
> 2. SET STATISTICS <index> would become obsolete, further reducing
> administration costs.
> 3. If you really want to optimize select count(*) from table1...
>

I would say a significant cost of count(*) is that it performs garbage
collection as it visits each record (in fact, many applications that
must do bulk deletes rely on exactly this behaviour to remove the
garbage).

I cannot say whether your methodology would work or not. There is
obviously going to be extra overhead on insert, delete, commit, and
rollback for something that answers only a single question: what is
the total row count? It does not help you determine something like

select count(*)
from allrecords
where processedtime is null;
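For concreteness, the proposed bookkeeping could be sketched in miniature like this (Python; the class and method names are invented for illustration, since Firebird would implement this inside the engine, not in client code):

```python
# Hypothetical sketch of the proposed per-transaction delta tracking.
# 'committed' plays the role of RDB$COUNT, 'pending' the role of RDB$DELTAS.
class Deltas:
    def __init__(self):
        self.committed = {}   # relation -> committed row count (RDB$COUNT)
        self.pending = {}     # (tid, relation) -> [InsertCount, DeleteCount]

    def insert(self, tid, rel, n=1):
        self.pending.setdefault((tid, rel), [0, 0])[0] += n

    def delete(self, tid, rel, n=1):
        self.pending.setdefault((tid, rel), [0, 0])[1] += n

    def visible_count(self, tid, rel):
        # RDB$COUNT + own InsertCount - own DeleteCount
        ins, dels = self.pending.get((tid, rel), (0, 0))
        return self.committed.get(rel, 0) + ins - dels

    def commit(self, tid):
        # Fold this transaction's deltas into the committed count.
        for (t, rel), (ins, dels) in list(self.pending.items()):
            if t == tid:
                self.committed[rel] = self.committed.get(rel, 0) + ins - dels
                del self.pending[(t, rel)]

    def rollback(self, tid):
        # Discard the deltas; the committed count is untouched.
        for key in [k for k in self.pending if k[0] == tid]:
            del self.pending[key]
```

Note that this structure can only ever answer "how many rows in total"; a filtered count such as the processedtime query gets no help from it.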

Keep in mind, too, that most systems designed for MGA do not try to
substitute 'exists' for '(count(*) > 0)'. Perhaps more benefit could
be obtained by having the optimiser recognise this common mistake and
automatically substitute the exists logic in such cases.
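A minimal demonstration of the two forms, using Python's sqlite3 purely for illustration (SQLite is not MGA, but the emptiness-test pattern is the same):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t1 (id INTEGER)")
conn.executemany("INSERT INTO t1 VALUES (?)", [(i,) for i in range(1000)])

# The common mistake: materialises a full count just to test for emptiness.
nonempty_count = conn.execute("SELECT COUNT(*) > 0 FROM t1").fetchone()[0]

# The equivalent EXISTS form: may stop at the first matching row.
nonempty_exists = conn.execute("SELECT EXISTS(SELECT 1 FROM t1)").fetchone()[0]

assert nonempty_count == nonempty_exists == 1
```

Both queries return the same boolean answer, which is what would make the automatic rewrite safe.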

I don't think I have any code that selects an unbounded count over a
table with more than a couple of records, so the performance hit
doesn't particularly bother me, especially when I weigh it against the
benefit of readers not blocking writers and writers not blocking
readers.

Adam