Subject: Re: [firebird-support] Re: question about optimizing group by
Author: Ann W. Harrison
Post date: 2008-11-24T16:14:26Z

woodsmailbox,

>> Carrying the company name around in the sort costs almost nothing -
>> this (...) The sort comparison will check a few extra bytes... which
>> costs nothing compared with reading every row in each table.
>
> A real life query would involve many more columns than just
> company_name in the group-by list, and it would generally include a
> where clause. Does that change the story for the optimizer? i.e. would
> the large column number in the group-by affect speed then?

I guess it might, depending on what you mean by "many more". A WHERE
clause (or anything else that limits the number of rows returned
before the group by) will make the sort faster, regardless of the
size of the key.
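
For concreteness, here's a sketch of the shape of query we're talking
about - the invoices table, the join, and the date filter are made up
for illustration:

  -- The WHERE clause cuts the row count before the sort that
  -- implements the GROUP BY, so fewer records reach the sort,
  -- whatever the width of the grouping key.
  SELECT c.company_name, i.region, i.currency, SUM(i.amount)
    FROM invoices i
    JOIN companies c ON c.company_id = i.company_id
   WHERE i.invoice_date >= '2008-01-01'
   GROUP BY c.company_name, i.region, i.currency;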

> could it hit any limits for sort tables, etc.?

I doubt it. As long as you have space in /tmp, the sort will keep
building segments.

>> One thing to remember is that sorting data in memory is much faster
>> than reading it off disk, despite the nlog(n) factor.
>
> Assume the db and tmp directory sits on a ramdrive, so I only worry
> about the actual optimization effort.

Sigh. If you will keep introducing new dimensions to the problem,
my answers will get less and less useful. The Firebird optimizer is
predicated on a disk-resident database, where the overwhelming cost
of query processing is I/O. RAM disks change that significantly,
though there are still differences between the costs of access to
different memory sources.

Basically, if you're planning on releasing on a RAM disk, spend the
time to benchmark and tune your queries against your data rather
than speculating.
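
If it helps, isql makes a quick benchmark easy - SET PLAN and SET
STATS are standard isql settings; the query itself is whatever
candidate you're testing:

  SET PLAN ON;   -- print the access plan the optimizer chose
  SET STATS ON;  -- report elapsed time, reads, writes, fetches
  -- then run each candidate query against your real data and
  -- compare the numbers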

>> Aside from the time it will take to implement the hypothetical first
>> function, this one isn't much better than the original.
>
> Why?

You have to grab all the values out of the database, even though they're
duplicates, because the system can't know that they're duplicates. You
may be thinking the system first does the join, finding only matching
keys, then does the group by, keeping only the grouping fields, then
looks up other values. In fact, during the join, Firebird builds an
internal table containing all the fields that are referenced by the
query. It then sorts that table to produce the grouping, and finally
does any aggregation needed within a group.
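
To make that concrete, in a query like this one (names invented
again), the internal table holds company_name, region, and amount for
every joined row - not just the grouping keys:

  SELECT c.company_name, i.region, SUM(i.amount)
    FROM invoices i
    JOIN companies c ON c.company_id = i.company_id
   GROUP BY c.company_name, i.region;
  -- The join emits (company_name, region, amount) per row; that
  -- whole record is sorted to form the groups, and SUM(amount)
  -- is computed within each sorted group afterwards.
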
This makes a lot of sense if the cost of moving records from the
database is much higher than the cost of managing records in memory.
RAM disk suggests different assumptions of cost.

If your groups consist of 10-20 records, a FIRST (the hypothetical
random aggregate) doesn't save much over a sort. If the groups are
10,000 or more, it saves a lot.
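
As a sketch of the two shapes being compared - FIRST below is that
hypothetical aggregate, not something Firebird actually implements:

  -- Today: every descriptive column rides through the sort.
  SELECT c.company_id, c.company_name, c.address, SUM(i.amount)
    FROM companies c
    JOIN invoices i ON i.company_id = c.company_id
   GROUP BY c.company_id, c.company_name, c.address;

  -- Hypothetical: group on the key alone and pick one value
  -- per group for the rest.
  SELECT c.company_id, FIRST(c.company_name), FIRST(c.address),
         SUM(i.amount)
    FROM companies c
    JOIN invoices i ON i.company_id = c.company_id
   GROUP BY c.company_id;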

> And why I'm asking: If indeed, the speed isn't affected by the number
> and size of columns in the group-by clause (and whether there are
> compound indexes or not on those columns), then I wouldn't be looking
> for a solution. But that's exactly what I'm not convinced of yet,
> since I got quite different results with the different alternatives I
> tried.

Basically, your environment is very different from the environment for
which Falcon was developed - so benchmarks of your queries on your data
will be more helpful than any theories.

Good luck,
Ann