Subject Re: [firebird-support] Re: question about optimizing group by
Author Ann W. Harrison
woodsmailbox wrote:
>
> Below are 4 equivalent selects. Question is, wouldn't the last query
> perform faster than the others?

I'm going to assume that there's a 1:1 relationship between company_id
and company_name and that company_id is the primary key of companies.

> select
> c.company_id,
> c.company_name,
> count(1) as employee_count
> from
> companies c
> inner join company_employees ce on ce.company_id = c.company_id
> group by
> c.company_id,
> c.company_name -- SUPERFLOUS INCLUSION OF company_name in SORT
> TABLE or INDEX SCAN

Carrying the company name around in the sort costs almost nothing - this
query would not use an index on company because there's no restriction
on it, and the optimizer will always choose the smaller table as the
inner loop of a join. So you'll do a natural scan of companies and
an indexed access to company_employees on company_id, then sort the
result by company_id, company_name. The sort comparison will check
a few extra bytes... which costs nothing compared with reading every
row in each table.


>
> ALTERNATIVE SYNTAX #1
>
> select
> c.company_id,
> min(c.company_name), -- SUPERFLOUS INDEX SCAN TO COMPUTE min()
> count(1) as employee_count
> from
> companies c
> inner join company_employees ce on ce.company_id = c.company_id
> group by
> c.company_id

There's no index lookup of company_name to resolve this aggregate -
first, there's no need, since the group is sitting there in the sort
buffer. There will be a sort of the group, which could have some
cost if the groups are very large. On the other hand, the system
might be smart enough to sort by both company_id and company_name,
in which case the cost is the same as the primary case.

One thing to remember is that sorting data in memory is much faster
than reading it off disk, despite the nlog(n) factor.

>
> ALTERNATIVE SYNTAX #2
>
> select
> c.company_id,
> c.company_name,
> t.employee_count
> from
> companies c
> inner join (select company_id, count(1) as employee_count from company_employees group by company_id)
> group by
> c.company_id

Now this one is likely to be expensive, especially if you forget to add
an ON clause and do a full cross product ... wrong answer as well as
very slow. The optimizer may be good enough to evaluate the select
subquery first, then lookup companies by primary key. But I wouldn't
bet on it.

>
> ALTERNATIVE SYNTAX #3 with a hypothetical first() aggregate function:
>
> select
> c.company_id,
> first(c.company_name),
> count(1) as employee_count
> from
> companies c
> inner join company_employees ce on ce.company_id = c.company_id
> group by
> c.company_id
>

Aside from the time it will take to implement the hypothetical first
function, this one isn't much better than the original.

Here's a hint from some years of experience developing several database
systems. Optimizers are good at handling simple statements. The more
you try clever stuff, the more likely they are to make mistakes.

Best,

Ann