Subject | Re: [firebird-support] Re: question about optimizing group by
---|---
Author | Ann W. Harrison
Post date | 2008-11-21T22:23:43Z
woodsmailbox wrote:
> Below are 4 equivalent selects. Question is, wouldn't the last query
> perform faster than the others?

I'm going to assume that there's a 1:1 relationship between company_id
and company_name and that company_id is the primary key of companies.

> select
>   c.company_id,
>   c.company_name,
>   count(1) as employee_count
> from
>   companies c
>   inner join company_employees ce on ce.company_id = c.company_id
> group by
>   c.company_id,
>   c.company_name -- SUPERFLUOUS INCLUSION OF company_name in SORT
>                  -- TABLE or INDEX SCAN

Carrying the company name around in the sort costs almost nothing - this
query would not use an index on companies because there's no restriction
on it, and the optimizer will always choose the smaller table as the
inner loop of a join. So you'll do a natural scan of companies and an
indexed access to company_employees on company_id, then sort the result
by company_id, company_name. The sort comparison will check a few extra
bytes... which costs nothing compared with reading every row in each
table.
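To make the shape of the result concrete, here is a small sketch of the original query using Python's sqlite3 module rather than Firebird, with made-up sample tables and data (`companies`, `company_employees` column layouts are assumptions taken from the query text):

```python
import sqlite3

# Hypothetical schema and data, inferred from the quoted query.
# sqlite3 is used as a stand-in for Firebird; the plan it chooses
# may differ, but the grouped result is the same.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE companies (
        company_id   INTEGER PRIMARY KEY,
        company_name TEXT
    );
    CREATE TABLE company_employees (
        employee_id INTEGER PRIMARY KEY,
        company_id  INTEGER REFERENCES companies(company_id)
    );
    CREATE INDEX ce_company ON company_employees(company_id);
    INSERT INTO companies VALUES (1, 'Acme'), (2, 'Globex');
    INSERT INTO company_employees VALUES (1, 1), (2, 1), (3, 2);
""")
rows = conn.execute("""
    SELECT c.company_id, c.company_name, COUNT(1) AS employee_count
    FROM companies c
    JOIN company_employees ce ON ce.company_id = c.company_id
    GROUP BY c.company_id, c.company_name
    ORDER BY c.company_id
""").fetchall()
print(rows)  # [(1, 'Acme', 2), (2, 'Globex', 1)]
```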
> ALTERNATIVE SYNTAX #1
>
> select
>   c.company_id,
>   min(c.company_name), -- SUPERFLUOUS INDEX SCAN TO COMPUTE min()
>   count(1) as employee_count
> from
>   companies c
>   inner join company_employees ce on ce.company_id = c.company_id
> group by
>   c.company_id

There's no index lookup of company_name to resolve this aggregate -
first, there's no need, since the group is sitting there in the sort
buffer. There will be a sort of the group, which could have some cost
if the groups are very large. On the other hand, the system might be
smart enough to sort by both company_id and company_name, in which case
the cost is the same as in the primary case.

One thing to remember is that sorting data in memory is much faster
than reading it off disk, despite the n log(n) factor.
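The "group is sitting there in the sort buffer" point can be sketched in plain Python with made-up joined rows: once the rows are sorted by company_id, min(company_name) falls out of each group for free, with no extra index access.

```python
from itertools import groupby

# Hypothetical joined rows: (company_id, company_name), one per employee.
rows = [(1, 'Acme'), (2, 'Globex'), (1, 'Acme'), (2, 'Globex'), (1, 'Acme')]

rows.sort(key=lambda r: r[0])          # the GROUP BY sort
result = []
for cid, grp in groupby(rows, key=lambda r: r[0]):
    grp = list(grp)                    # the group, already in the buffer
    result.append((cid, min(name for _, name in grp), len(grp)))
print(result)  # [(1, 'Acme', 3), (2, 'Globex', 2)]
```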
> ALTERNATIVE SYNTAX #2
>
> select
>   c.company_id,
>   c.company_name,
>   t.employee_count
> from
>   companies c
>   inner join (select company_id, count(1) as employee_count
>               from company_employees group by company_id)
> group by
>   c.company_id

Now this one is likely to be expensive, especially if you forget to add
an ON clause and do a full cross product ... wrong answer as well as
very slow. The optimizer may be good enough to evaluate the select
subquery first, then look up companies by primary key. But I wouldn't
bet on it.
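The cross-product trap can be demonstrated with sqlite3 standing in for Firebird (same hypothetical tables and data as above): with an ON clause the derived table yields one row per company, without one every company is paired with every per-company count.

```python
import sqlite3

# Made-up data; the point is only the row counts, not the planner.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE companies (company_id INTEGER PRIMARY KEY,
                            company_name TEXT);
    CREATE TABLE company_employees (employee_id INTEGER PRIMARY KEY,
                                    company_id INTEGER);
    INSERT INTO companies VALUES (1, 'Acme'), (2, 'Globex');
    INSERT INTO company_employees VALUES (1, 1), (2, 1), (3, 2);
""")
good = conn.execute("""
    SELECT c.company_id, c.company_name, t.employee_count
    FROM companies c
    JOIN (SELECT company_id, COUNT(1) AS employee_count
          FROM company_employees GROUP BY company_id) t
      ON t.company_id = c.company_id
""").fetchall()
bad = conn.execute("""
    SELECT c.company_id, c.company_name, t.employee_count
    FROM companies c
    CROSS JOIN (SELECT company_id, COUNT(1) AS employee_count
                FROM company_employees GROUP BY company_id) t
""").fetchall()
print(len(good), len(bad))  # 2 4 -- the cross product doubles the rows
```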
> ALTERNATIVE SYNTAX #3, with a hypothetical first() aggregate function:
>
> select
>   c.company_id,
>   first(c.company_name),
>   count(1) as employee_count
> from
>   companies c
>   inner join company_employees ce on ce.company_id = c.company_id
> group by
>   c.company_id

Aside from the time it will take to implement the hypothetical first()
function, this one isn't much better than the original.
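For what it's worth, the hypothetical first() aggregate can be sketched as a user-defined aggregate in Python's sqlite3 (Firebird offers no such shortcut; the class and data below are purely illustrative):

```python
import sqlite3

class First:
    """Hypothetical first() aggregate: keeps the first value it sees."""
    def __init__(self):
        self.value = None
        self.seen = False
    def step(self, v):
        if not self.seen:
            self.value, self.seen = v, True
    def finalize(self):
        return self.value

conn = sqlite3.connect(":memory:")
conn.create_aggregate("first", 1, First)
conn.executescript("""
    CREATE TABLE companies (company_id INTEGER PRIMARY KEY,
                            company_name TEXT);
    CREATE TABLE company_employees (employee_id INTEGER PRIMARY KEY,
                                    company_id INTEGER);
    INSERT INTO companies VALUES (1, 'Acme'), (2, 'Globex');
    INSERT INTO company_employees VALUES (1, 1), (2, 1), (3, 2);
""")
rows = conn.execute("""
    SELECT c.company_id, first(c.company_name), COUNT(1)
    FROM companies c
    JOIN company_employees ce ON ce.company_id = c.company_id
    GROUP BY c.company_id
    ORDER BY c.company_id
""").fetchall()
print(rows)  # [(1, 'Acme', 2), (2, 'Globex', 1)]
```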
Here's a hint from some years of experience developing several database
systems. Optimizers are good at handling simple statements. The more
you try clever stuff, the more likely they are to make mistakes.
Best,
Ann