Subject Re: [firebird-support] Optimizer not very good at IN/EXISTS
Author Svein Erling Tysvaer
Hi Mathias, I'll comment your queries within the text as they appear.

Mathias Dellaert wrote:
> Greetings,
>
> I've noticed that the optimizer seems to be very inefficient when
> working with IN/EXISTS clauses. Here's an example
>
> SELECT * FROM SALES S
> WHERE SID IN (SELECT MIN(SID)
> FROM SALES S2
> GROUP BY Product)
>
> Uses plan:
>
> PLAN (S2 ORDER IDX_SALES_BASE)
> PLAN (S NATURAL)

One deficit with Firebird 1.5 is that if never showed ORDER ... INDEX
... for any table in the plan ('table' meaning 'alias'). I'm far from
certain that it doesn't use the S2.SID index under the hood. In general,
seeing any GROUP BY within a subselect ought to make you see if the
query can be rewritten one way or the other (I don't think I've never
had a GROUP BY in a subselect).

> (IDX_SALES_BASE is an index over the product field, SID is the primary
> key)
>
> Similarly, using an EXISTS:
>
> select * from SALES S
> WHERE EXISTS (SELECT Product
> FROM SALES S2
> GROUP BY Product
> HAVING MIN(S2.SID) = S.SID )
>
> Uses the same plan.

Yup, that should be the same query. A far better way to do this (in my
opinion) would be

SELECT * FROM SALES S
WHERE NOT EXISTS(SELECT * FROM SALES S2
WHERE S2.PRODUCT = S.PRODUCT
AND S2.SID < S.SID)

> Where-as the "equivalent" query (in Firebird 2.0 RC2)
>
> select Sales.* from
> SALES,
> (SELECT MIN(SID) as SID
> FROM SALES
> GROUP BY product) as Blah
> WHERE Sales.SID = Blah.SID
>
> Uses plan:
>
> PLAN JOIN (BLAH SALES ORDER IDX_SALES_BASE, SALES INDEX (PK_SALES))
>
> Note how this one actually uses the primary key.

Here Firebird is free to evaluate s2 before s, hence it doesn't have to
generate the subselect for every potential row (at least I think
Firebird has to do the subselect in Firebird 1.5 as many times as there
are rows in the SALES table).

> Another note: simpler queries generate similar plans, like (and yes I
> realise this is a silly query)
>
> SELECT * FROM SALES S
> WHERE SID IN (SELECT SID
> FROM SALES S2
> WHERE product='PC')
>
> Which uses
>
> PLAN (S2 INDEX (PK_SALES))
> PLAN (S NATURAL)
>
> Which seems even less efficient to me (since it doesn't even use the
> product index).

No, that's in my opinion a decent choice of indexes. An ideal optimizer
could be able to figure out that it in this query simply could eliminate
the subselect altogether, but in theory the subselect could refer to a
field in the outer select in its WHERE clause, so it is understandable
that S2 has to be executed for each row in S. Since it at this point has
the PK available, adding the product index would only slow things down.

The one thing that I don't think you quite grasp, is that a subselect is
executed as many times as there are rows in the 'outer' select (well, at
least it used to do that). Say your SALES table has 10000 rows. The
subselect is then executed 10000 times, and if you have 100 products,
that would means 10000000 'subresults'.

> For comparison:
>
> SELECT SID FROM SALES S2 WHERE product = 'PC'
>
> Uses
>
> PLAN (S2 INDEX (IDX_SALES_BASE))
>
> Now my question: are there any plans to improve the behaviour of the
> IN/EXISTS clause? Is this a bug? It seems to me that now Firebird 2 has
> greater subquery support, there's plenty of room for improvement here.

No, it is not a bug. It wouldn't surprise me if there still is room for
improvement, but the queries you supplied is obscure enough not to not
make it to my wish list (if I'd had any).

Set