Subject Re: The IN(select...) predicate seems slow
Author Svein Erling Tysvær
--- In firebird-support@yahoogroups.com, "Greg At ACD" wrote:
> Platform: WinXP SP2
> DB Version: Firebird 1.5.1 Superserver
>
> I am having some interesting results with the IN () clause,
> particularly if there's a subquery within the IN clause.
>
> For example, I have an ASSET table with 100,000 records in it. The
> column ASSETID is a primary key.
>
> The following query:
>
> SELECT ASSET.ASSETID
> FROM ASSET
> WHERE ASSET.ASSETID in (36);
>
> This returns very quickly. Not surprisingly, the resulting plan is
> as follows:
> PLAN (ASSET INDEX (PK_ASSET))
>
> However, if I rework the query as follows:
>
> SELECT ASSET.ASSETID
> FROM ASSET
> WHERE ASSET.ASSETID in
> (select 36 from rdb$database);
>
> The query is MUCH slower, and the resulting plan is as follows:
> PLAN (ASSET NATURAL)
> PLAN (RDB$DATABASE NATURAL)
>
> So, it appears that the query is is doing a table scan for ASSETID =
> 36 instead of using the PK index.
>
> Note that the query is purposely simplified to show the issue I'm
> having. In practice, the query would be something to this affect:
>
> SELECT ASSET.ASSETID, ASSET.NAME, ASSET.FOLDERNAME
> FROM ASSET
> WHERE ASSET.ASSETID
> IN (SELECT DISTINCT JOINCATEGORYASSET.ASSETID
> FROM JOINCATEGORYASSET
> WHERE JOINCATEGORYASSET.CATEGORYID IN (14,15));
>
> So, in this case, the query is exceedingly slow since the subquery
> returns 2000 items, and it appears that there's a table scan of the
> asset table happening for each item (the query takes about 20
> seconds to return).
>
> I could remove the subquery altogether and just join everything up
> as well:
>
> SELECT distinct ASSET.ASSETID, ASSET.NAME, ASSET.FOLDERNAME
> FROM ASSET, JOINCATEGORYASSET
> WHERE ASSET.ASSETID = JOINCATEGORYASSET.ASSETID,
> and JOINCATEGORYASSET.CATEGORYID IN (14,15);
>
> ...but the DISTINCT portion of the query would make this a little
> questionable...
>
> Is there a better way of dealing with this type of query?
>
> Thx!
>
> Greg

Firebird has been improved a lot from the version I am currently using
(1.0.3) and may well do this implicitly in 1.5, but I would write the
query above like

SELECT ASSET.ASSETID, ASSET.NAME, ASSET.FOLDERNAME
FROM ASSET
WHERE EXISTS(SELECT *
FROM JOINCATEGORYASSET
WHERE JOINCATEGORYASSET.ASSETID = ASSET.ASSETID
AND JOINCATEGORYASSET.CATEGORYID IN (14,15));

Though this doesn't use any index for the ASSET table (like IN, EXISTS
is calculated on a record by record basis), if you want that to
happen, you either have to use JOIN or write a stored procedure (the
latter may give you the result you want and execute quick).

Also, why do you use DISTINCT within the subselect? I don't think it
causes any slowdown, but it is at best a meaningless addition.

Thinking about it, another option would be

SELECT ASSET.ASSETID, ASSET.NAME, ASSET.FOLDERNAME
FROM ASSET
JOIN JOINCATEGORYASSET J1
ON ASSET.ASSETID = J1.ASSETID
WHERE J1.CATEGORYID IN (14,15)
AND NOT EXISTS(SELECT *
FROM JOINCATEGORYASSET J2
WHERE J2.ASSETID = J1.ASSETID
AND J2.CATEGORYID IN (14,15)
AND J2.<PrimaryKey> < J1.<PrimaryKey>)

This should ascertain that only one instance of JOINCATEGORYASSET is
joined to each record in ASSET and eliminate any potential problems
with duplicate records in ASSET being significant. Also, it could be
quicker if the ASSET table is huge, whereas the JOINCATEGORYASSET
table is small.

HTH,
Set