firebird-support - Re: [firebird-support] IN, sub-select and select distinct

Subject	Re: [firebird-support] IN, sub-select and select distinct
Author	Jerome Bouvattier
Post date	2004-03-12T14:28:01Z

Arno,

> > > > select pres.productid
> > > > from sys_idlists L1
> > > > inner join presence pres on L1.id = pres.outletid
> > > > inner join sys_idlists L2 on pres.status = L2.id
> > > > where
> > > > L1.groupid = 1 and L2.groupid = 2
> > > > and pres.valuedate >= '2003-09-01'
> > > > and pres.valuedate < '2003-11-01'
> > > >
> > > > --> 481ms
> > > > --> 590ms with distinct
> > > > --> 590ms with group by
> > > >
> > > > Again, sys_idlists contains only the 9 required records.
> > >
> > > Also show us the PLANs.
> >
> > With fetchall, results become :
> >
> > "Explicit IN"
> >
> > naked : 541ms
> > distinct : 451ms
> > group by : 581ms
> >
> > Plans are :
> > PLAN (PRES INDEX (PRESENCE_IDX1,RELATION_262,RELATION_262,
> > RELATION_262,RELATION_262,RELATION_262,RELATION_262))
>
> > PLAN SORT ((PRES INDEX (PRESENCE_IDX1,RELATION_262,RELATION_262,
> > RELATION_262,RELATION_262,RELATION_262,RELATION_262)))
>
> In fact it is wierd that the "distinct" is faster as the "naked" one, but

> guess that's due the cache or the returning records are a huge different
> between "naked" and "distinct" (then would sending/receiving records over
> the wire be expensive here).
>
> > PLAN (PRES ORDER RELATION_296)
>
> This explains why a GROUP BY is somewhat slower.
>
> > "JOIN" versions
> >
> > naked : 922ms
> > distinct : 711ms
> > group by : 711ms
> >
> > Plans are :
> > PLAN MERGE (SORT (L2 NATURAL),SORT (JOIN (L1 NATURAL,
> > PRES INDEX (RELATION_262))))
>
> > PLAN SORT (MERGE (SORT (L2 NATURAL),SORT (JOIN
> > (L1 NATURAL,PRES INDEX (RELATION_262)))))
>
> Same "strange" situation as without JOIN.
>
> > PLAN SORT (MERGE (SORT (L2 NATURAL),SORT (JOIN
> > (L1 NATURAL,PRES INDEX (RELATION_262)))))
>
> Now you've the same PLAN as the "distinct" and thus the speed is the same.
>
> > > > I can't believe there is no way to achieve the "explicit IN"
> performance
> > > > with a JOIN, a sub-select or anything else.
> > >
> > > A JOIN or sub-select needs always disk reads and there's no way to
> compare
> > > that with hard given constants.
> >
> > I understand but the sys_idlists table really only contains the ids for

> > IN predicates. So, I thought sub-selects would perform just as well as
> hard
> > constants.
>
> In fact it should come very close.

Ah ! Good news. Hopefully I'll come up with the well-crafted statement then.
;-)

> > ..PRES.OUTLETID IN (112191, 112193, 112194, 112200, 112201, 112205)
> > ..pres.outletid in (select id from sys_idlists where groupid = 1)
> >
> > But results are worse than with JOINs. I'm above the 16s in each case.
>
> 16 seconds difference? What are both times?
> How many records does sys_idlists have? Could you provide some more

details

> about the tables in question. I'm interested in why the difference is so
> huge, want i don't understand at the moment. Okay, some difference is what

> expected.

Here are some more details..

Table "presence" contains 1.130.000 records
Table "sys_idlists" contains only the 9 required records

Queries recordcounts
with duplicates : 20856
without duplicates : 6939

DDLs :

CREATE TABLE PRESENCE (
ID INTEGER NOT NULL,
CLASSID D_PRES_CLASSID NOT NULL,
OUTLETID INTEGER NOT NULL,
PRODUCTID INTEGER,
SALEPRICE NUMERIC(9,2),
SALEPRICEINCVAT D_BOOLEAN DEFAULT 'Y' NOT NULL,
FACING SMALLINT,
VALUEDATE DATE NOT NULL,
STATUS D_PRES_STATUS NOT NULL,
RECORDLASTUPDATE TIMESTAMP NOT NULL,
RECORDLASTUPDATEBY INTEGER NOT NULL
);

Separate indexes are set on ID, OUTLETID, PRODUCTID, VALUEDATE

CREATE TABLE SYS_IDLISTSs (
GROUPID INTEGER NOT NULL,
ID INTEGER NOT NULL
);

You know the "hardcoded IN" and "JOIN" versions, here is the sub-select one
:

SELECT PRES.PRODUCTID
FROM PRESENCE PRES
WHERE
PRES.VALUEDATE >= '2003-09-01'
and PRES.VALUEDATE < '2003-11-01'
and PRES.OUTLETID IN (select id from sys_idlists where groupid = 1)
and PRES.STATUS IN (select id from sys_idlists where groupid = 2)

As you will see in the below figures, I also tested an indexed version of
"sys_idlists". It speeds up subselects but slowers JOINs.

Summary of perf :

"Hardcoded IN" queries

naked : 541ms |
distinct : 451ms |
group by : 581ms |

"JOIN"

naked : 922ms | Idx/1s 232ms (#1)
distinct : 711ms | idx/1s 242ms (#2)
group by : 711ms |

"Subselect"

naked : 16s | idx/8s 312ms (#3)
distinct : 16s | idx/8s (#4)
group by : 18s |

"Subselect" plans

PLAN (SYS_IDLISTS NATURAL) PLAN (SYS_IDLISTS NATURAL) PLAN (PRES INDEX
(PRESENCE_IDX1))
PLAN (SYS_IDLISTS NATURAL) PLAN (SYS_IDLISTS NATURAL) PLAN SORT ((PRES INDEX
(PRESENCE_IDX1)))
PLAN (SYS_IDLISTS NATURAL) PLAN (SYS_IDLISTS NATURAL) PLAN (PRES ORDER
RELATION_296)

Plans with indexed sys_idlists

(#1)
PLAN JOIN (L1 INDEX (SYS_IDLISTS_IDX_IDX1),L2
INDEX (SYS_IDLISTS_IDX_IDX1),PRES INDEX (RELATION_262))
(#2)
PLAN SORT (JOIN (L1 INDEX (SYS_IDLISTS_IDX_IDX1),L2
INDEX (SYS_IDLISTS_IDX_IDX1),PRES INDEX (RELATION_262)))
(#3)
PLAN (SYS_IDLISTS_IDX INDEX (SYS_IDLISTS_IDX_IDX2,SYS_IDLISTS_IDX_IDX1))
PLAN (SYS_IDLISTS_IDX INDEX (SYS_IDLISTS_IDX_IDX2,SYS_IDLISTS_IDX_IDX1))
PLAN (PRES INDEX (PRESENCE_IDX1))
(#4)
PLAN (SYS_IDLISTS_IDX INDEX (SYS_IDLISTS_IDX_IDX2,SYS_IDLISTS_IDX_IDX1))
PLAN (SYS_IDLISTS_IDX INDEX (SYS_IDLISTS_IDX_IDX2,SYS_IDLISTS_IDX_IDX1))
PLAN SORT ((PRES INDEX (PRESENCE_IDX1)))

> > Actually, my ultimate goal is to have parameterized queries. I could

build

> > the SQL dynamically, but I also need to apply additional WHERE

constraints

> > on those queries sometimes. Currently I'm using selectable SPs building
> the
> > IN predicates from string params and calling EXECUTE STATEMENT.
> > Works great, but not very clean...
> > That's why I thought to that sys_idlists table filled (and emptied in

the

> > same tx) with the IDs required for my IN predicates.
>
> Okay, should work fine i think.

Well, it "works". It's just that's slower right now. (451ms vs 711ms or
451ms vs 8s).

> If you've a index on that table it's also
> interesting to clean up the entries by running "SELECT Count(*) FROM
> sys_idlists" in a new transaction after you finished the running
> transaction.

Ok, what would be that for ? GC ?

Thanks you very much for help. I hope I was clear enough.

Regards.

--
Jerome