Subject | Re: [firebird-support] IN, sub-select and select distinct |
---|---|
Author | Jerome Bouvattier |
Post date | 2004-03-12T14:28:01Z |
Arno,
;-)
Table "presence" contains 1.130.000 records
Table "sys_idlists" contains only the 9 required records
Queries recordcounts
with duplicates : 20856
without duplicates : 6939
DDLs :
CREATE TABLE PRESENCE (
ID INTEGER NOT NULL,
CLASSID D_PRES_CLASSID NOT NULL,
OUTLETID INTEGER NOT NULL,
PRODUCTID INTEGER,
SALEPRICE NUMERIC(9,2),
SALEPRICEINCVAT D_BOOLEAN DEFAULT 'Y' NOT NULL,
FACING SMALLINT,
VALUEDATE DATE NOT NULL,
STATUS D_PRES_STATUS NOT NULL,
RECORDLASTUPDATE TIMESTAMP NOT NULL,
RECORDLASTUPDATEBY INTEGER NOT NULL
);
Separate indexes are set on ID, OUTLETID, PRODUCTID, VALUEDATE
CREATE TABLE SYS_IDLISTSs (
GROUPID INTEGER NOT NULL,
ID INTEGER NOT NULL
);
You know the "hardcoded IN" and "JOIN" versions, here is the sub-select one
:
SELECT PRES.PRODUCTID
FROM PRESENCE PRES
WHERE
PRES.VALUEDATE >= '2003-09-01'
and PRES.VALUEDATE < '2003-11-01'
and PRES.OUTLETID IN (select id from sys_idlists where groupid = 1)
and PRES.STATUS IN (select id from sys_idlists where groupid = 2)
As you will see in the below figures, I also tested an indexed version of
"sys_idlists". It speeds up subselects but slowers JOINs.
Summary of perf :
"Hardcoded IN" queries
naked : 541ms |
distinct : 451ms |
group by : 581ms |
"JOIN"
naked : 922ms | Idx/1s 232ms (#1)
distinct : 711ms | idx/1s 242ms (#2)
group by : 711ms |
"Subselect"
naked : 16s | idx/8s 312ms (#3)
distinct : 16s | idx/8s (#4)
group by : 18s |
"Subselect" plans
PLAN (SYS_IDLISTS NATURAL) PLAN (SYS_IDLISTS NATURAL) PLAN (PRES INDEX
(PRESENCE_IDX1))
PLAN (SYS_IDLISTS NATURAL) PLAN (SYS_IDLISTS NATURAL) PLAN SORT ((PRES INDEX
(PRESENCE_IDX1)))
PLAN (SYS_IDLISTS NATURAL) PLAN (SYS_IDLISTS NATURAL) PLAN (PRES ORDER
RELATION_296)
Plans with indexed sys_idlists
(#1)
PLAN JOIN (L1 INDEX (SYS_IDLISTS_IDX_IDX1),L2
INDEX (SYS_IDLISTS_IDX_IDX1),PRES INDEX (RELATION_262))
(#2)
PLAN SORT (JOIN (L1 INDEX (SYS_IDLISTS_IDX_IDX1),L2
INDEX (SYS_IDLISTS_IDX_IDX1),PRES INDEX (RELATION_262)))
(#3)
PLAN (SYS_IDLISTS_IDX INDEX (SYS_IDLISTS_IDX_IDX2,SYS_IDLISTS_IDX_IDX1))
PLAN (SYS_IDLISTS_IDX INDEX (SYS_IDLISTS_IDX_IDX2,SYS_IDLISTS_IDX_IDX1))
PLAN (PRES INDEX (PRESENCE_IDX1))
(#4)
PLAN (SYS_IDLISTS_IDX INDEX (SYS_IDLISTS_IDX_IDX2,SYS_IDLISTS_IDX_IDX1))
PLAN (SYS_IDLISTS_IDX INDEX (SYS_IDLISTS_IDX_IDX2,SYS_IDLISTS_IDX_IDX1))
PLAN SORT ((PRES INDEX (PRESENCE_IDX1)))
451ms vs 8s).
Thanks you very much for help. I hope I was clear enough.
Regards.
--
Jerome
> > > > select pres.productidi
> > > > from sys_idlists L1
> > > > inner join presence pres on L1.id = pres.outletid
> > > > inner join sys_idlists L2 on pres.status = L2.id
> > > > where
> > > > L1.groupid = 1 and L2.groupid = 2
> > > > and pres.valuedate >= '2003-09-01'
> > > > and pres.valuedate < '2003-11-01'
> > > >
> > > > --> 481ms
> > > > --> 590ms with distinct
> > > > --> 590ms with group by
> > > >
> > > > Again, sys_idlists contains only the 9 required records.
> > >
> > > Also show us the PLANs.
> >
> > With fetchall, results become :
> >
> > "Explicit IN"
> >
> > naked : 541ms
> > distinct : 451ms
> > group by : 581ms
> >
> > Plans are :
> > PLAN (PRES INDEX (PRESENCE_IDX1,RELATION_262,RELATION_262,
> > RELATION_262,RELATION_262,RELATION_262,RELATION_262))
>
> > PLAN SORT ((PRES INDEX (PRESENCE_IDX1,RELATION_262,RELATION_262,
> > RELATION_262,RELATION_262,RELATION_262,RELATION_262)))
>
> In fact it is wierd that the "distinct" is faster as the "naked" one, but
> guess that's due the cache or the returning records are a huge differentmy
> between "naked" and "distinct" (then would sending/receiving records over
> the wire be expensive here).
>
> > PLAN (PRES ORDER RELATION_296)
>
> This explains why a GROUP BY is somewhat slower.
>
> > "JOIN" versions
> >
> > naked : 922ms
> > distinct : 711ms
> > group by : 711ms
> >
> > Plans are :
> > PLAN MERGE (SORT (L2 NATURAL),SORT (JOIN (L1 NATURAL,
> > PRES INDEX (RELATION_262))))
>
> > PLAN SORT (MERGE (SORT (L2 NATURAL),SORT (JOIN
> > (L1 NATURAL,PRES INDEX (RELATION_262)))))
>
> Same "strange" situation as without JOIN.
>
> > PLAN SORT (MERGE (SORT (L2 NATURAL),SORT (JOIN
> > (L1 NATURAL,PRES INDEX (RELATION_262)))))
>
> Now you've the same PLAN as the "distinct" and thus the speed is the same.
>
> > > > I can't believe there is no way to achieve the "explicit IN"
> performance
> > > > with a JOIN, a sub-select or anything else.
> > >
> > > A JOIN or sub-select needs always disk reads and there's no way to
> compare
> > > that with hard given constants.
> >
> > I understand but the sys_idlists table really only contains the ids for
> > IN predicates. So, I thought sub-selects would perform just as well asAh ! Good news. Hopefully I'll come up with the well-crafted statement then.
> hard
> > constants.
>
> In fact it should come very close.
;-)
> > ..PRES.OUTLETID IN (112191, 112193, 112194, 112200, 112201, 112205)details
> > ..pres.outletid in (select id from sys_idlists where groupid = 1)
> >
> > But results are worse than with JOINs. I'm above the 16s in each case.
>
> 16 seconds difference? What are both times?
> How many records does sys_idlists have? Could you provide some more
> about the tables in question. I'm interested in why the difference is soi
> huge, want i don't understand at the moment. Okay, some difference is what
> expected.Here are some more details..
Table "presence" contains 1.130.000 records
Table "sys_idlists" contains only the 9 required records
Queries recordcounts
with duplicates : 20856
without duplicates : 6939
DDLs :
CREATE TABLE PRESENCE (
ID INTEGER NOT NULL,
CLASSID D_PRES_CLASSID NOT NULL,
OUTLETID INTEGER NOT NULL,
PRODUCTID INTEGER,
SALEPRICE NUMERIC(9,2),
SALEPRICEINCVAT D_BOOLEAN DEFAULT 'Y' NOT NULL,
FACING SMALLINT,
VALUEDATE DATE NOT NULL,
STATUS D_PRES_STATUS NOT NULL,
RECORDLASTUPDATE TIMESTAMP NOT NULL,
RECORDLASTUPDATEBY INTEGER NOT NULL
);
Separate indexes are set on ID, OUTLETID, PRODUCTID, VALUEDATE
CREATE TABLE SYS_IDLISTSs (
GROUPID INTEGER NOT NULL,
ID INTEGER NOT NULL
);
You know the "hardcoded IN" and "JOIN" versions, here is the sub-select one
:
SELECT PRES.PRODUCTID
FROM PRESENCE PRES
WHERE
PRES.VALUEDATE >= '2003-09-01'
and PRES.VALUEDATE < '2003-11-01'
and PRES.OUTLETID IN (select id from sys_idlists where groupid = 1)
and PRES.STATUS IN (select id from sys_idlists where groupid = 2)
As you will see in the below figures, I also tested an indexed version of
"sys_idlists". It speeds up subselects but slowers JOINs.
Summary of perf :
"Hardcoded IN" queries
naked : 541ms |
distinct : 451ms |
group by : 581ms |
"JOIN"
naked : 922ms | Idx/1s 232ms (#1)
distinct : 711ms | idx/1s 242ms (#2)
group by : 711ms |
"Subselect"
naked : 16s | idx/8s 312ms (#3)
distinct : 16s | idx/8s (#4)
group by : 18s |
"Subselect" plans
PLAN (SYS_IDLISTS NATURAL) PLAN (SYS_IDLISTS NATURAL) PLAN (PRES INDEX
(PRESENCE_IDX1))
PLAN (SYS_IDLISTS NATURAL) PLAN (SYS_IDLISTS NATURAL) PLAN SORT ((PRES INDEX
(PRESENCE_IDX1)))
PLAN (SYS_IDLISTS NATURAL) PLAN (SYS_IDLISTS NATURAL) PLAN (PRES ORDER
RELATION_296)
Plans with indexed sys_idlists
(#1)
PLAN JOIN (L1 INDEX (SYS_IDLISTS_IDX_IDX1),L2
INDEX (SYS_IDLISTS_IDX_IDX1),PRES INDEX (RELATION_262))
(#2)
PLAN SORT (JOIN (L1 INDEX (SYS_IDLISTS_IDX_IDX1),L2
INDEX (SYS_IDLISTS_IDX_IDX1),PRES INDEX (RELATION_262)))
(#3)
PLAN (SYS_IDLISTS_IDX INDEX (SYS_IDLISTS_IDX_IDX2,SYS_IDLISTS_IDX_IDX1))
PLAN (SYS_IDLISTS_IDX INDEX (SYS_IDLISTS_IDX_IDX2,SYS_IDLISTS_IDX_IDX1))
PLAN (PRES INDEX (PRESENCE_IDX1))
(#4)
PLAN (SYS_IDLISTS_IDX INDEX (SYS_IDLISTS_IDX_IDX2,SYS_IDLISTS_IDX_IDX1))
PLAN (SYS_IDLISTS_IDX INDEX (SYS_IDLISTS_IDX_IDX2,SYS_IDLISTS_IDX_IDX1))
PLAN SORT ((PRES INDEX (PRESENCE_IDX1)))
> > Actually, my ultimate goal is to have parameterized queries. I couldbuild
> > the SQL dynamically, but I also need to apply additional WHEREconstraints
> > on those queries sometimes. Currently I'm using selectable SPs buildingthe
> the
> > IN predicates from string params and calling EXECUTE STATEMENT.
> > Works great, but not very clean...
> > That's why I thought to that sys_idlists table filled (and emptied in
> > same tx) with the IDs required for my IN predicates.Well, it "works". It's just that's slower right now. (451ms vs 711ms or
>
> Okay, should work fine i think.
451ms vs 8s).
> If you've a index on that table it's alsoOk, what would be that for ? GC ?
> interesting to clean up the entries by running "SELECT Count(*) FROM
> sys_idlists" in a new transaction after you finished the running
> transaction.
Thanks you very much for help. I hope I was clear enough.
Regards.
--
Jerome