Subject | Slowdown with date comparison in query |
---|---|
Author | Rajesh Punjabi |
Post date | 2004-12-23T15:44:43Z |
Hi Everyone,
I am using FB 1.5 on RH Linux 8 with 2.4.18 kernel. My hardware is a AMD
Athlon 1800+ XP box with 1.5GB RAM and a 36 GB SCSI hdd.
My db size is about 1.2 GB split into seven files currently. Page size
is 8 KB.
The following query runs fine. It returns in about 5 secs. However the
moment I add a date field comparison it goes for a toss. I have
especially found that using containing instead of like significantly
improves the performance of this particular query. All fields in this
query have been indexed and I keep dropping and recreating the indexes
every week at the beginning of the week for safety sake.
Here is the query that runs fine :
select P.PRFCD, P.NM, (select S.SKILL_TYPE from SKILLS S where S.SKLCD =
P.PRFSKLCD) DESIG, P.SEX , P.CITY ,
(select first 1 T.CONTACT_VALUE from PROF_CONTACT T where T.PRFCD =
P.PRFCD and T.CONTACT_TYPE containing 'Off Tel') OFFTEL,
(select first 1 T.CONTACT_VALUE from PROF_CONTACT T where T.PRFCD =
P.PRFCD and T.CONTACT_TYPE containing 'Res Tel') TELRES,
(select first 1 T.CONTACT_VALUE from PROF_CONTACT T where T.PRFCD =
P.PRFCD and T.CONTACT_TYPE containing 'Mobile') CELLNO, P.PRFQUAL ,
(select first 1 T.CONTACT_VALUE from PROF_CONTACT T where T.PRFCD =
P.PRFCD and T.CONTACT_TYPE containing 'E-mail') PRFMAIL1,
(select first 1 skip 1 T.CONTACT_VALUE from PROF_CONTACT T where T.PRFCD
= P.PRFCD and T.CONTACT_TYPE containing 'E-mail') PRFMAIL2,
(select first 1 skip 2 T.CONTACT_VALUE from PROF_CONTACT T where T.PRFCD
= P.PRFCD and T.CONTACT_TYPE containing 'E-mail') PRFMAIL3, P.PRF_DT,
(select count(B.INTID) from BDFLOW B where B.FPRFCD = P.PRFCD) INT_CNT,
(select count(t.contactid) from prof_contact t where t.prfcd = p.prfcd
and t.CONTACT_TYPE containing 'E-mail') eid_cnter_all,
(select count(t.contactid) from prof_contact t where t.prfcd = p.prfcd
and t.CONTACT_TYPE containing 'E-mail' and (t.status = 'L' or t.status =
'B')) eid_cnter_bad, P.PRFSKLCD
from PROFESSIONAL P
Where
exists (select B.INTID from BDFLOW B where B.fprfcd = p.prfcd and
B.OFR_CD = 9878 and B.RESP_CD containing '0029' and B.DISCUSS containing
'Proposal')
group by P.PRFCD, P.NM, P.SEX , P.CITY , P.PRFQUAL , P.PRF_DT, P.PRFSKLCD
order by p.nm, p.prfcd
Explanation : There are three tables - PROFESSIONAL P, BDFLOW B,
PROF_CONTACT T
PROFESSIONAL is a master with entries of individuals
PROF_CONTACT contains the contact details like E-mail, Res Tel, Off Tel
etc of each professional
BDFLOW contains workflow entries for each professional OFR_CD an
integer, RESP_CD is a string (Varchar 32) and DISCUSS is basically remarks
SKILLS is a lookup table for cataloging purposes. (Removing or including
it has no affect whatsoever)
Any PROFESSIONAL (PK - PRFCD)can have multiple entries in BDFLOW (PK -
INTID, FK - FPRFCD)
Any PROFESSIONAL (PK - PRFCD)can have multiple entries in PROF_CONTACT
(PK - CONTACTID, FK - PRFCD)
Now this query runs in less than 10 secs for all practical purposes.
When only one field in BDFLOW is added which is SPOKEN_DATE -
(Timestamp) the whole query performance goes for a toss. The problematic
query looks like this :
select P.PRFCD, P.NM, (select S.SKILL_TYPE from SKILLS S where S.SKLCD =
P.PRFSKLCD) DESIG, P.SEX , P.CITY ,
(select first 1 T.CONTACT_VALUE from PROF_CONTACT T where T.PRFCD =
P.PRFCD and T.CONTACT_TYPE containing 'Off Tel') OFFTEL,
(select first 1 T.CONTACT_VALUE from PROF_CONTACT T where T.PRFCD =
P.PRFCD and T.CONTACT_TYPE containing 'Res Tel') TELRES,
(select first 1 T.CONTACT_VALUE from PROF_CONTACT T where T.PRFCD =
P.PRFCD and T.CONTACT_TYPE containing 'Mobile') CELLNO, P.PRFQUAL ,
(select first 1 T.CONTACT_VALUE from PROF_CONTACT T where T.PRFCD =
P.PRFCD and T.CONTACT_TYPE containing 'E-mail') PRFMAIL1,
(select first 1 skip 1 T.CONTACT_VALUE from PROF_CONTACT T where T.PRFCD
= P.PRFCD and T.CONTACT_TYPE containing 'E-mail') PRFMAIL2,
(select first 1 skip 2 T.CONTACT_VALUE from PROF_CONTACT T where T.PRFCD
= P.PRFCD and T.CONTACT_TYPE containing 'E-mail') PRFMAIL3, P.PRF_DT,
(select count(B.INTID) from BDFLOW B where B.FPRFCD = P.PRFCD) INT_CNT,
(select count(t.contactid) from prof_contact t where t.prfcd = p.prfcd
and t.CONTACT_TYPE containing 'E-mail') eid_cnter_all,
(select count(t.contactid) from prof_contact t where t.prfcd = p.prfcd
and t.CONTACT_TYPE containing 'E-mail' and (t.status = 'L' or t.status =
'B')) eid_cnter_bad, P.PRFSKLCD
from PROFESSIONAL P
Where
exists (select B.INTID from BDFLOW B where B.fprfcd = p.prfcd and
B.OFR_CD = 9878 and B.RESP_CD containing '0029' and B.DISCUSS containing
'Proposal' _and B.SPOKEN_DATE > '12/01/2004 00:00:00'_)
group by P.PRFCD, P.NM, P.SEX , P.CITY , P.PRFQUAL , P.PRF_DT, P.PRFSKLCD
order by p.nm, p.prfcd
Now the query takes about 20 secs to return. There are only a few
hundred entries after 12/01/2004 (December 1, 2004). If the query is
changed to compare dates before December 1, 2004 (B.SPOKEN_DATE <
'12/01/2004 00:00:00') the performance degrade is worse. (About 480 secs
and more). In both cases the total results thrown are the same Approx 34
records. (i.e. 34 people have this kind of workflow entries before
december 1, 2004 and nearly the same number have them after that date)
There exists some overlap also amongst these people.
The database counts are somewhat like this :
PROFESSIONAL ~ 75,000 records total
PROF_CONTACT ~ 225,000 records total
BDFLOW ~ 285,000 records before December 1, 2004 & ~ 15,000 records
after December 1, 2004
It appears prima facie that FB goes into a tailspin with a large volume
to compute especially with date comparison. The same does not hold true
for string and integer values. Even if I bring the exists clause down to :
select B.INTID from BDFLOW B where B.fprfcd = p.prfcd and B.SPOKEN_DATE
am running the query from isql prompt on the server directly to test and
reporting these timings so network problems etc. are anyways ruled out.
Incidentally I have a date (PRF_DT) in the professional table which if
brought into the comparison does not cause the query to degrade. only
date entries in the workflow detail table (BDFLOW) cause this problem.
I have sufficient free RAM on the server and the CPU goes consistently
upto 98% utilization when any of the above queries are being executed.
The output of show database; is as follows :
SQL> show database;
Database: /home/kms/hr.fdb
Owner: SYSDBA
File 1: "/home/kms/hr2.fdb", length 0, start 25001
File 2: "/home/kms/hr3.fdb", length 0, start 50001
File 3: "/home/kms/hr4.fdb", length 0, start 75001
File 4: "/home/kms/hr5.fdb", length 0, start 100001
File 5: "/home/kms/hr6.fdb", length 0, start 125001
File 6: "/home/kms/hr7.fdb", length 0, start 150001
PAGE_SIZE 8192
Number of DB pages allocated = 151861
Sweep interval = 20000
Forced Writes are ON
Transaction - oldest = 879447
Transaction - oldest active = 879448
Transaction - oldest snapshot = 879448
Transaction - Next = 879451
Default Character set: NONE
The relevant output of ./gstat is as follows :
PROFESSIONAL (134)
Primary pointer page: 156, Index root page: 157
Data pages: 33602, data page slots: 33649, average fill: 76%
Fill distribution:
0 - 19% = 1174
20 - 39% = 811
40 - 59% = 4515
60 - 79% = 9700
80 - 99% = 17402
BDFLOW (136)
Primary pointer page: 160, Index root page: 161
Data pages: 5525, data page slots: 5525, average fill: 84%
Fill distribution:
0 - 19% = 0
20 - 39% = 0
40 - 59% = 0
60 - 79% = 1
80 - 99% = 5524
_(This is the spoken date index stats )_
Index I_SPOKENDATE_IDX (6)
Depth: 2, leaf buckets: 259, nodes: 298497
Average data length: 1.00, total dup: 192167, max dup: 20765
Fill distribution:
0 - 19% = 0
20 - 39% = 0
40 - 59% = 1
60 - 79% = 0
80 - 99% = 258
PROF_CONTACT (147)
Primary pointer page: 182, Index root page: 183
Data pages: 3487, data page slots: 3487, average fill: 82%
Fill distribution:
0 - 19% = 1
20 - 39% = 9
40 - 59% = 13
60 - 79% = 51
80 - 99% = 3413
I have tried casting the date and using containing instead of comparison
operator for the spoke_date. Both do not logically serve the purpose of
the query but have no effect on the query performance and it remains
poor. Let me know if there is a problem with my query and I need to
change something or I am missing something somewhere or there is some
other problem here. Can anyone help ?
Warm regards,
RP
+91-9820137720
www.initiaworld.com
I am using FB 1.5 on RH Linux 8 with 2.4.18 kernel. My hardware is a AMD
Athlon 1800+ XP box with 1.5GB RAM and a 36 GB SCSI hdd.
My db size is about 1.2 GB split into seven files currently. Page size
is 8 KB.
The following query runs fine. It returns in about 5 secs. However the
moment I add a date field comparison it goes for a toss. I have
especially found that using containing instead of like significantly
improves the performance of this particular query. All fields in this
query have been indexed and I keep dropping and recreating the indexes
every week at the beginning of the week for safety sake.
Here is the query that runs fine :
select P.PRFCD, P.NM, (select S.SKILL_TYPE from SKILLS S where S.SKLCD =
P.PRFSKLCD) DESIG, P.SEX , P.CITY ,
(select first 1 T.CONTACT_VALUE from PROF_CONTACT T where T.PRFCD =
P.PRFCD and T.CONTACT_TYPE containing 'Off Tel') OFFTEL,
(select first 1 T.CONTACT_VALUE from PROF_CONTACT T where T.PRFCD =
P.PRFCD and T.CONTACT_TYPE containing 'Res Tel') TELRES,
(select first 1 T.CONTACT_VALUE from PROF_CONTACT T where T.PRFCD =
P.PRFCD and T.CONTACT_TYPE containing 'Mobile') CELLNO, P.PRFQUAL ,
(select first 1 T.CONTACT_VALUE from PROF_CONTACT T where T.PRFCD =
P.PRFCD and T.CONTACT_TYPE containing 'E-mail') PRFMAIL1,
(select first 1 skip 1 T.CONTACT_VALUE from PROF_CONTACT T where T.PRFCD
= P.PRFCD and T.CONTACT_TYPE containing 'E-mail') PRFMAIL2,
(select first 1 skip 2 T.CONTACT_VALUE from PROF_CONTACT T where T.PRFCD
= P.PRFCD and T.CONTACT_TYPE containing 'E-mail') PRFMAIL3, P.PRF_DT,
(select count(B.INTID) from BDFLOW B where B.FPRFCD = P.PRFCD) INT_CNT,
(select count(t.contactid) from prof_contact t where t.prfcd = p.prfcd
and t.CONTACT_TYPE containing 'E-mail') eid_cnter_all,
(select count(t.contactid) from prof_contact t where t.prfcd = p.prfcd
and t.CONTACT_TYPE containing 'E-mail' and (t.status = 'L' or t.status =
'B')) eid_cnter_bad, P.PRFSKLCD
from PROFESSIONAL P
Where
exists (select B.INTID from BDFLOW B where B.fprfcd = p.prfcd and
B.OFR_CD = 9878 and B.RESP_CD containing '0029' and B.DISCUSS containing
'Proposal')
group by P.PRFCD, P.NM, P.SEX , P.CITY , P.PRFQUAL , P.PRF_DT, P.PRFSKLCD
order by p.nm, p.prfcd
Explanation : There are three tables - PROFESSIONAL P, BDFLOW B,
PROF_CONTACT T
PROFESSIONAL is a master with entries of individuals
PROF_CONTACT contains the contact details like E-mail, Res Tel, Off Tel
etc of each professional
BDFLOW contains workflow entries for each professional OFR_CD an
integer, RESP_CD is a string (Varchar 32) and DISCUSS is basically remarks
SKILLS is a lookup table for cataloging purposes. (Removing or including
it has no affect whatsoever)
Any PROFESSIONAL (PK - PRFCD)can have multiple entries in BDFLOW (PK -
INTID, FK - FPRFCD)
Any PROFESSIONAL (PK - PRFCD)can have multiple entries in PROF_CONTACT
(PK - CONTACTID, FK - PRFCD)
Now this query runs in less than 10 secs for all practical purposes.
When only one field in BDFLOW is added which is SPOKEN_DATE -
(Timestamp) the whole query performance goes for a toss. The problematic
query looks like this :
select P.PRFCD, P.NM, (select S.SKILL_TYPE from SKILLS S where S.SKLCD =
P.PRFSKLCD) DESIG, P.SEX , P.CITY ,
(select first 1 T.CONTACT_VALUE from PROF_CONTACT T where T.PRFCD =
P.PRFCD and T.CONTACT_TYPE containing 'Off Tel') OFFTEL,
(select first 1 T.CONTACT_VALUE from PROF_CONTACT T where T.PRFCD =
P.PRFCD and T.CONTACT_TYPE containing 'Res Tel') TELRES,
(select first 1 T.CONTACT_VALUE from PROF_CONTACT T where T.PRFCD =
P.PRFCD and T.CONTACT_TYPE containing 'Mobile') CELLNO, P.PRFQUAL ,
(select first 1 T.CONTACT_VALUE from PROF_CONTACT T where T.PRFCD =
P.PRFCD and T.CONTACT_TYPE containing 'E-mail') PRFMAIL1,
(select first 1 skip 1 T.CONTACT_VALUE from PROF_CONTACT T where T.PRFCD
= P.PRFCD and T.CONTACT_TYPE containing 'E-mail') PRFMAIL2,
(select first 1 skip 2 T.CONTACT_VALUE from PROF_CONTACT T where T.PRFCD
= P.PRFCD and T.CONTACT_TYPE containing 'E-mail') PRFMAIL3, P.PRF_DT,
(select count(B.INTID) from BDFLOW B where B.FPRFCD = P.PRFCD) INT_CNT,
(select count(t.contactid) from prof_contact t where t.prfcd = p.prfcd
and t.CONTACT_TYPE containing 'E-mail') eid_cnter_all,
(select count(t.contactid) from prof_contact t where t.prfcd = p.prfcd
and t.CONTACT_TYPE containing 'E-mail' and (t.status = 'L' or t.status =
'B')) eid_cnter_bad, P.PRFSKLCD
from PROFESSIONAL P
Where
exists (select B.INTID from BDFLOW B where B.fprfcd = p.prfcd and
B.OFR_CD = 9878 and B.RESP_CD containing '0029' and B.DISCUSS containing
'Proposal' _and B.SPOKEN_DATE > '12/01/2004 00:00:00'_)
group by P.PRFCD, P.NM, P.SEX , P.CITY , P.PRFQUAL , P.PRF_DT, P.PRFSKLCD
order by p.nm, p.prfcd
Now the query takes about 20 secs to return. There are only a few
hundred entries after 12/01/2004 (December 1, 2004). If the query is
changed to compare dates before December 1, 2004 (B.SPOKEN_DATE <
'12/01/2004 00:00:00') the performance degrade is worse. (About 480 secs
and more). In both cases the total results thrown are the same Approx 34
records. (i.e. 34 people have this kind of workflow entries before
december 1, 2004 and nearly the same number have them after that date)
There exists some overlap also amongst these people.
The database counts are somewhat like this :
PROFESSIONAL ~ 75,000 records total
PROF_CONTACT ~ 225,000 records total
BDFLOW ~ 285,000 records before December 1, 2004 & ~ 15,000 records
after December 1, 2004
It appears prima facie that FB goes into a tailspin with a large volume
to compute especially with date comparison. The same does not hold true
for string and integer values. Even if I bring the exists clause down to :
select B.INTID from BDFLOW B where B.fprfcd = p.prfcd and B.SPOKEN_DATE
> '12/01/2004 00:00:00'the problem is still not solved and query times are still pathetic. I
am running the query from isql prompt on the server directly to test and
reporting these timings so network problems etc. are anyways ruled out.
Incidentally I have a date (PRF_DT) in the professional table which if
brought into the comparison does not cause the query to degrade. only
date entries in the workflow detail table (BDFLOW) cause this problem.
I have sufficient free RAM on the server and the CPU goes consistently
upto 98% utilization when any of the above queries are being executed.
The output of show database; is as follows :
SQL> show database;
Database: /home/kms/hr.fdb
Owner: SYSDBA
File 1: "/home/kms/hr2.fdb", length 0, start 25001
File 2: "/home/kms/hr3.fdb", length 0, start 50001
File 3: "/home/kms/hr4.fdb", length 0, start 75001
File 4: "/home/kms/hr5.fdb", length 0, start 100001
File 5: "/home/kms/hr6.fdb", length 0, start 125001
File 6: "/home/kms/hr7.fdb", length 0, start 150001
PAGE_SIZE 8192
Number of DB pages allocated = 151861
Sweep interval = 20000
Forced Writes are ON
Transaction - oldest = 879447
Transaction - oldest active = 879448
Transaction - oldest snapshot = 879448
Transaction - Next = 879451
Default Character set: NONE
The relevant output of ./gstat is as follows :
PROFESSIONAL (134)
Primary pointer page: 156, Index root page: 157
Data pages: 33602, data page slots: 33649, average fill: 76%
Fill distribution:
0 - 19% = 1174
20 - 39% = 811
40 - 59% = 4515
60 - 79% = 9700
80 - 99% = 17402
BDFLOW (136)
Primary pointer page: 160, Index root page: 161
Data pages: 5525, data page slots: 5525, average fill: 84%
Fill distribution:
0 - 19% = 0
20 - 39% = 0
40 - 59% = 0
60 - 79% = 1
80 - 99% = 5524
_(This is the spoken date index stats )_
Index I_SPOKENDATE_IDX (6)
Depth: 2, leaf buckets: 259, nodes: 298497
Average data length: 1.00, total dup: 192167, max dup: 20765
Fill distribution:
0 - 19% = 0
20 - 39% = 0
40 - 59% = 1
60 - 79% = 0
80 - 99% = 258
PROF_CONTACT (147)
Primary pointer page: 182, Index root page: 183
Data pages: 3487, data page slots: 3487, average fill: 82%
Fill distribution:
0 - 19% = 1
20 - 39% = 9
40 - 59% = 13
60 - 79% = 51
80 - 99% = 3413
I have tried casting the date and using containing instead of comparison
operator for the spoke_date. Both do not logically serve the purpose of
the query but have no effect on the query performance and it remains
poor. Let me know if there is a problem with my query and I need to
change something or I am missing something somewhere or there is some
other problem here. Can anyone help ?
Warm regards,
RP
+91-9820137720
www.initiaworld.com