Subject | RE: [firebird-support] Terrible left join performance |
---|---|
Author | Svein Erling Tysvær |
Post date | 2007-10-22T11:36:07Z |
Hi Richard!
I don't think your query is sensible - that is, unless "zip_codes"."Zip" isn't unique and you want "vw_ExportForTableau"."CapRate" to be given extra weight in case of multiple "zip_codes"."Zip" (i.e. if you have one row with "CapRate" 4 and one with "CapRate" 1 and the latter linking to two "Zip"s then you would want the average to be 2 ((4+1+1)/3), not 2.5). Though I find this scenario so strange that I simply expect the reference to "zip_codes" to be irrelevant in this case, but possibly more relevant in a similar situation that you just tried to simplify.
Let me rewrite your statement to be more easily readable (at least for me):
SELECT vw."Main Type" AS "none:Main Type:nk",
vw."Offered_dt" AS "none:Offered_dt:qk",
AVG(CAST(vw."CapRate" AS DOUBLE PRECISION)) AS "avg:CapRate:qk"
FROM "vw_ExportForTableau$" vw
LEFT JOIN "zip_codes$" zip ON vw."Zip_cd" = zip."Zip"
WHERE NOT vw."Main Type" IN ('Dev Site', 'Mixed')
AND vw."Offered_dt" BETWEEN '1.1.2006' AND '31.12.2006'
GROUP BY 1, 2
Note that I did change from EXTRACT to BETWEEN. This is because BETWEEN can use an index (with Fb 2 you could alternatively use an expression index).
I also removed lots of parenthesis for readability, and made table aliases case insensitive.
Excepting the possible use of an index for vw."Offered_dt", this should use the same indexes as your original query. What I don't understand is that the original query didn't use an index for zip."Zip". Isn't this field indexed or haven't the index statistics been updated since your last massive insert/delete into/from this table? (in your DDL scripts you haven't mentioned this important index). Unless vw is a huge table (i.e. millions of rows), it isn't all too bad using NATURAL on this table. However, using NATURAL on the zip table as well, destroys the performance if there are more than a handful of records in the zip table.
I don't think the index for vw."Main Type" will be used, but since you are comparing to fixed values rather than a subselect, I don't (yet) agree with Adam and don't think there's any point in changing to a NOT EXISTS ('yet' indicating that I don't use Fb 2.0 myself, and that he might have knowledge that I have missed). The index for "Offered_dt" will only be useful if there are lots of records that aren't from 2006.
At the conference last week, I saw a statement that had been rewritten from
FROM "vw_ExportForTableau$" vw
LEFT JOIN "zip_codes$" zip ON vw."Zip_cd" = zip."Zip"
to
FROM "vw_ExportForTableau$" vw
LEFT JOIN (select * from "zip_codes$" order by "Zip") zip ON vw."Zip_cd" = zip."Zip"
(or similar, it was at least an inner join rather than a left join)
but even though that made it a bit quicker in the particular case where it had been measured, Arno (I asked mr. Optimizer a bit later) said he didn't think this was a general way to speed up queries. Moreover, it makes things considerably harder to read, so I wouldn't recommend to try this unless every millisecond of performance counts.
And my guess is that the main reason for your performance troubles are as simple as a missing index for zip."Zip".
HTH,
Set
-----Original Message-----
From: firebird-support@yahoogroups.com [mailto:firebird-support@yahoogroups.com] On Behalf Of Richard Wesley
Sent: 20. oktober 2007 00:04
To: firebird-support@yahoogroups.com
Subject: [firebird-support] Terrible left join performance
Hi All -
I have just been given a case where Firebird falls off a cliff
compared to Excel (using Jet). The culprit seems to be a left join.
SELECT ("vw_ExportForTableau"."Main Type") AS "none:Main Type:nk",
("vw_ExportForTableau"."Offered_dt") AS "none:Offered_dt:qk",
(AVG(CAST("vw_ExportForTableau"."CapRate" AS DOUBLE PRECISION)))
AS "avg:CapRate:qk"
FROM "vw_ExportForTableau$" "vw_ExportForTableau"
LEFT JOIN "zip_codes$" "zip_codes" ON
("vw_ExportForTableau"."Zip_cd" = "zip_codes"."Zip")
WHERE ((NOT (("vw_ExportForTableau"."Main Type") IN ( 'Dev Site',
'Mixed')) ) AND ((EXTRACT(YEAR FROM
"vw_ExportForTableau"."Offered_dt")) = 2006) )
GROUP BY 1,
2;
<PLAN>
PLAN SORT (JOIN (vw_ExportForTableau NATURAL, zip_codes NATURAL))
</PLAN>
[Time] Running the query took 272.2820 sec.
[Time] Getting the records took 0.0450 sec.
[Count] Query returned 1379 records.
By comparison, Excel executes the same query on the same data in half
a second:
SELECT ([_TableauRelation_].[Main Type]) AS [none:Main Type:nk],
([_TableauRelation_].[Offered_dt]) AS [none:Offered_dt:qk],
(AVG([_TableauRelation_].[CapRate])) AS [avg:CapRate:qk]
FROM (
SELECT [vw_ExportForTableau].[PercentOfAsk] AS [PercentOfAsk],
<long list of fields because of Jet's join rules...>
FROM [vw_ExportForTableau$] [vw_ExportForTableau]
LEFT JOIN [zip_codes$] [zip_codes] ON ([vw_ExportForTableau].
[Zip_cd] = [zip_codes].[Zip])
) [_TableauRelation_]
WHERE ((NOT (([_TableauRelation_].[Main Type]) IN ('Dev Site',
'Mixed')) ) AND ((DATEPART('yyyy',[_TableauRelation_].[Offered_dt]))
= 2006) )
GROUP BY ([_TableauRelation_].[Main Type]),
([_TableauRelation_].[Offered_dt]);
[Time] Running the query took 0.4285 sec.
[Time] Getting the records took 0.0164 sec.
[Count] Query returned 1379 records.
To make it worse, the Firebird tables are fully indexed.
I can't give the full schema as it is customer data, but one table is
a generic zip code table and looks like this:
CREATE TABLE "zip_codes$" (
"City" VARCHAR(26) CHARACTER SET UTF8 COLLATE UNICODE,
"County" VARCHAR(25) CHARACTER SET UTF8 COLLATE UNICODE,
"Latitude" FLOAT(53),
"Longitude" FLOAT(53),
"State" VARCHAR(2) CHARACTER SET UTF8 COLLATE UNICODE,
"Zip Class" VARCHAR(11) CHARACTER SET UTF8 COLLATE UNICODE,
"Zip" FLOAT(53)
);
INSERT INTO "zip_codes$"
("City", "County", "Latitude", "Longitude", "State", "Zip
Class", "Zip")
VALUES(?, ?, ?, ?, ?, ?, ?);
Inserted 42741 rows in 6.426 seconds
The other table is a very wide and cutting the fields down to those
in the query:
CREATE TABLE "vw_ExportForTableau$" (
"CapRate" FLOAT(53),
"Main Type" VARCHAR(10) CHARACTER SET UTF8 COLLATE UNICODE,
"Offered_dt" TIMESTAMP,
"Zip_cd" FLOAT(53)
);
INSERT INTO "vw_ExportForTableau$"
("CapRate", "Main Type", "Offered_dt", "Zip_cd")
VALUES(?, ?, ?, ?)
Inserted 36630 rows in 37.560 seconds
CREATE INDEX "_tidx_128_31a" ON "vw_ExportForTableau$" ("Offered_dt");
CREATE INDEX "_tidx_128_22a" ON "vw_ExportForTableau$" ("Main Type");
Any ideas on why this is so much slower? Could it be the float field
used for the join? No better in 2.1b2.
I don't think your query is sensible - that is, unless "zip_codes"."Zip" isn't unique and you want "vw_ExportForTableau"."CapRate" to be given extra weight in case of multiple "zip_codes"."Zip" (i.e. if you have one row with "CapRate" 4 and one with "CapRate" 1 and the latter linking to two "Zip"s then you would want the average to be 2 ((4+1+1)/3), not 2.5). Though I find this scenario so strange that I simply expect the reference to "zip_codes" to be irrelevant in this case, but possibly more relevant in a similar situation that you just tried to simplify.
Let me rewrite your statement to be more easily readable (at least for me):
SELECT vw."Main Type" AS "none:Main Type:nk",
vw."Offered_dt" AS "none:Offered_dt:qk",
AVG(CAST(vw."CapRate" AS DOUBLE PRECISION)) AS "avg:CapRate:qk"
FROM "vw_ExportForTableau$" vw
LEFT JOIN "zip_codes$" zip ON vw."Zip_cd" = zip."Zip"
WHERE NOT vw."Main Type" IN ('Dev Site', 'Mixed')
AND vw."Offered_dt" BETWEEN '1.1.2006' AND '31.12.2006'
GROUP BY 1, 2
Note that I did change from EXTRACT to BETWEEN. This is because BETWEEN can use an index (with Fb 2 you could alternatively use an expression index).
I also removed lots of parenthesis for readability, and made table aliases case insensitive.
Excepting the possible use of an index for vw."Offered_dt", this should use the same indexes as your original query. What I don't understand is that the original query didn't use an index for zip."Zip". Isn't this field indexed or haven't the index statistics been updated since your last massive insert/delete into/from this table? (in your DDL scripts you haven't mentioned this important index). Unless vw is a huge table (i.e. millions of rows), it isn't all too bad using NATURAL on this table. However, using NATURAL on the zip table as well, destroys the performance if there are more than a handful of records in the zip table.
I don't think the index for vw."Main Type" will be used, but since you are comparing to fixed values rather than a subselect, I don't (yet) agree with Adam and don't think there's any point in changing to a NOT EXISTS ('yet' indicating that I don't use Fb 2.0 myself, and that he might have knowledge that I have missed). The index for "Offered_dt" will only be useful if there are lots of records that aren't from 2006.
At the conference last week, I saw a statement that had been rewritten from
FROM "vw_ExportForTableau$" vw
LEFT JOIN "zip_codes$" zip ON vw."Zip_cd" = zip."Zip"
to
FROM "vw_ExportForTableau$" vw
LEFT JOIN (select * from "zip_codes$" order by "Zip") zip ON vw."Zip_cd" = zip."Zip"
(or similar, it was at least an inner join rather than a left join)
but even though that made it a bit quicker in the particular case where it had been measured, Arno (I asked mr. Optimizer a bit later) said he didn't think this was a general way to speed up queries. Moreover, it makes things considerably harder to read, so I wouldn't recommend to try this unless every millisecond of performance counts.
And my guess is that the main reason for your performance troubles are as simple as a missing index for zip."Zip".
HTH,
Set
-----Original Message-----
From: firebird-support@yahoogroups.com [mailto:firebird-support@yahoogroups.com] On Behalf Of Richard Wesley
Sent: 20. oktober 2007 00:04
To: firebird-support@yahoogroups.com
Subject: [firebird-support] Terrible left join performance
Hi All -
I have just been given a case where Firebird falls off a cliff
compared to Excel (using Jet). The culprit seems to be a left join.
SELECT ("vw_ExportForTableau"."Main Type") AS "none:Main Type:nk",
("vw_ExportForTableau"."Offered_dt") AS "none:Offered_dt:qk",
(AVG(CAST("vw_ExportForTableau"."CapRate" AS DOUBLE PRECISION)))
AS "avg:CapRate:qk"
FROM "vw_ExportForTableau$" "vw_ExportForTableau"
LEFT JOIN "zip_codes$" "zip_codes" ON
("vw_ExportForTableau"."Zip_cd" = "zip_codes"."Zip")
WHERE ((NOT (("vw_ExportForTableau"."Main Type") IN ( 'Dev Site',
'Mixed')) ) AND ((EXTRACT(YEAR FROM
"vw_ExportForTableau"."Offered_dt")) = 2006) )
GROUP BY 1,
2;
<PLAN>
PLAN SORT (JOIN (vw_ExportForTableau NATURAL, zip_codes NATURAL))
</PLAN>
[Time] Running the query took 272.2820 sec.
[Time] Getting the records took 0.0450 sec.
[Count] Query returned 1379 records.
By comparison, Excel executes the same query on the same data in half
a second:
SELECT ([_TableauRelation_].[Main Type]) AS [none:Main Type:nk],
([_TableauRelation_].[Offered_dt]) AS [none:Offered_dt:qk],
(AVG([_TableauRelation_].[CapRate])) AS [avg:CapRate:qk]
FROM (
SELECT [vw_ExportForTableau].[PercentOfAsk] AS [PercentOfAsk],
<long list of fields because of Jet's join rules...>
FROM [vw_ExportForTableau$] [vw_ExportForTableau]
LEFT JOIN [zip_codes$] [zip_codes] ON ([vw_ExportForTableau].
[Zip_cd] = [zip_codes].[Zip])
) [_TableauRelation_]
WHERE ((NOT (([_TableauRelation_].[Main Type]) IN ('Dev Site',
'Mixed')) ) AND ((DATEPART('yyyy',[_TableauRelation_].[Offered_dt]))
= 2006) )
GROUP BY ([_TableauRelation_].[Main Type]),
([_TableauRelation_].[Offered_dt]);
[Time] Running the query took 0.4285 sec.
[Time] Getting the records took 0.0164 sec.
[Count] Query returned 1379 records.
To make it worse, the Firebird tables are fully indexed.
I can't give the full schema as it is customer data, but one table is
a generic zip code table and looks like this:
CREATE TABLE "zip_codes$" (
"City" VARCHAR(26) CHARACTER SET UTF8 COLLATE UNICODE,
"County" VARCHAR(25) CHARACTER SET UTF8 COLLATE UNICODE,
"Latitude" FLOAT(53),
"Longitude" FLOAT(53),
"State" VARCHAR(2) CHARACTER SET UTF8 COLLATE UNICODE,
"Zip Class" VARCHAR(11) CHARACTER SET UTF8 COLLATE UNICODE,
"Zip" FLOAT(53)
);
INSERT INTO "zip_codes$"
("City", "County", "Latitude", "Longitude", "State", "Zip
Class", "Zip")
VALUES(?, ?, ?, ?, ?, ?, ?);
Inserted 42741 rows in 6.426 seconds
The other table is a very wide and cutting the fields down to those
in the query:
CREATE TABLE "vw_ExportForTableau$" (
"CapRate" FLOAT(53),
"Main Type" VARCHAR(10) CHARACTER SET UTF8 COLLATE UNICODE,
"Offered_dt" TIMESTAMP,
"Zip_cd" FLOAT(53)
);
INSERT INTO "vw_ExportForTableau$"
("CapRate", "Main Type", "Offered_dt", "Zip_cd")
VALUES(?, ?, ?, ?)
Inserted 36630 rows in 37.560 seconds
CREATE INDEX "_tidx_128_31a" ON "vw_ExportForTableau$" ("Offered_dt");
CREATE INDEX "_tidx_128_22a" ON "vw_ExportForTableau$" ("Main Type");
Any ideas on why this is so much slower? Could it be the float field
used for the join? No better in 2.1b2.