Subject | update with subselect, speed / optimization (data repair) |
---|---|
Author | unordained |
Post date | 2003-07-22T00:04:56Z |
table layout (important parts)
tax_entities
- id (pk)
- ...
bills
- id (pk)
- provider_tax_id (fk to tax_entities.id)
- group_tax_id (fk to tax_entities.id)
- facility_tax_id (fk to tax_entities.id)
- actual_tax_id (fk to tax_entities.id)
- ...
transactions
- id (pk)
- bill_id (fk to bills.id)
- tax_entity_id (fk to tax_entities.id)
- ...
"actual_tax_id" indicates who should have been paid on the bill. it will be equal to one of the
other three fk values (they're present just FYI, basically.) at some point, some transactions were
given the wrong tax_entity_id: they received the value of provider_tax_id rather than
actual_tax_id, when the two were different.
the only case i want to fix is the following:
if transactions.tax_entity_id = bills.provider_tax_id on the associated bill, i want to reset the
fk to point to actual_tax_id. (it is extremely doubtful that a bill would have transactions on it
not intended for payment of the provider, yet somehow reference a provider mentioned in one of the
3 named fk's ... thus i'm not worried about this updating more rows than absolutely necessary:
those should just get reset to the original value.)
attempting the following statement ...
update transactions set transactions.tax_entity_id = (select bills.actual_tax_id from bills where
bills.id = transactions.bill_id) where transactions.tax_entity_id = (select bills.provider_tax_id
from bills where bills.id = transactions.bill_id);
... results in the following plan:
PLAN (BILLS INDEX (RDB$PRIMARY249))
PLAN (BILLS NATURAL)
PLAN (TRANSACTIONS NATURAL)
attempting to do it in two steps (set all of the values to be updated to -1 [with a dummy record in
tax_entities to satisfy constraints], then run a second update to fetch the correct fk value) still
results in a slow second pass (first one runs in about 2 seconds, second ... ran for about 2 hours
before i killed it.)
as follows:
update transactions set transactions.tax_entity_id = -1 where transactions.tax_entity_id = (select
bills.provider_tax_id from bills where bills.id = transactions.bill_id);
with plan:
PLAN (BILLS INDEX (RDB$PRIMARY249))
PLAN (TRANSACTIONS NATURAL)
then
update transactions set transactions.tax_entity_id = (select bills.actual_tax_id from bills where
bills.id = transactions.bill_id) where transactions.tax_entity_id = -1;
with plan:
PLAN (BILLS NATURAL)
PLAN (TRANSACTIONS INDEX (RDB$FOREIGN265))
i suppose i could join transactions to bills into a temporary table, run simple updates there (if
value is different, set to value in another column) then re-insert the entire thing into the
transactions table ... but it seems like there should be an easier, fast way.
"transactions" has 112363 rows, of which 88660 refer to bills; "bills" has 31729 rows, of which
5351 have provider_tax_id != actual_tax_id, with the potential to need fixing; there are 5674 rows
in the transactions table that appear to require updating. perhaps next i'll try selecting all of
those transactions.id into a temp table, then run the update only where the id is in that temp
table -- but that seems likely to not run quickly either.
in this case, i want to update a table based on inner-join data to a parent table. i'm not updating
two tables at once. however, firebird won't allow me to create a view which joins bills and
transactions together, then updates the view -- because there's the potential risk that i would try
to update multiple tables at once. (even though that's not the case.) any other suggestions?
firebird 1.0x superserver on linux (gig of ram, 1500+ mhz, etc. test machine)
thanks!
-philip
tax_entities
- id (pk)
- ...
bills
- id (pk)
- provider_tax_id (fk to tax_entities.id)
- group_tax_id (fk to tax_entities.id)
- facility_tax_id (fk to tax_entities.id)
- actual_tax_id (fk to tax_entities.id)
- ...
transactions
- id (pk)
- bill_id (fk to bills.id)
- tax_entity_id (fk to tax_entities.id)
- ...
"actual_tax_id" indicates who should have been paid on the bill. it will be equal to one of the
other three fk values (they're present just FYI, basically.) at some point, some transactions were
given the wrong tax_entity_id: they received the value of provider_tax_id rather than
actual_tax_id, when the two were different.
the only case i want to fix is the following:
if transactions.tax_entity_id = bills.provider_tax_id on the associated bill, i want to reset the
fk to point to actual_tax_id. (it is extremely doubtful that a bill would have transactions on it
not intended for payment of the provider, yet somehow reference a provider mentioned in one of the
3 named fk's ... thus i'm not worried about this updating more rows than absolutely necessary:
those should just get reset to the original value.)
attempting the following statement ...
update transactions set transactions.tax_entity_id = (select bills.actual_tax_id from bills where
bills.id = transactions.bill_id) where transactions.tax_entity_id = (select bills.provider_tax_id
from bills where bills.id = transactions.bill_id);
... results in the following plan:
PLAN (BILLS INDEX (RDB$PRIMARY249))
PLAN (BILLS NATURAL)
PLAN (TRANSACTIONS NATURAL)
attempting to do it in two steps (set all of the values to be updated to -1 [with a dummy record in
tax_entities to satisfy constraints], then run a second update to fetch the correct fk value) still
results in a slow second pass (first one runs in about 2 seconds, second ... ran for about 2 hours
before i killed it.)
as follows:
update transactions set transactions.tax_entity_id = -1 where transactions.tax_entity_id = (select
bills.provider_tax_id from bills where bills.id = transactions.bill_id);
with plan:
PLAN (BILLS INDEX (RDB$PRIMARY249))
PLAN (TRANSACTIONS NATURAL)
then
update transactions set transactions.tax_entity_id = (select bills.actual_tax_id from bills where
bills.id = transactions.bill_id) where transactions.tax_entity_id = -1;
with plan:
PLAN (BILLS NATURAL)
PLAN (TRANSACTIONS INDEX (RDB$FOREIGN265))
i suppose i could join transactions to bills into a temporary table, run simple updates there (if
value is different, set to value in another column) then re-insert the entire thing into the
transactions table ... but it seems like there should be an easier, fast way.
"transactions" has 112363 rows, of which 88660 refer to bills; "bills" has 31729 rows, of which
5351 have provider_tax_id != actual_tax_id, with the potential to need fixing; there are 5674 rows
in the transactions table that appear to require updating. perhaps next i'll try selecting all of
those transactions.id into a temp table, then run the update only where the id is in that temp
table -- but that seems likely to not run quickly either.
in this case, i want to update a table based on inner-join data to a parent table. i'm not updating
two tables at once. however, firebird won't allow me to create a view which joins bills and
transactions together, then updates the view -- because there's the potential risk that i would try
to update multiple tables at once. (even though that's not the case.) any other suggestions?
firebird 1.0x superserver on linux (gig of ram, 1500+ mhz, etc. test machine)
thanks!
-philip