Subject | Fast massive delete with complex criteria |
---|---|
Author | Josef Kokeš |
Post date | 2011-12-27T11:09:25Z |
Hi!
I find myself in a situation where I need to delete large amount of data
from a huge (comparably) table, with relatively complex criteria on what
stays and what gets deleted:
- The table in question has about 6 million rows.
- The rows contain a mix of integer, timestamp and blob columns, with
primary key A (integer) and secondary keys B (integer), C (integer) and
D (timestamp).
- I want to keep at least one record for every combination of B and C:
a) the newest (HAVING D = MAX(D))
b) within the last 30 days ((D+30) > 'now')
I have been trying a number of approaches, all of which turned out to be
unacceptably slow:
a) A stored procedure which loops though a SELECT B, C, MAX(D) GROUP BY
B, C, then DELETEs records with B=FoundB AND C=FoundC AND D<MaxD AND
(D+30)<'now'. I lost patience after 90 minutes or so.
b) Add a new integer column "keep_me" to the table, then use two updates
(#1: WHERE (D+30)>='now'; #2: WHERE A IN (SELECT MAX(A) GROUP BY B, C))
to set it to 1 for my required records, DELETE all records with keep_me
IS NULL, then drop the extra column. This is still running in the second
UPDATE.
c) Apparently a DROP TABLE is fast enough, so my next approach (if b
fails) is to create a new temporary table, INSERT records which I want
to keep into it, DROP the original table, recreate it, INSERT records
back and finally DROP the temporary table.
My questions are:
1) Am I overlooking some possible fast approach?
2) Do you think c has a reasonable chance of being the fastest?
Thanks,
Pepak
I find myself in a situation where I need to delete large amount of data
from a huge (comparably) table, with relatively complex criteria on what
stays and what gets deleted:
- The table in question has about 6 million rows.
- The rows contain a mix of integer, timestamp and blob columns, with
primary key A (integer) and secondary keys B (integer), C (integer) and
D (timestamp).
- I want to keep at least one record for every combination of B and C:
a) the newest (HAVING D = MAX(D))
b) within the last 30 days ((D+30) > 'now')
I have been trying a number of approaches, all of which turned out to be
unacceptably slow:
a) A stored procedure which loops though a SELECT B, C, MAX(D) GROUP BY
B, C, then DELETEs records with B=FoundB AND C=FoundC AND D<MaxD AND
(D+30)<'now'. I lost patience after 90 minutes or so.
b) Add a new integer column "keep_me" to the table, then use two updates
(#1: WHERE (D+30)>='now'; #2: WHERE A IN (SELECT MAX(A) GROUP BY B, C))
to set it to 1 for my required records, DELETE all records with keep_me
IS NULL, then drop the extra column. This is still running in the second
UPDATE.
c) Apparently a DROP TABLE is fast enough, so my next approach (if b
fails) is to create a new temporary table, INSERT records which I want
to keep into it, DROP the original table, recreate it, INSERT records
back and finally DROP the temporary table.
My questions are:
1) Am I overlooking some possible fast approach?
2) Do you think c has a reasonable chance of being the fastest?
Thanks,
Pepak