Subject Fast massive delete with complex criteria
Author Josef Kokeš
Hi!

I find myself in a situation where I need to delete large amount of data
from a huge (comparably) table, with relatively complex criteria on what
stays and what gets deleted:

- The table in question has about 6 million rows.
- The rows contain a mix of integer, timestamp and blob columns, with
primary key A (integer) and secondary keys B (integer), C (integer) and
D (timestamp).
- I want to keep at least one record for every combination of B and C:
a) the newest (HAVING D = MAX(D))
b) within the last 30 days ((D+30) > 'now')

I have been trying a number of approaches, all of which turned out to be
unacceptably slow:

a) A stored procedure which loops though a SELECT B, C, MAX(D) GROUP BY
B, C, then DELETEs records with B=FoundB AND C=FoundC AND D<MaxD AND
(D+30)<'now'. I lost patience after 90 minutes or so.

b) Add a new integer column "keep_me" to the table, then use two updates
(#1: WHERE (D+30)>='now'; #2: WHERE A IN (SELECT MAX(A) GROUP BY B, C))
to set it to 1 for my required records, DELETE all records with keep_me
IS NULL, then drop the extra column. This is still running in the second
UPDATE.

c) Apparently a DROP TABLE is fast enough, so my next approach (if b
fails) is to create a new temporary table, INSERT records which I want
to keep into it, DROP the original table, recreate it, INSERT records
back and finally DROP the temporary table.

My questions are:

1) Am I overlooking some possible fast approach?
2) Do you think c has a reasonable chance of being the fastest?

Thanks,

Pepak