Subject: Deduplicating data
Author: Scott Morgan
I've been importing data from an old system, part of which is a big
table full of text. The table contains around 1.5 million rows, but
many of them are duplicates, to the extent that unique entries total
fewer than 10,000.

It's pretty easy to create a list of these duplicate entries using a
GROUP BY query:

INSERT INTO DUP_LIST(DUP_COUNT, NEW_ID, TEXT)
SELECT COUNT(1), MIN(TEXT_ID), TEXT
FROM TEXT_TAB
GROUP BY TEXT
HAVING COUNT(1) > 1;

And that takes about 15 mins to run.
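(For reference, here's a toy reproduction of that GROUP BY step using
Python's sqlite3 module. The table and column names come from the
queries above; the sample data is made up.)

```python
import sqlite3

# In-memory stand-in for the real schema described in the post.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE TEXT_TAB (TEXT_ID INTEGER PRIMARY KEY, TEXT)")
con.executemany("INSERT INTO TEXT_TAB VALUES (?, ?)",
                [(1, "a"), (2, "b"), (3, "a"), (4, "a"), (5, "c")])
con.execute("CREATE TABLE DUP_LIST (DUP_COUNT INTEGER, NEW_ID INTEGER, TEXT)")

# One row per duplicated TEXT value, keeping the smallest TEXT_ID.
con.execute("""
    INSERT INTO DUP_LIST (DUP_COUNT, NEW_ID, TEXT)
    SELECT COUNT(1), MIN(TEXT_ID), TEXT
    FROM TEXT_TAB
    GROUP BY TEXT
    HAVING COUNT(1) > 1
""")

rows = sorted(con.execute("SELECT DUP_COUNT, NEW_ID, TEXT FROM DUP_LIST"))
print(rows)  # only "a" is duplicated: [(3, 1, 'a')]
```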

However, I need a mapping from each old duplicate TEXT_ID to the one
I'm keeping, so I can update the tables that reference TEXT_TAB (the
MIN(TEXT_ID) in the query above selects the row to keep):

INSERT INTO DUP_MAP(OLD_TEXT_ID, NEW_TEXT_ID)
SELECT a.TEXT_ID, b.NEW_ID
FROM TEXT_TAB a
JOIN DUP_LIST b
ON a.TEXT_ID > b.NEW_ID
AND a.TEXT = b.TEXT;

That query takes over 3 hours.

I'm pretty sure I'm doing something stupid here; that second query is
obviously very bad. The first query already touches all the information
I need (the duplicate IDs), but I can't see how to extract it in a
useful way. Is it possible?
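(One single-pass alternative, assuming the database supports window
functions, would be an analytic MIN partitioned by TEXT, which avoids
the self-join on the long TEXT column entirely. A minimal sketch using
Python's sqlite3 module, with a toy stand-in for TEXT_TAB:)

```python
import sqlite3

# Toy schema mirroring the post's tables; the data is made up.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE TEXT_TAB (TEXT_ID INTEGER PRIMARY KEY, TEXT)")
con.executemany("INSERT INTO TEXT_TAB VALUES (?, ?)",
                [(1, "a"), (2, "b"), (3, "a"), (4, "a"), (5, "b")])
con.execute("CREATE TABLE DUP_MAP (OLD_TEXT_ID INTEGER, NEW_TEXT_ID INTEGER)")

# The analytic MIN picks the keeper per TEXT group in one scan, so no
# join on TEXT is needed (requires window-function support, e.g.
# SQLite >= 3.25 or Oracle's analytic functions).
con.execute("""
    INSERT INTO DUP_MAP (OLD_TEXT_ID, NEW_TEXT_ID)
    SELECT TEXT_ID, NEW_ID
    FROM (SELECT TEXT_ID,
                 MIN(TEXT_ID) OVER (PARTITION BY TEXT) AS NEW_ID
          FROM TEXT_TAB)
    WHERE TEXT_ID <> NEW_ID
""")

rows = sorted(con.execute("SELECT OLD_TEXT_ID, NEW_TEXT_ID FROM DUP_MAP"))
print(rows)  # each duplicate maps to the smallest TEXT_ID in its group
```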

Scott