Subject | Deduplicating data |
---|---|
Author | Scott Morgan |
Post date | 2013-01-11T10:40Z |
I've been importing data from an old system, part of which contains a
big table full of text. The table contains around 1.5 million entries,
but a lot of them are duplicate to the the extent that unique entries
total less than 10,000.
It's pretty easy to create a list of these duplicate entries using a
GROUP BY query:
INSERT INTO DUP_LIST(DUP_COUNT, NEW_ID, TEXT)
SELECT COUNT(1), MIN(TEXT_ID), TEXT
FROM TEXT_TAB
GROUP BY TEXT
HAVING COUNT(1) > 1;
And that takes about 15 mins to run.
However I need a mapping between the old duplicate TEXT_IDs and the
selected unique one so I can change entries in the tables that reference
the TEXT_TAB (I'm using the MIN(TEXT_ID) to select that in the above query):
INSERT INTO DUP_MAP(OLD_TEXT_ID, NEW_TEXT_ID)
SELECT a.TEXT_ID, b.NEW_ID
FROM TEXT_TAB a
JOIN DUP_LIST b
ON a.TEXT_ID > b.NEW_ID
AND a.TEXT = b.TEXT;
That query takes over 3 hours.
I'm pretty sure I'm doing something stupid here, that second query is
obviously very bad. The first query touches on all the info I need
(duplicate IDs) but I can't see how to extract that information in a
useful way. Is it possible?
Scott
big table full of text. The table contains around 1.5 million entries,
but a lot of them are duplicate to the the extent that unique entries
total less than 10,000.
It's pretty easy to create a list of these duplicate entries using a
GROUP BY query:
INSERT INTO DUP_LIST(DUP_COUNT, NEW_ID, TEXT)
SELECT COUNT(1), MIN(TEXT_ID), TEXT
FROM TEXT_TAB
GROUP BY TEXT
HAVING COUNT(1) > 1;
And that takes about 15 mins to run.
However I need a mapping between the old duplicate TEXT_IDs and the
selected unique one so I can change entries in the tables that reference
the TEXT_TAB (I'm using the MIN(TEXT_ID) to select that in the above query):
INSERT INTO DUP_MAP(OLD_TEXT_ID, NEW_TEXT_ID)
SELECT a.TEXT_ID, b.NEW_ID
FROM TEXT_TAB a
JOIN DUP_LIST b
ON a.TEXT_ID > b.NEW_ID
AND a.TEXT = b.TEXT;
That query takes over 3 hours.
I'm pretty sure I'm doing something stupid here, that second query is
obviously very bad. The first query touches on all the info I need
(duplicate IDs) but I can't see how to extract that information in a
useful way. Is it possible?
Scott