Subject | RE: [firebird-support] Deduplicating data
---|---
Author | Svein Erling Tysvær |
Post date | 2013-01-11T11:11:23Z |
>I've been importing data from an old system, part of which contains a big table full of text. The table
>contains around 1.5 million entries, but a lot of them are duplicates, to the extent that unique entries
>total less than 10,000.
>
>It's pretty easy to create a list of these duplicate entries using a GROUP BY query:
>
>INSERT INTO DUP_LIST(DUP_COUNT, NEW_ID, TEXT)
> SELECT COUNT(1), MIN(TEXT_ID), TEXT
> FROM TEXT_TAB
> GROUP BY TEXT
> HAVING COUNT(1) > 1;
>
>And that takes about 15 mins to run.
>
>However I need a mapping between the old duplicate TEXT_IDs and the selected unique one so I can change
>entries in the tables that reference the TEXT_TAB (I'm using the MIN(TEXT_ID) to select that in the above
>query):
>
>INSERT INTO DUP_MAP(OLD_TEXT_ID, NEW_TEXT_ID)
> SELECT a.TEXT_ID, b.NEW_ID
> FROM TEXT_TAB a
> JOIN DUP_LIST b
> ON a.TEXT_ID > b.NEW_ID
> AND a.TEXT = b.TEXT;
>
>That query takes over 3 hours.
>
>I'm pretty sure I'm doing something stupid here, that second query is obviously very bad. The first query
>touches on all the info I need (duplicate IDs) but I can't see how to extract that information in a useful
>way. Is it possible?
Hi Scott!

I don't see anything stupid, though of course you would not want an index to be used for a.TEXT_ID (since that would only eliminate the first out of an average of 150 duplicates). Depending on your plan, a slight change might make the query a bit quicker:
INSERT INTO DUP_MAP(OLD_TEXT_ID, NEW_TEXT_ID)
  SELECT a.TEXT_ID, b.NEW_ID
    FROM TEXT_TAB a
    JOIN DUP_LIST b
      ON a.TEXT_ID+0 > b.NEW_ID
     AND a.TEXT = b.TEXT;
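If you're unsure which plan Firebird will choose, isql can show it without actually running the statement. A minimal sketch, assuming you're working in isql (and if I remember the commands correctly):

SET PLANONLY;  -- show the plan only, don't execute
SELECT a.TEXT_ID, b.NEW_ID
  FROM TEXT_TAB a
  JOIN DUP_LIST b
    ON a.TEXT_ID+0 > b.NEW_ID
   AND a.TEXT = b.TEXT;
SET PLANONLY;  -- toggle plan-only mode back off

The +0 simply stops the optimizer from using an index on TEXT_ID for that comparison, without changing the result.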
Another, more important, thing is whether you have an index on DUP_LIST.TEXT. If the field is too long for an ordinary index, create an index covering part of the field and add the same expression to your query if necessary (I hardly ever need expression indexes or calculated indexed fields, so I'm uncertain exactly what you would have to add).
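Just as a sketch of what I mean, using Firebird's expression-index syntax (the index name and the 50-character prefix are made up, and you may need a different expression, or none at all if TEXT fits in an ordinary index):

CREATE INDEX IDX_DUP_LIST_TEXT ON DUP_LIST
  COMPUTED BY (SUBSTRING(TEXT FROM 1 FOR 50));

INSERT INTO DUP_MAP(OLD_TEXT_ID, NEW_TEXT_ID)
  SELECT a.TEXT_ID, b.NEW_ID
    FROM TEXT_TAB a
    JOIN DUP_LIST b
      ON a.TEXT_ID+0 > b.NEW_ID
     -- repeat the index expression so the optimizer can match it
     AND SUBSTRING(a.TEXT FROM 1 FOR 50) = SUBSTRING(b.TEXT FROM 1 FOR 50)
     AND a.TEXT = b.TEXT;

The extra SUBSTRING comparison is only there so the expression index can be used; the full a.TEXT = b.TEXT comparison still does the real work.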
I agree that inserting fewer than 150 records per second sounds like it could be improved.
HTH,
Set