Subject Re: [firebird-support] Deduplicating data
Author Scott Morgan
On 11/01/13 11:11, Svein Erling Tysvær wrote:
> I don't see anything stupid, though of course, you would not want an index to be used for a.TEXT_ID (since that will only eliminate the first out of an average of 150 duplicates. Depending on your plan, a slight change might make the query a bit quicker:
>
> INSERT INTO DUP_MAP(OLD_TEXT_ID, NEW_TEXT_ID)
> SELECT a.TEXT_ID, b.NEW_ID
> FROM TEXT_TAB a
> JOIN DUP_LIST b
> ON a.TEXT_ID+0 > b.NEW_ID
> AND a.TEXT = b.TEXT;

Gave it a try, but it made no difference.

I tried a different route. I added a hash field to the relevant tables
and created an index on the TEXT_TAB.TEXTHASH. I still needed to compare
text fields as there were some hash collisions.

The difference is amazing, the initial listing of duplicate phrases
happens in about half the time now (8 mins against 15 mins), but the
production of the duplicate map is even better (5 mins against 3+
hours!). The overhead of creating the hashes and index are only a few
minutes, so a massive performance improvement overall.

Scott