Subject | Help with Duplicate check algorithm |
---|---|
Author | Don Gollahon |
Post date | 2003-08-12T21:33:23Z |
We process millions of phone calls for many phone companies. One major
problem we have is with duplicate checks during import of data files. I can
toggle this off and the insert speed goes up about 5 times! Note that this
is not dup check of the primary keys. It is a check to make sure the vendor
did not send us the same file twice or overlap data from one file to the
next. We find phone calls that are apparently from the same phone # at the
same time to the same place and compare about a dozen other fields. We
cannot throw out dups because we have to give an account of all records in
the files. We just mark them as dups so they will not be pullied out for
processing later.
Currently we do a lookup before inserting each record. If there is another
call that matches date, time, from and to #s then we compare several other
fields. If any of them are different then we count it as a new call
otherwise we mark it as a dup. The locate (query) slows processing down
very much since there can be millions of calls. Inserts are at 16 records
per second with dup check on. It is at 50 to 80 recs per second with dup
check off.
I have a plan to try that could improve this. It involves removing the dup
check from the import process and making it a post processing method. I
could run a query that views all data between 2 dates in order of call_date,
time, from, and to #. Scan this result once and on records where these 4
fields are the same as the previous record, compae the other main fields and
mark accordingly. This way no lookups are done and the data is scanned
once.
Any ideas or comments on this would be appreciated.
Delphi 5 pro, FB 1.5, Windows 2000 servers.
Don Gollahon
gollahon@...
"The Original GenSoft Prodigal"
problem we have is with duplicate checks during import of data files. I can
toggle this off and the insert speed goes up about 5 times! Note that this
is not dup check of the primary keys. It is a check to make sure the vendor
did not send us the same file twice or overlap data from one file to the
next. We find phone calls that are apparently from the same phone # at the
same time to the same place and compare about a dozen other fields. We
cannot throw out dups because we have to give an account of all records in
the files. We just mark them as dups so they will not be pullied out for
processing later.
Currently we do a lookup before inserting each record. If there is another
call that matches date, time, from and to #s then we compare several other
fields. If any of them are different then we count it as a new call
otherwise we mark it as a dup. The locate (query) slows processing down
very much since there can be millions of calls. Inserts are at 16 records
per second with dup check on. It is at 50 to 80 recs per second with dup
check off.
I have a plan to try that could improve this. It involves removing the dup
check from the import process and making it a post processing method. I
could run a query that views all data between 2 dates in order of call_date,
time, from, and to #. Scan this result once and on records where these 4
fields are the same as the previous record, compae the other main fields and
mark accordingly. This way no lookups are done and the data is scanned
once.
Any ideas or comments on this would be appreciated.
Delphi 5 pro, FB 1.5, Windows 2000 servers.
Don Gollahon
gollahon@...
"The Original GenSoft Prodigal"