Subject: Re: avoiding duplicates
Author: Adam
Post date: 2005-07-13T00:18:51Z
Jason,
By that logic, an index is also a normalisation failure. You do not
actually need an index to perform any function; it is simply a
representation of the actual table data that can be used to speed up
operations such as conditions, joins and sorts.
No-one likes duplicated information in table fields, and ideally you
could create a case-insensitive index on the field that would make
the UPPER() comparison efficient.
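In engines that support expression (function-based) indexes this
really is a one-liner; a minimal sketch, assuming PostgreSQL-style
syntax and a made-up table:

  -- uniqueness is checked against UPPER(name), using the index
  CREATE UNIQUE INDEX ux_customers_name_ci ON customers (UPPER(name));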
But you have to ask yourself what a particular normalisation is for.
There are two ways to solve this particular problem of
case-insensitive duplicates.
1. As you have described, create a check constraint on the table that
uses something like the Upper function to check that no matching
record is already there (see the sketch after this list).
Pros:
No duplicates
Simple to follow logic
Cons:
It cannot use an index, so before any insert is done the entire
table needs to be scanned. That means that as time goes on and the
table fills up, inserts take longer and longer. With 20 records it
will be fast; with 1,000 you will start to notice; at 10,000, ouch.
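Whatever the exact constraint syntax, the check boils down to running
an existence test like this before every insert (a sketch only; the
table and column names are made up):

  SELECT COUNT(*)
  FROM customers
  WHERE UPPER(name) = UPPER(:new_name);
  -- the insert is rejected if the count is non-zero; with no index
  -- on UPPER(name), this is a full table scan on every single insert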
2. Create a proxy field whose sole purpose is to store a
case-insensitive version of the original (see the sketch after this
list).
Pros:
An index can be created, so insert operations do not take
progressively longer as the table grows.
Very little modification needed to the table structure
Cons:
A before-insert and before-update trigger has to be written to fill
this column.
Each record will be slightly larger to store (though you would never
need to transfer the extra column over the wire).
An additional index would need to be maintained.
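A minimal sketch of option 2, written in Firebird-style trigger
syntax as one possibility (adjust for your engine; all names are
made up). Making the index unique also enforces the constraint
itself:

  ALTER TABLE customers ADD name_upper VARCHAR(60);

  CREATE UNIQUE INDEX ux_customers_name_upper ON customers (name_upper);

  SET TERM ^ ;
  CREATE TRIGGER customers_biu FOR customers
  BEFORE INSERT OR UPDATE AS
  BEGIN
    -- keep the proxy column in sync with the real one
    NEW.name_upper = UPPER(NEW.name);
  END ^
  SET TERM ; ^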
Obviously, in this case option 2 would be by far the more efficient
resource-wise. I have also used a hybrid of option 2 where large (or
multiple) fields are involved. Basically, a hash is generated over
the field(s) of interest, and a non-unique index is created on the
hash. Although the hash is not guaranteed unique, it does a lot to
narrow the search down to similar records.
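A sketch of that hybrid; HASH() stands in for whatever hash function
your engine or application layer provides, and all names are made up:

  ALTER TABLE documents ADD body_hash INTEGER;
  CREATE INDEX ix_documents_body_hash ON documents (body_hash);

  -- compute :new_hash = HASH(UPPER(:new_body)) in a trigger or in
  -- the application, then check candidates through the non-unique
  -- index:
  SELECT id
  FROM documents
  WHERE body_hash = :new_hash              -- fast: indexed lookup
    AND UPPER(body) = UPPER(:new_body);    -- verify: hashes can collide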
Adam