Subject RE: [firebird-support] Re: detect duplicate blobs, how to?
Author Richard Damon

From: firebird-support@yahoogroups.com [mailto:firebird-support@yahoogroups.com]
Sent: Thursday, February 9, 2017 11:33 AM
To: firebird-support@yahoogroups.com
Subject: RE: [firebird-support] Re: detect duplicate blobs, how to?

> You are aware of course that you can't use any hashing function on its own to
> detect duplicates? - the best you can do is detect *probable* duplicates,

Actually, if you choose the right hash function you can detect duplicates.

If you create a UDF based on/using SHA256, the result would be unique (with a 2^256 certainty) -- there is no known collision of a SHA256 hash (https://en.wikipedia.org/wiki/Hash_function_security_summary).


Sean

Even SHA256 can’t eliminate all possibility of a duplicate. If you have files of more than 256 bits, then by the pigeonhole principle there WILL be colliding files within the universe of all possible files. There HAS to be: there are more possible files than hash values. The probability of a collision is very low (but not zero) if you keep your set of files below 2^128 members, but it is NOT 0.

The key property of a hash like SHA256 is that, given a hash value, you cannot create a file (other than by brute force) that will yield that hash value. When using a hash, you need to decide whether the chance of a false positive on a match is acceptable. With a good large hash that probability gets very small, so maybe you can assume it is perfect. I would likely still compare the files to be sure, since you will likely only incur that comparison cost when you do have a duplicate.
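The hash-then-verify approach described above can be sketched as follows. This is a minimal illustration in Python, not Firebird UDF code; the `blobs` mapping and function names are hypothetical stand-ins for however you fetch blob contents. Hashes group candidate duplicates cheaply; a byte-for-byte comparison then confirms them, so the expensive full comparison is only paid when two blobs already share a hash:

```python
import hashlib
from collections import defaultdict


def sha256_of(data: bytes) -> str:
    # Hex digest of the blob's SHA-256 hash.
    return hashlib.sha256(data).hexdigest()


def find_duplicates(blobs):
    """Group blobs (a dict of name -> bytes) by SHA-256 hash, then
    byte-compare within each group to rule out the astronomically
    unlikely (but nonzero) chance of a hash collision."""
    by_hash = defaultdict(list)
    for name, data in blobs.items():
        by_hash[sha256_of(data)].append(name)

    confirmed = []
    for names in by_hash.values():
        # Only blobs with equal hashes are ever fully compared.
        for i in range(len(names)):
            for j in range(i + 1, len(names)):
                if blobs[names[i]] == blobs[names[j]]:
                    confirmed.append((names[i], names[j]))
    return confirmed
```

In practice you would store the hex digest in an indexed column alongside the blob, so duplicate candidates can be found with a plain SQL join on that column and only the final byte comparison touches the blob data itself.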