Subject | RE: [firebird-support] Re: detect duplicate blobs, how to? |
---|---|
Author | Richard Damon |
Post date | 2017-02-09T22:18:41Z |
From: firebird-support@yahoogroups.com [mailto:firebird-support@yahoogroups.com]
Sent: Thursday, February 9, 2017 11:33 AM
To: firebird-support@yahoogroups.com
Subject: RE: [firebird-support] Re: detect duplicate blobs, how to?
> You are aware of course that you can't use any hashing function on its own to
> detect duplicates? - the best you can do is detect *probable* duplicates,

Actually, if you choose the right hash function you can detect duplicates.
If you create a UDF based on/using SHA256, the result would be unique for all practical purposes -- there is no known collision of a SHA256 hash (https://en.wikipedia.org/wiki/Hash_function_security_summary).
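As a sketch of the idea (in Python rather than as a Firebird UDF, which is an assumption about how one might prototype it): compute a SHA-256 digest of each blob's bytes and store it in an indexed column, so duplicate candidates can be found by comparing short fixed-size digests instead of whole blobs.

```python
# Hypothetical stand-in for a SHA-256 hashing UDF: digest the blob's bytes.
import hashlib

def blob_digest(data: bytes) -> str:
    """Return the hex SHA-256 digest of a blob's contents."""
    return hashlib.sha256(data).hexdigest()

# Identical contents always yield identical digests...
print(blob_digest(b"hello") == blob_digest(b"hello"))    # True
# ...while distinct contents virtually never collide.
print(blob_digest(b"hello") == blob_digest(b"hello!"))   # False
```

The digest is 64 hex characters regardless of blob size, so it can live in an ordinary indexed VARCHAR column for fast equality lookups.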
Sean
Even SHA256 can’t eliminate all possibility of a duplicate. If your files are longer than 256 bits, then by the pigeonhole principle there WILL be duplicate hash values within the universe of all possible files. There HAS to be. The probability is very low (but not zero) if you keep your set of files below 2^128 members, but it is NOT 0.

The key property of a hash like SHA256 is that, given a hash value, you cannot create a file (other than by brute force) that will yield that hash value. When using a hash, you need to decide whether the chance of a false positive on a match is acceptable. With a good large hash, that probability gets very small, so maybe you can assume it is perfect. I would likely still compare the files to be sure, since you will likely only incur that cost when you do have a duplicate.
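The hash-then-verify approach described above can be sketched as follows (a minimal in-memory illustration, not Firebird code; the `stored` dict stands in for a table keyed by a hash column):

```python
# Sketch of "compare the files to be sure": look up blobs by SHA-256 digest,
# and on a digest match fall back to a byte-for-byte comparison. The full
# comparison cost is only incurred when a digest matches, which in practice
# almost always means a true duplicate.
import hashlib
from typing import Optional

def find_duplicate(new_blob: bytes, stored: dict) -> Optional[str]:
    """Return the digest key of a stored blob identical to new_blob, else None.

    `stored` maps a SHA-256 hex digest to blob bytes (a hypothetical
    in-memory stand-in for a blob table with an indexed hash column).
    """
    digest = hashlib.sha256(new_blob).hexdigest()
    candidate = stored.get(digest)
    if candidate is not None and candidate == new_blob:  # verify byte-for-byte
        return digest
    return None

store = {hashlib.sha256(b"invoice contents").hexdigest(): b"invoice contents"}
print(find_duplicate(b"invoice contents", store) is not None)  # True: duplicate
print(find_duplicate(b"other contents", store))                # None: unique
```

The final byte comparison removes even the astronomically small false-positive risk, at a cost paid only on digest matches.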