Subject RE: [firebird-support] Re: detect duplicate blobs, how to?
Author Richard Damon

Yes, it is a matter of scale, but it is also a matter of what you promise. If you are going to promise 100% reliable detection of duplicates, then relying on a hash alone (no matter how good) is a bug, because a hash ISN'T 100% reliable. It might be "Reliable Enough", but don't say it is 100%.

Please also review your math, because your own argument demonstrates the flaw. The number 2.71 x 10^18 ISN'T a chance of collision, or a probability of any kind. It is the number of (128-bit) UUIDs that would need to be generated to have a 50% chance of a collision.
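
For anyone who wants to check that figure, the birthday bound reproduces it from the 122 random bits of a version-4 UUID (a quick back-of-the-envelope sketch, not from either of our posts):

```python
import math

# Birthday bound: drawing at random from a space of size N, you need about
# n = sqrt(2 * N * ln 2) draws to reach a 50% chance of at least one collision.
N = 2 ** 122  # random bits in a version-4 UUID (128 bits minus 6 fixed bits)
n = math.sqrt(2 * N * math.log(2))
print(f"n = {n:.2e}")  # n = 2.71e+18 -- the figure from the Wikipedia page
```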

I would actually argue that using a very high-quality hash like SHA-256 takes the problem in the wrong direction. It is a relatively expensive value to compute, and at 256 bits (32 bytes) it is somewhat big to store, and to index, for every file. I would be much more inclined to use a simple hash (like a CRC) that gives a 32-bit result, look up the files whose hash matches (somewhat likely to find one), and do the bit-by-bit compare to see whether it really is a duplicate. That is probably faster than your method of computing the SHA-256 and assuming (possibly, even if unlikely, incorrectly) that matching hashes mean identical files; the time saved by the simpler hash and the quicker index likely makes up for the time needed to compare the two files.
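
Here is a minimal in-memory sketch of that cheap-hash-then-verify idea (the names are illustrative; in a real database the lookup would of course be a query against an indexed CRC column):

```python
import zlib

# Illustrative in-memory stand-in for an indexed 32-bit CRC column.
index = {}  # crc32 value -> list of stored blobs sharing that CRC

def is_duplicate(blob: bytes) -> bool:
    """Return True if an identical blob was already stored."""
    crc = zlib.crc32(blob)
    # A matching CRC only means "possible duplicate"; the byte-for-byte
    # compare below is what makes the answer 100% reliable.
    for candidate in index.get(crc, []):
        if candidate == blob:
            return True
    index.setdefault(crc, []).append(blob)
    return False

print(is_duplicate(b"some file contents"))  # False: first time seen
print(is_duplicate(b"some file contents"))  # True: exact byte match found
```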

From: firebird-support@yahoogroups.com [mailto:firebird-support@yahoogroups.com]
Sent: Thursday, February 9, 2017 6:40 PM
To: firebird-support@yahoogroups.com
Subject: RE: [firebird-support] Re: detect duplicate blobs, how to?

 

 

Richard,

> Sean
> Even SHA256 can’t eliminate all possibility of a duplicate. If you have files
> longer than 256 bits, then by the pigeonhole principle, there WILL be
> duplicates within the universe of all possible files. There HAS to be. The
> probability is very low (but not zero) if you keep your set of files below
> 2^128 members, but it is NOT 0.

I agree that there is a possibility, but it is all about understanding scale.

Consider, 2^128 is the same number space that is used to represent UUID and GUID values, and the chance of a collision is 2.71 x 10^18 (https://en.wikipedia.org/wiki/Universally_unique_identifier )

Consider, the number of words ever spoken by human beings amounts to about 5 exabytes (< 2^63) (http://highscalability.com/blog/2012/9/11/how-big-is-a-petabyte-exabyte-zettabyte-or-a-yottabyte.html), so what are the chances that more than 2^128 files have ever been created?

Consider, the probability that a rogue asteroid crashes on Earth *within the next second*, obliterating civilization-as-we-know-it, is about 10^-15, which is 45 **orders of magnitude** more probable than an SHA-256 collision. (http://stackoverflow.com/questions/4014090/is-it-safe-to-ignore-the-possibility-of-sha-collisions-in-practice)

IMO, the chances of a collision are, for all practical purposes, 0.
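
To put a rough number on that (my own illustrative arithmetic, using a hypothetical trillion-blob store, not a figure from the links above):

```python
n = 10 ** 12                      # hypothetical store of one trillion blobs
p = n * (n - 1) / 2 / 2.0 ** 256  # birthday approximation: pairs / hash space
print(f"p = {p:.1e}")             # p = 4.3e-54 -- dwarfed by the 1e-15 asteroid odds
```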

But if you still think there is a chance, then use SHA512.


Sean
