Subject Re: SV: [Firebird-Architect] Indexes for big objects
Author Roman Rokytskyy
> > > What many people would really want is quick search for value
> > > parts in
> > > long values (STARTING WITH, CONTAINING), not exact match for
> > > long value.
> > > And this type of index is not applicable for this operation.
> >
> > That's a different issue- for that you want full text search and
> > there the question is whether it should be field/table based or
> > cross field and tables - also it needs proximity and some of the
> > other rules that search engines use to prioritize results.
>
> I agree it's a different issue, but one which is much more common.
>
> By creating a new "keyword" index (indexing the individual words in
> a field/blob), we would go a long way to starting on the road of a
> full-text solution (which I think of as a cross table/field index --
> so it would be a definition problem not implementation), while at
> the same time providing a feature which would likely be used by a
> LOT of FB users/developers.

Let me put some words about the text search issue.

First, let's agree that using normal hashing algorithm to detect spam
is useless - any difference in formatting will not be detected. So,
the only "usable" use case for the normal hashing is to "help" user to
avoid duplicates (MD5 and SHA1 will ensure good results - for the most
real-world cases they will provide unique values). But for this use
case I agree with Pavel and Dmitry - it does not belong in Firebird as
it can be easily implemented using UDF.

Now about spam detection, or better to say similarity detection of two
texts.

There are two approaches to full text search - inverted file index and
signature file index (see also my article in IB Developer magazine No.
3), however the signature file index was only briefly mentioned there.

What is signature file index? It is an algorithm (or group of
algorithms) that compute a hash for a given text but in a such a way,
that each word in the text will set exactly one bit of a hash to 1.
Usualy a 128-bit hashes are computed and naturally there will be many
collisions which are "resolved" using the boolean OR: if some bit is
already set to 1 because word A appeared in text, and parser finds
word B that sets the same bit, the bit remains set.

The search part then happens in the following way: engine computes the
hash of the search phrase and AND-s this hash with the hashes of each
document. Those that produce values > 0 are our candidate matches.
These candidates are loaded from the disk and are processed to filter
out the false positives.

Research publications show that for the full text search the inverted
file approach is more efficient than signature file.

However, it seems that if we talk about detecting similar texts, the
signature file approach might be usefull. The index will allow also
range queries, namely "give me all documents where similarity to the
specified one is greater than 0.99", etc. But I guess this is no
longer simple task and will require some resources/investment.

Considering that we are not (yet?) playing on the Web-apps field where
the feature would be requirement the most, it gets "low priority" from
me (see also the voting results for full-text search on
FirebirdNews.org poll).

Roman