firebird-architect - Re: [Firebird-Architect] Full Text Search

Subject	Re: [Firebird-Architect] Full Text Search
Author	Jim Starkey
Post date	2005-02-05T19:09:17Z

Lester Caine wrote:

>Hopefully this starts fleshing out the question - "What is Full Text
>Search?"
>I hope we are going down the route of a proper search engine, supported
>by internal functions that enable that to work efficiently. At the very
>least I see a need to create a FULL TEXT index that contains a
>dictionary of words, and pointers to the location of each occurrence of
>the word. Since the identification of records is a major part of this, I
>don't see that a 'fully automatic' implementation is possible unless we
>introduce a string of restrictions, as in the MySQL case. What I think
>we are probably looking for is a set of ground rules against which other
>functions can be implemented?
>
>

Most database people approach search as a SQL boolean. This is too
restrictive to be useful. Think Google, not SQL.

Unless Monty has had a major change in attitude, don't use MySQL as a model.

Base on my experience with database based search:

1. The general case is multi-table, multi-column search, which will
need an API extension.
2. Search should be restrictable to an explicit set of tables
3. Multi-term and word proximity search must be supported (this
includes phrases)
4. Word stem matching should be supported (ie support*)
5. Alta Vista rules for case matching are nice, but probably not
required
6. Support for application specific "stop" words (words too common to
fetch from index -- a, the, and, etc.) is important
7. Result ordering by "score" is very important.
8. A filter boolean operator that can be mixed with SQL boolean
operators is important.
9. Search indexing should be html-aware

I use a model where any field can be declared "searchable" (other fields
are not considered in search). I keep a single search index per
database. I'm considering changing this to search index per schema, but
haven't yet made up my mind.

>A simple starting point from my viewpoint is a convention on
>numbering/identifying words in a document, be that a BLOB or an external
>file, so that we have a standard pointer method. Doc27/Word32 - so that
>if two words are being looked for we can match adjacent occurrences or
>any occurrence of the words.
>
>

Exhaustive search is too expensive to even think about. Given a word
index with word position, it's cheaper to evaluate a given record
against the index than to apply an operator against a blob. Computing a
record number bitmap from the word index is definitely the way to go.

>How does this fit in with other peoples thinking?
>
>
>

This is an area where the database world has dropped the ball big time.
The entire world navigates the web starting at Google, but no mainstream
database system supports the most basic multi-table search. Did the
database world miss the web altogether?

--

Jim Starkey
Netfrastructure, Inc.
978 526-1376