Subject Re: [Firebird-Architect] Full Text Search
Author Lester Caine
Jim Starkey wrote:

> Lester Caine wrote:
>>Hopefully this starts fleshing out the question - "What is Full Text
>>Search?"
>
> Most database people approach search as a SQL boolean. This is too
> restrictive to be useful. Think Google, not SQL.
I think we agree there
And I want a mechanism that can be expanded to cover content of external
files.

> Unless Monty has had a major change in attitude, don't use MySQL as a model.
I think I'd be looking MSSQL rather than MySQL I just included the notes
for comparison - and how not to do it ;)

> Base on my experience with database based search:
>
> 1. The general case is multi-table, multi-column search, which will
> need an API extension.
> 2. Search should be restrictable to an explicit set of tables
Limit the fields that are 'indexed' - location of a word has
table,field,record_no and position in record?
> 3. Multi-term and word proximity search must be supported (this
> includes phrases)
Contains all words has a lower score than contains all words in the same
order and contains exact phrase has the highest score?
> 4. Word stem matching should be supported (ie support*)
In addition to soundex
> 5. Alta Vista rules for case matching are nice, but probably not
> required
Character sets add to the fun :)
> 6. Support for application specific "stop" words (words too common to
> fetch from index -- a, the, and, etc.) is important
Perhaps a flag on the 'dictionary? Or a separate table with some sort of
grouping function?
> 7. Result ordering by "score" is very important.
Another area of 'fun' - last modified date of a record could be as
important. Find the recent occurrences first.
> 8. A filter boolean operator that can be mixed with SQL boolean
> operators is important.
Is that my - search existing results? I think we need a temporary table
for this with the results list, and then allow tighter searches?
> 9. Search indexing should be html-aware
And XML - filter to remove tags as an extension to the stop words?

> I use a model where any field can be declared "searchable" (other fields
> are not considered in search). I keep a single search index per
> database. I'm considering changing this to search index per schema, but
> haven't yet made up my mind.
What does the search index consist of?

>>A simple starting point from my viewpoint is a convention on
>>numbering/identifying words in a document, be that a BLOB or an external
>>file, so that we have a standard pointer method. Doc27/Word32 - so that
>>if two words are being looked for we can match adjacent occurrences or
>>any occurrence of the words.
>
> Exhaustive search is too expensive to even think about. Given a word
> index with word position, it's cheaper to evaluate a given record
> against the index than to apply an operator against a blob. Computing a
> record number bitmap from the word index is definitely the way to go.
I would expect to add entries to the search index as records are
added/updated in the database - with the extension that allows scanning
an external file when it's name or modified date is changed.

>>How does this fit in with other peoples thinking?
>>
> This is an area where the database world has dropped the ball big time.
> The entire world navigates the web starting at Google, but no mainstream
> database system supports the most basic multi-table search. Did the
> database world miss the web altogether?

I think we agree there as well, but aren't we just talking about
additional functions to help an extension to the core engine rather than
something that is a totally integral function. So that application
specific changes can be included?

--
Lester Caine
-----------------------------
L.S.Caine Electronic Services