Subject Re: [Firebird-Architect] Full Text Search
Author Jim Starkey
Lester Caine wrote:

>And I want a mechanism that can be expanded to cover content of external
>files.
>
>
Not feasible. Search requires maintenance of a word index. There's no
way to track changes to external tables to update the index.

>Limit the fields that are 'indexed' - location of a word has
>table,field,record_no and position in record?
>
>
Yes.
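
A minimal sketch of what such a word index could look like, assuming each entry records table, field, record number and word position as described above. The names (Posting, WordIndex, search_all) are hypothetical and say nothing about how an engine would actually store this:

```python
from collections import defaultdict, namedtuple
import re

# Hypothetical posting layout: one entry per occurrence of a word.
Posting = namedtuple("Posting", "table field record_no position")

class WordIndex:
    def __init__(self):
        self.postings = defaultdict(list)  # word -> list of Posting

    def add(self, table, field, record_no, text):
        # Position is the word offset within the field, starting at 0.
        words = re.findall(r"[a-z0-9]+", text.lower())
        for position, word in enumerate(words):
            self.postings[word].append(Posting(table, field, record_no, position))

    def records(self, word):
        """Set of (table, record_no) pairs that contain the word."""
        return {(p.table, p.record_no) for p in self.postings.get(word.lower(), [])}

    def search_all(self, *words):
        """Records containing every requested word (set intersection)."""
        sets = [self.records(w) for w in words]
        return set.intersection(*sets) if sets else set()

index = WordIndex()
index.add("docs", "body", 1, "Full text search in Firebird")
index.add("docs", "body", 2, "Text blobs and search indexes")
index.add("notes", "title", 7, "Firebird architecture")
```

Because every posting carries its table, a single lookup such as index.search_all("firebird") spans tables without any per-table query.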

>> 4. Word stem matching should be supported (ie support*)
>>
>>
>In addition to soundex
>
>
I think you want to think very, very carefully about this. Soundex
would mean two index entries per word. When you were doing a retrieval,
would you do one with the actual words and one with the soundex (which
is necessarily noisy due to the algorithm)?

Is soundex better than a spelling corrector a la Google?
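
To make the noise concrete, here is a small sketch of the classic American Soundex coding (first letter plus three digits). It is written from the commonly published rules, not taken from any Firebird code, and the function name is just illustrative:

```python
# Consonant groups used by American Soundex; vowels, y, h and w get no code.
_GROUPS = {"1": "bfpv", "2": "cgjkqsxz", "3": "dt", "4": "l", "5": "mn", "6": "r"}
CODES = {c: digit for digit, letters in _GROUPS.items() for c in letters}

def soundex(word):
    letters = [c for c in word.lower() if c.isalpha()]
    if not letters:
        return ""
    result = letters[0].upper()
    prev = CODES.get(letters[0], "")
    for c in letters[1:]:
        if c in "hw":
            continue  # h and w do not separate letters with the same code
        code = CODES.get(c, "")
        if code and code != prev:
            result += code
        prev = code  # a vowel resets prev, so a repeated code after it counts
    return (result + "000")[:4]  # pad or truncate to four characters
```

Unrelated words collapse to the same key: soundex("Robert") and soundex("Rupert") both give "R163", so a soundex lookup for one retrieves the other as well.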

>Perhaps a flag on the 'dictionary'? Or a separate table with some sort of
>grouping function?
>
>
I've used a couple of different schemes. The common denominator is
that I don't like any of them.

>> 7. Result ordering by "score" is very important.
>>
>>
>Another area of 'fun' - last modified date of a record could be as
>important. Find the recent occurrences first.
>
>
Firebird doesn't track date of last modification. Are you proposing it
should?

Word searches have lots of false hits. Standard scoring ideas include
distance of words from top, number of occurrences, and proximity of
requested words. These are hard to combine with date of last modification.
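
As a rough illustration of those three scoring ideas, the sketch below combines them for a single record. The formula and the weights are arbitrary assumptions chosen for the example, not a recommended ranking function:

```python
def score(positions_by_word, w_top=1.0, w_count=1.0, w_prox=1.0):
    """Score one record.

    positions_by_word maps each requested word to the list of word
    positions where it occurs in the record (0 = first word).
    """
    all_positions = [p for ps in positions_by_word.values() for p in ps]
    if not all_positions:
        return 0.0
    # Distance from top: an earlier first hit scores higher.
    top = 1.0 / (1 + min(all_positions))
    # Number of occurrences: more hits score higher.
    count = len(all_positions)
    # Proximity: smallest gap between occurrences of two different words.
    gaps = [abs(a - b)
            for w1, ps1 in positions_by_word.items()
            for w2, ps2 in positions_by_word.items() if w1 < w2
            for a in ps1 for b in ps2]
    prox = 1.0 / (1 + min(gaps)) if gaps else 0.0
    return w_top * top + w_count * count + w_prox * prox
```

All three terms come from word positions already in the index. A modification date has no natural scale relative to them, so any weight given to it would be another arbitrary choice.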

Google News allows sorting by date. Regular Google does not, though they
do track the age of documents.

>> 8. A filter boolean operator that can be mixed with SQL boolean
>> operators is important.
>>
>>
>Is that my - search existing results? I think we need a temporary table
>for this with the results list, and then allow tighter searches?
>
>
I don't think so. A better strategy for too many hits is to let the user
reformulate the search.

>> 9. Search indexing should be html-aware
>>
>>
>And XML - filter to remove tags as an extension to the stop words?
>
>
That's what I meant by html-aware. Don't index <br>.
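
One way to picture this, using Python's standard html.parser module purely as an illustration (the class and function names are made up for the example): only character data between tags is passed on to the word indexer, so tag names such as br never reach the index.

```python
from html.parser import HTMLParser
import re

class TextOnly(HTMLParser):
    """Collects character data and ignores all tags and attributes."""
    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        self.chunks.append(data)

def indexable_words(markup):
    parser = TextOnly()
    parser.feed(markup)
    parser.close()  # flush any text still buffered by the parser
    # Join with spaces so words on either side of a tag stay separate.
    return re.findall(r"[a-z0-9]+", " ".join(parser.chunks).lower())
```

For example, indexable_words("<p>Full text<br>search</p>") yields the three words full, text and search. This sketch does not skip the contents of script or style elements; a real filter would need to.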

>>Exhaustive search is too expensive to even think about. Given a word
>>index with word position, it's cheaper to evaluate a given record
>>against the index than to apply an operator against a blob. Computing a
>>record number bitmap from the word index is definitely the way to go.
>>
>>
>I would expect to add entries to the search index as records are
>added/updated in the database - with the extension that allows scanning
>an external file when its name or modified date is changed.
>
>
How would the system even know an external file was of interest?

>
>
>>This is an area where the database world has dropped the ball big time.
>>The entire world navigates the web starting at Google, but no mainstream
>>database system supports the most basic multi-table search. Did the
>>database world miss the web altogether?
>>
>>
>
>I think we agree there as well, but aren't we just talking about
>additional functions to help an extension to the core engine rather than
>something that is a totally integral function. So that application
>specific changes can be included?
>
>
Engine extensions are required.


--

Jim Starkey
Netfrastructure, Inc.
978 526-1376