Subject | Re: [Firebird-Architect] Full Text Search |
---|---|
Author | Jim Starkey |
Post date | 2005-02-05T21:16:14Z |
Lester Caine wrote:
>And I want a mechanism that can be expanded to cover content of external
>files.

Not feasible. Search requires maintenance of a word index. There's no
way to track changes to external tables to update the index.

>Limit the fields that are 'indexed' - location of a word has
>table,field,record_no and position in record?

Yes.

>> 4. Word stem matching should be supported (ie support*)
>
>In addition to soundex

I think you want to think very, very carefully about this. Soundex
would mean two index entries per word. When you were doing a retrieval,
would you do one with the actual words and one with the soundex (which
is necessarily noisy due to the algorithm)?

Is soundex better than a spelling corrector a la Google?

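To make the "two index entries per word" cost and the noise concrete, here is a minimal sketch in Python. It assumes the standard American Soundex rules and a toy in-memory index; the `w:` and `s:` key prefixes and the `build_index` helper are illustrative only and are not part of any Firebird proposal.

```python
from collections import defaultdict

# Letter groups of standard American Soundex.
_CODES = {}
for _letters, _digit in (("BFPV", "1"), ("CGJKQSXZ", "2"), ("DT", "3"),
                         ("L", "4"), ("MN", "5"), ("R", "6")):
    for _ch in _letters:
        _CODES[_ch] = _digit


def soundex(word):
    """Return the 4-character Soundex code (first letter plus three digits)."""
    letters = [c for c in word.upper() if c.isalpha()]
    if not letters:
        return ""
    result = letters[0]
    prev = _CODES.get(letters[0], "")
    for ch in letters[1:]:
        digit = _CODES.get(ch, "")
        if digit and digit != prev:
            result += digit
        if ch not in "HW":  # H and W do not separate letters with equal codes
            prev = digit
    return (result + "000")[:4]


def build_index(records):
    """Hypothetical index: every word is stored twice, literally and by Soundex."""
    index = defaultdict(set)
    for rec_no, text in records.items():
        for word in text.lower().split():
            index["w:" + word].add(rec_no)           # exact entry
            index["s:" + soundex(word)].add(rec_no)  # phonetic entry
    return index


records = {1: "Robert wrote the parser", 2: "Rupert fixed the index"}
index = build_index(records)
exact_hits = index["w:robert"]   # only record 1
sounds_like = index["s:R163"]    # records 1 and 2: Robert and Rupert collide
```

The exact lookup finds one record, while the phonetic lookup also returns the unrelated "Rupert" record, which is the kind of noise in question.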
>Perhaps a flag on the 'dictionary? Or a separate table with some sort of
>grouping function?

I've used a couple of different schemes. The common denominator is
that I don't like any of them.

>> 7. Result ordering by "score" is very important.
>
>Another area of 'fun' - last modified date of a record could be as
>important. Find the recent occurrences first.

Firebird doesn't track date of last modification. Are you proposing it
should?

Word searches have lots of false hits. Standard scoring ideas: distance
of words from top, number of occurrences, proximity of requested words.
Hard to combine these with date of last modification.

Google News allows sorting by date. Regular Google does not, though they
do track the age of documents.

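As a rough illustration of the three scoring ideas (distance from top, number of occurrences, proximity), here is a minimal Python sketch. The `score` function, its equal weighting, and the use of a logarithm for the occurrence count are assumptions made for the example, not a scheme taken from any existing engine.

```python
import math
from itertools import product


def score(doc_words, query_words):
    """Hypothetical score combining earliness, occurrence count and proximity."""
    positions = {q: [i for i, w in enumerate(doc_words) if w == q]
                 for q in query_words}
    if not positions or any(not p for p in positions.values()):
        return 0.0  # no query, or some requested word is missing

    n = len(doc_words)
    # Distance from top: 1.0 when a word first appears at the very start.
    earliness = sum(1.0 - p[0] / n for p in positions.values()) / len(positions)
    # Number of occurrences, damped so repetition does not dominate.
    occurrences = math.log(1 + sum(len(p) for p in positions.values()))
    # Proximity: smallest span covering one occurrence of each requested word.
    # Brute force over all combinations, acceptable only for a sketch.
    span = min(max(c) - min(c) for c in product(*positions.values()))
    proximity = 1.0 / (1 + span)

    return earliness + occurrences + proximity  # equal weights, assumed


query = ["full", "text", "search"]
near_top = "full text search in firebird needs a word index".split()
scattered = ("firebird has many features and one day maybe "
             "text and also full search").split()
ranked_first = score(near_top, query) > score(scattered, query)
```

Nothing in this score depends on when the record changed, so a modification date would have to be folded in as a fourth, unrelated term with its own arbitrary weight.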
>> 8. A filter boolean operator that can be mixed with SQL boolean
>> operators is important.
>
>Is that my - search existing results? I think we need a temporary table
>for this with the results list, and then allow tighter searches?

I don't think so. A better strategy for too many hits is to let the user
reformulate the search.

>> 9. Search indexing should be html-aware
>
>And XML - filter to remove tags as an extension to the stop words?

That's what I meant by html-aware. Don't index <br>.

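A minimal sketch of what "don't index <br>" could look like, using Python's standard `html.parser` module. The `TextOnly` class and `indexable_words` helper are hypothetical names for the example; a real filter would probably also skip script and style content and strip punctuation.

```python
from html.parser import HTMLParser


class TextOnly(HTMLParser):
    """Collect only character data, so tags never reach the word index."""

    def __init__(self):
        super().__init__()
        self.words = []

    def handle_data(self, data):
        # Called for text between tags; tag names and attributes are skipped.
        self.words.extend(data.lower().split())


def indexable_words(markup):
    parser = TextOnly()
    parser.feed(markup)
    parser.close()  # flush any buffered trailing text
    return parser.words


words = indexable_words("<p>Full text<br>search</p>")
```

Here `words` holds the three text words and no entry for `p` or `br`.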
>>Exhaustive search is too expensive to even think about. Given a word
>>index with word position, it's cheaper to evaluate a given record
>>against the index than to apply an operator against a blob. Computing a
>>record number bitmap from the word index is definitely the way to go.
>
>I would expect to add entries to the search index as records are
>added/updated in the database - with the extension that allows scanning
>an external file when it's name or modified date is changed.

How would the system even know an external file was of interest?

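For readers unfamiliar with the record number bitmap mentioned in the quoted text, here is a minimal Python sketch. It assumes a toy in-memory word index and uses a Python integer as the bitmap; word positions are left out for brevity, and `build_bitmaps` and `search_all` are hypothetical names.

```python
from collections import defaultdict


def build_bitmaps(records):
    """Map each word to an int bitmap: bit n set means record n has the word."""
    bitmaps = defaultdict(int)
    for rec_no, text in records.items():
        for word in set(text.lower().split()):
            bitmaps[word] |= 1 << rec_no
    return bitmaps


def search_all(bitmaps, words):
    """AND the per-word bitmaps and return the matching record numbers."""
    if not words:
        return []
    result = -1  # all bits set, the identity for AND
    for word in words:
        result &= bitmaps.get(word, 0)
    return [n for n in range(result.bit_length()) if (result >> n) & 1]


records = {0: "full text search", 1: "text index", 2: "search the full text"}
bitmaps = build_bitmaps(records)
both = search_all(bitmaps, ["full", "text"])
```

Only the records whose bits survive the AND need to be fetched, which is why this is cheaper than applying an operator to every blob.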
>Engine extensions are required.
>
>>This is an area where the database world has dropped the ball big time.
>>The entire world navigates the web starting at Google, but no mainstream
>>database system supports the most basic multi-table search. Did the
>>database world miss the web altogether?
>>
>>
>
>I think we agree there as well, but aren't we just talking about
>additional functions to help an extension to the core engine rather than
>something that is a totally integral function. So that application
>specific changes can be included?
>
>
--
Jim Starkey
Netfrastructure, Inc.
978 526-1376