firebird-architect - Re: [Firebird-Architect] Full Text Search

Subject	Re: [Firebird-Architect] Full Text Search
Author	Lester Caine
Post date	2005-02-05T22:53:05Z

Jim Starkey wrote:

> Lester Caine wrote:
>
>>And I want a mechanism that can be expanded to cover content of external
>>files.
>
> Not feasible. Search requires maintenence of a word index. There's no
> way to track changes to external tables to update the index.

The index of available documents is managed via the database. Updates to
a file would be managed via the database. This is BLOB <-> external data
rather than BLOB <-> internal data. Another part of the management
process is maintaining the history of changes, and allowing rollback to
previous versions. It could all be done in BLOB's, but direct access to
the files reduces the load in Firebird?

>>> 4. Word stem matching should be supported (ie support*)
>>
>>In addition to soundex
>
> I think you want to think very, very carefully about this. Soundex
> would mean two index entries per word. When you were doing a retrieval,
> would you do one with the actual words and one with the soundex (which
> is necessary noisy due to the algorithm)?
>
> I soundex better than a spelling corrector a la Google?

The example was a fuzzy search for a surname in genealogical data. You
are looking for a match on a set of words rather than a single word. I
don't see that there would be two index entries, just lookup on 'caine',
'cain' and 'kane' - Is this any different to comput* looking up
computer, computers, computing, computed etc.?

>>Perhaps a flag on the 'dictionary? Or a separate table with some sort of
>>grouping function?
>
> I've used a couple of difference schemes. The common denominator is
> that I don't like any of them.

I think that is probably a main area of discussion.

>>> 7. Result ordering by "score" is very important.
>>
>>Another area of 'fun' - last modified date of a record could be as
>>important. Find the recent occurrences first.
>
> Firebird doesn't track date of last modification. Are you proposing it
> should?

ALL my applications have a created and last_update field ...

> Word searches has lots of false hits. Standard scoring ideas distance
> of words from top, number of occurrences, proximity of requested words.
> Hard to combine these with date of last modification.
>
> Google news allow sorting by data. Regular Google does not, though they
> do track the age of documents.

And both produce a lot of crap. So anything that can be managed in a
flexible way. THIS is probably the main area for differences of opinion
- but all orderings would be correct in their own fields.

>>> 8. A filter boolean operator that can be mixed with SQL boolean
>>> operators is important.
>>
>>Is that my - search existing results? I think we need a temporary table
>>for this with the results list, and then allow tighter searches?
>
> I don't think so. A better strategy for too many hits is let the user
> reformulate the search.

Like Google, re-search with an extra +X or -X word. I just feel that
having reduced from 10million records to 100, working on the result set
would be more efficient than doing the whole search again? It's just the
search results button.

>>> 9. Search indexing should be html-aware
>>
>>And XML - filter to remove tags as an extension to the stop words?
>
> That's what I meant by html-aware. Don't index <br>.

Having now read the MSSQL functionality properly, they do the same sort
of thing on Office documents while building their 'catalogue'. The
advantage of working in PHP is that I can look at the data filtered and
only see the raw text - which ever format of input is provided. Perhaps
this is an area where BLOB filters could be useful?

>>>Exhaustive search is too expensive to even think about. Given a word
>>>index with word position, it's cheaper to evaluate a given record
>>>against the index than to apply an operator against a blob. Computing a
>>>record number bitmap from the word index is definitely the way to go.
>>
>>I would expect to add entries to the search index as records are
>>added/updated in the database - with the extension that allows scanning
>>an external file when it's name or modified date is changed.
>
> How would the system even know an external file was of interest?

As I said - they are managed via the database. A cross check is that the
files last modified date matches that stored in the database. ( Longhorn
done the correct way ;) )

>>>This is an area where the database world has dropped the ball big time.
>>>The entire world navigates the web starting at Google, but no mainstream
>>>database system supports the most basic multi-table search. Did the
>>>database world miss the web altogether?
>>
>>I think we agree there as well, but aren't we just talking about
>>additional functions to help an extension to the core engine rather than
>>something that is a totally integral function. So that application
>>specific changes can be included?
>
> Engine extensions are required.

Would this work as a bolt-on to Vulcan?
Having checked out the MSSQL way of working - it is only available on
NT,2k and 2003 Server versions. So something that would work on XP and
Unix/Linux could have a 'market'.

--
Lester Caine
-----------------------------
L.S.Caine Electronic Services