Subject | Re: [Firebird-Architect] Full Text Search |
---|---|
Author | Lester Caine |
Post date | 2005-02-05T22:53:05Z |
Jim Starkey wrote:
> Lester Caine wrote:
>
>>And I want a mechanism that can be expanded to cover content of external
>>files.
>
> Not feasible. Search requires maintenance of a word index. There's no
> way to track changes to external tables to update the index.
The index of available documents is managed via the database. Updates to
a file would be managed via the database. This is BLOB <-> external data
rather than BLOB <-> internal data. Another part of the management
process is maintaining the history of changes, and allowing rollback to
previous versions. It could all be done in BLOBs, but direct access to
the files reduces the load in Firebird?
>>> 4. Word stem matching should be supported (ie support*)
>>
>>In addition to soundex
>
> I think you want to think very, very carefully about this. Soundex
> would mean two index entries per word. When you were doing a retrieval,
> would you do one with the actual words and one with the soundex (which
> is necessarily noisy due to the algorithm)?
>
> Is soundex better than a spelling corrector a la Google?
The example was a fuzzy search for a surname in genealogical data. You
are looking for a match on a set of words rather than a single word. I
don't see that there would be two index entries, just lookup on 'caine',
'cain' and 'kane' - Is this any different to comput* looking up
computer, computers, computing, computed etc.?
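
To make the comparison concrete, here is a minimal Python sketch. Everything in it is an assumption for illustration: the in-memory WORD_INDEX, the VARIANTS table of equivalent surnames and the function names are all hypothetical, and none of it is Firebird API. It only shows that both cases expand one search term into a set of index keys and then merge the record sets, with one index entry per word.

```python
from bisect import bisect_left

# Hypothetical word index: word -> set of record numbers.
WORD_INDEX = {
    "cain": {3},
    "caine": {1, 7},
    "kane": {4},
    "computed": {2},
    "computer": {2, 5},
    "computing": {6},
    "database": {5},
}
SORTED_WORDS = sorted(WORD_INDEX)

# Hypothetical application-supplied table of equivalent surnames.
VARIANTS = {"caine": {"caine", "cain", "kane"}}


def expand_prefix(prefix):
    """Return every indexed word that starts with prefix (the comput* case)."""
    start = bisect_left(SORTED_WORDS, prefix)
    words = set()
    for word in SORTED_WORDS[start:]:
        if not word.startswith(prefix):
            break  # sorted order: no later word can match
        words.add(word)
    return words


def expand_variants(word):
    """Return the group of equivalent words, or just the word itself."""
    return VARIANTS.get(word, {word})


def lookup(words):
    """Union the record sets of every word in the expanded set."""
    records = set()
    for word in words:
        records |= WORD_INDEX.get(word, set())
    return records


print(lookup(expand_variants("caine")))
print(lookup(expand_prefix("comput")))
```

Where the variant groups come from (soundex, a hand-built table, something else) is left open here; the lookup path is the same either way.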
>>Perhaps a flag on the 'dictionary'? Or a separate table with some sort of
>>grouping function?
>
> I've used a couple of different schemes. The common denominator is
> that I don't like any of them.
I think that is probably a main area of discussion.
>>> 7. Result ordering by "score" is very important.
>>
>>Another area of 'fun' - last modified date of a record could be as
>>important. Find the recent occurrences first.
>
> Firebird doesn't track date of last modification. Are you proposing it
> should?
ALL my applications have a created and last_update field ...
> Word searches have lots of false hits. Standard scoring ideas: distance
> of words from top, number of occurrences, proximity of requested words.
> Hard to combine these with date of last modification.
>
> Google news allows sorting by date. Regular Google does not, though they
> do track the age of documents.
And both produce a lot of crap. So anything that can be managed in a
flexible way. THIS is probably the main area for differences of opinion
- but all orderings would be correct in their own fields.
>>> 8. A filter boolean operator that can be mixed with SQL boolean
>>> operators is important.
>>
>>Is that my - search existing results? I think we need a temporary table
>>for this with the results list, and then allow tighter searches?
>
> I don't think so. A better strategy for too many hits is let the user
> reformulate the search.
Like Google, re-search with an extra +X or -X word. I just feel that
having reduced from 10 million records to 100, working on the result set
would be more efficient than doing the whole search again? It's just the
search results button.
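
A rough Python sketch of what I mean by working on the result set. The WORD_INDEX dictionary and the search/refine functions are hypothetical stand-ins, not anything that exists in Firebird. The first search produces a set of record numbers; a follow-up +X or -X only touches the entry for X and combines it with the set already held.

```python
# Hypothetical word index: word -> set of record numbers.
WORD_INDEX = {
    "firebird": {1, 2, 3, 4},
    "index": {2, 3, 4, 9},
    "blob": {3, 8},
    "soundex": {4},
}


def search(words):
    """Initial search: records that contain every requested word."""
    postings = [WORD_INDEX.get(word, set()) for word in words]
    return set.intersection(*postings) if postings else set()


def refine(results, term):
    """Narrow an existing result set with a '+word' or '-word' term."""
    sign, word = term[0], term[1:]
    hits = WORD_INDEX.get(word, set())
    if sign == "+":
        return results & hits  # keep only records that also contain word
    if sign == "-":
        return results - hits  # drop records that contain word
    raise ValueError("term must start with '+' or '-'")


results = search(["firebird", "index"])
print(refine(results, "+blob"))
print(refine(results, "-soundex"))
```

Whether the held result set lives in a temporary table or is simply recomputed by a reformulated query is exactly the open question above; the sketch only shows the set arithmetic.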
>>> 9. Search indexing should be html-aware
>>
>>And XML - filter to remove tags as an extension to the stop words?
>
> That's what I meant by html-aware. Don't index <br>.
Having now read the MSSQL functionality properly, they do the same sort
of thing on Office documents while building their 'catalogue'. The
advantage of working in PHP is that I can look at the data filtered and
only see the raw text - whichever format of input is provided. Perhaps
this is an area where BLOB filters could be useful?
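
As an illustration of the "filter to raw text before indexing" step, here is a small sketch using Python's standard html.parser module. The class and function names are my own, and this is not how a Firebird BLOB filter is written - it just shows the tags being dropped so that only the words reach the indexer.

```python
import re
from html.parser import HTMLParser


class TextOnly(HTMLParser):
    """Collect the text between tags and ignore the tags themselves."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        self.chunks.append(data)


def indexable_words(markup):
    """Return the lower-cased words of a document with markup removed."""
    parser = TextOnly()
    parser.feed(markup)
    parser.close()
    # Joining with a space treats every tag as a word break, which is a
    # simplification: inline tags inside a word would split that word.
    text = " ".join(parser.chunks)
    return re.findall(r"[a-z0-9]+", text.lower())


print(indexable_words("<p>Full text<br>search in <b>Firebird</b></p>"))
```

Stop words would then be removed from the returned list in the usual way.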
>>>Exhaustive search is too expensive to even think about. Given a word
>>>index with word position, it's cheaper to evaluate a given record
>>>against the index than to apply an operator against a blob. Computing a
>>>record number bitmap from the word index is definitely the way to go.
>>
>>I would expect to add entries to the search index as records are
>>added/updated in the database - with the extension that allows scanning
>>an external file when its name or modified date is changed.
>
> How would the system even know an external file was of interest?
As I said - they are managed via the database. A cross check is that the
file's last modified date matches that stored in the database. ( Longhorn
done the correct way ;) )
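
The cross check could look something like this Python sketch. The catalogue dictionary stands in for a database table of path and stored modification time; that layout and the function name are assumptions for illustration only.

```python
import os


def files_needing_reindex(catalogue):
    """Return paths whose modification time differs from the stored one.

    catalogue maps a file path to the modification time (in nanoseconds)
    recorded when the file was last indexed. Missing files are reported
    as well, so that their index entries can be dropped.
    """
    stale = []
    for path, stored_mtime_ns in catalogue.items():
        try:
            current_mtime_ns = os.stat(path).st_mtime_ns
        except FileNotFoundError:
            stale.append(path)
            continue
        if current_mtime_ns != stored_mtime_ns:
            stale.append(path)
    return stale
```

Only files already registered in the database are checked, which is the point: the system knows a file is of interest because it has a catalogue entry.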
>>>This is an area where the database world has dropped the ball big time.
>>>The entire world navigates the web starting at Google, but no mainstream
>>>database system supports the most basic multi-table search. Did the
>>>database world miss the web altogether?
>>
>>I think we agree there as well, but aren't we just talking about
>>additional functions to help an extension to the core engine rather than
>>something that is a totally integral function. So that application
>>specific changes can be included?
>
> Engine extensions are required.
Would this work as a bolt-on to Vulcan?

Having checked out the MSSQL way of working - it is only available on
NT, 2k and 2003 Server versions. So something that would work on XP and
Unix/Linux could have a 'market'.
--
Lester Caine
-----------------------------
L.S.Caine Electronic Services