Subject | RE: [firebird-support] Re: Full Text Searching For Firebird |
---|---|
Author | Rick Debay |
Post date | 2005-11-29T21:28:54Z |
Stemming. I know there is an open stemming dataset out there, but I
can't remember who maintained it.
You wouldn't want to create one on your own :-)
Rick
-----Original Message-----
From: firebird-support@yahoogroups.com
[mailto:firebird-support@yahoogroups.com] On Behalf Of Alexandre Benson
Smith
Sent: Monday, November 28, 2005 4:25 PM
To: firebird-support@yahoogroups.com
Subject: Re: [firebird-support] Re: Full Text Searching For Firebird
Ivan Prenosil wrote:
I've done a simple FTS in one pet project I have. I didn't need it right
know, but I think FTS could be a plus in any kind of system.
It's too simple to know if works ok in a large volume of data.
I have described it earlier but I will outline it here in a simple form.
1.) Documents Table (ID, Title, date, creator, brief text, text, binaery
representation (.PDF, .DOC, etc.))
2.) Words Table (ID, Text)
3.) Words x Document Table (ID, DocumentID, WordID, Position)
4.) Stop Words - Irrelevant words (ID, Text)
5.) Synonyns - (ID, Word1ID, Word2ID)
6.) an SP that receive as input parameter an Word (in text form) and a
DocumentID, this SP looks to see if the given word is relevant (i.e. not
a "stop word"), if so look if it's on the Word Table, if not insert it
in the word table and then insert in the "Word x Document" Table
7.) A small app that receives the binary documento, the ascii text of
it, and call the above SP for each word in the document to compose the
document word strucuture.
8.) A seach Procedure, taht receives a list of words to search (only
BOOLEAN AND I need at the moment) this SP looks for every document that
has all the supplied words (or it's sinonyms), igonre the stop words,
measure the distance beween the words and the position (words in the
title has big relevance), and calculate the document relevance), return
it as a record set.
I think it could be expanded to a multi-table search with a bit more
work. All things are hard-coded, so it's not easy expansible.
I have never used FTS on other products so I don't know how it works
What I think in terms from a user POV.
Internal facilities to help generated the above steps that I did
hard-coded.
like
Create FullTextIndex (that initializes the repository)
ways to tell the egine what should be indexed
Alter table MyDocs Alter Description add FullTextIndex (will add Field
Description of Table MyDocs in the index list).
Tiggers will be created internally by the engine to control data change
on field Description that would populate the FullText Index
A way to search the index like:
Select * from FullText where Match('Hommer Simpson Bart school
principal') > 0.5 order by Match
would list all references that has all the referenced words (boolean
search would be need too, expressiona a la AltaVista Search)
The result of the above query should be something similar too
Relevance TableName RecordID(PK) Date Brief
1.0 Cartoons 123 12/12/04 Homer and Bart go to school...
0.5 Cartoons 125 01/01/05 Homer, Lisa, Bart play at
school...
...
Another ways to narrow down the search supplying aditional parameters
Select * from FullText where Match('Hommer Simpson Bart school
principal') in table Cartoons > 0.5 order by Match or Select * from
Cartoons where Match('Synopse', 'Hommer Simpson Bart school principal')
Cartoons.
It's the user responsability to parse teh result set and work-out how to
present the real data to the user, not big deal in my opnion. The FTS
search should provide abstract information not real records (since table
structures will be completelly diferente from a table to another)
Please, don't take the syntax on count... I just want to show how I see
the things and the mechanisms I think should be implemented.
In simple words:
A way to initialize the FTS tables/indices A way to manage what data
should be indexed A way to search data in general with facilities to
narrow the search given table name/field name A way to measure the
relevance of the items found A way to manage synonyms and stop words
(human intervention) All the work to manage the data done internally by
the engine by triggers and internal functions.
Not necessary to say that the search should respect language especifics
(case/acent ignoring/etc)
I am sure, it's far more complex than I think it is, and that I have
overlooked what should be done, the performance penalties it generate
and so on. Don't know if it is the best way to do it (probably not)
don't know how well it will scale, don't know how it handle concurrency
and a large number of users. Don't know if the FTS data and indices
should reside on the same database (maybe for back-up reasons it should
be stored separetally, not sure if it will be good or not)
As you can see, I have nothing to add here, just wild guesses, at least
you can call me a educated person and don't let again your question
without answer. :-)
Fell free to contact me in private if you prefer, since I think this
topic is a bit off-topic on this list.
I'd love to hear your opnion about how FTS should be implemented/managed
by FB.
see you !
--
Alexandre Benson Smith
Development
THOR Software e Comercial Ltda
Santo Andre - Sao Paulo - Brazil
www.thorsoftware.com.br
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Visit http://firebird.sourceforge.net and click the Resources item
on the main (top) menu. Try Knowledgebase and FAQ links !
Also search the knowledgebases at http://www.ibphoenix.com
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Yahoo! Groups Links
can't remember who maintained it.
You wouldn't want to create one on your own :-)
Rick
-----Original Message-----
From: firebird-support@yahoogroups.com
[mailto:firebird-support@yahoogroups.com] On Behalf Of Alexandre Benson
Smith
Sent: Monday, November 28, 2005 4:25 PM
To: firebird-support@yahoogroups.com
Subject: Re: [firebird-support] Re: Full Text Searching For Firebird
Ivan Prenosil wrote:
>>>do youIvan,
>>>know if firebird intend to include a full text search in a future
>>>release? thanks
>>>
>>>
>...
>
>
>>In a future release FB will support Full Text Search, but in wich
>>release I don't know.
>>
>>
>
>Fulltext system can be very simple, or it can be very complex.
>Because many potential features have impact on performance, it is
>better to implement only those features people actually need.
>But so far every discussion ended before it even started, i.e. whenever
>I ask about requirements, nobody bother to just reply, so my
>understanding is that nobody seriously want/need fulltext search.
>
>Ivan
>
>
I've done a simple FTS in one pet project I have. I didn't need it right
know, but I think FTS could be a plus in any kind of system.
It's too simple to know if works ok in a large volume of data.
I have described it earlier but I will outline it here in a simple form.
1.) Documents Table (ID, Title, date, creator, brief text, text, binaery
representation (.PDF, .DOC, etc.))
2.) Words Table (ID, Text)
3.) Words x Document Table (ID, DocumentID, WordID, Position)
4.) Stop Words - Irrelevant words (ID, Text)
5.) Synonyns - (ID, Word1ID, Word2ID)
6.) an SP that receive as input parameter an Word (in text form) and a
DocumentID, this SP looks to see if the given word is relevant (i.e. not
a "stop word"), if so look if it's on the Word Table, if not insert it
in the word table and then insert in the "Word x Document" Table
7.) A small app that receives the binary documento, the ascii text of
it, and call the above SP for each word in the document to compose the
document word strucuture.
8.) A seach Procedure, taht receives a list of words to search (only
BOOLEAN AND I need at the moment) this SP looks for every document that
has all the supplied words (or it's sinonyms), igonre the stop words,
measure the distance beween the words and the position (words in the
title has big relevance), and calculate the document relevance), return
it as a record set.
I think it could be expanded to a multi-table search with a bit more
work. All things are hard-coded, so it's not easy expansible.
I have never used FTS on other products so I don't know how it works
What I think in terms from a user POV.
Internal facilities to help generated the above steps that I did
hard-coded.
like
Create FullTextIndex (that initializes the repository)
ways to tell the egine what should be indexed
Alter table MyDocs Alter Description add FullTextIndex (will add Field
Description of Table MyDocs in the index list).
Tiggers will be created internally by the engine to control data change
on field Description that would populate the FullText Index
A way to search the index like:
Select * from FullText where Match('Hommer Simpson Bart school
principal') > 0.5 order by Match
would list all references that has all the referenced words (boolean
search would be need too, expressiona a la AltaVista Search)
The result of the above query should be something similar too
Relevance TableName RecordID(PK) Date Brief
1.0 Cartoons 123 12/12/04 Homer and Bart go to school...
0.5 Cartoons 125 01/01/05 Homer, Lisa, Bart play at
school...
...
Another ways to narrow down the search supplying aditional parameters
Select * from FullText where Match('Hommer Simpson Bart school
principal') in table Cartoons > 0.5 order by Match or Select * from
Cartoons where Match('Synopse', 'Hommer Simpson Bart school principal')
> 0.5 order by Matchthis on will do a FTS but just on data on field Synopse on table
Cartoons.
It's the user responsability to parse teh result set and work-out how to
present the real data to the user, not big deal in my opnion. The FTS
search should provide abstract information not real records (since table
structures will be completelly diferente from a table to another)
Please, don't take the syntax on count... I just want to show how I see
the things and the mechanisms I think should be implemented.
In simple words:
A way to initialize the FTS tables/indices A way to manage what data
should be indexed A way to search data in general with facilities to
narrow the search given table name/field name A way to measure the
relevance of the items found A way to manage synonyms and stop words
(human intervention) All the work to manage the data done internally by
the engine by triggers and internal functions.
Not necessary to say that the search should respect language especifics
(case/acent ignoring/etc)
I am sure, it's far more complex than I think it is, and that I have
overlooked what should be done, the performance penalties it generate
and so on. Don't know if it is the best way to do it (probably not)
don't know how well it will scale, don't know how it handle concurrency
and a large number of users. Don't know if the FTS data and indices
should reside on the same database (maybe for back-up reasons it should
be stored separetally, not sure if it will be good or not)
As you can see, I have nothing to add here, just wild guesses, at least
you can call me a educated person and don't let again your question
without answer. :-)
Fell free to contact me in private if you prefer, since I think this
topic is a bit off-topic on this list.
I'd love to hear your opnion about how FTS should be implemented/managed
by FB.
see you !
--
Alexandre Benson Smith
Development
THOR Software e Comercial Ltda
Santo Andre - Sao Paulo - Brazil
www.thorsoftware.com.br
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Visit http://firebird.sourceforge.net and click the Resources item
on the main (top) menu. Try Knowledgebase and FAQ links !
Also search the knowledgebases at http://www.ibphoenix.com
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Yahoo! Groups Links