firebird-architect - Full Text Search

Subject	Full Text Search
Author	Lester Caine
Post date	2005-02-05T12:55:28Z

I've taken a little break to have a proper look at Full Text Search
since I could do with tidying up this area, and I've found a couple of
articles that 'explain' some of the possible requirements.

MSSQL Full Text Search is an add on to MSSQL with it's own search
service, and how it works is explained in a series of articles ( yet to
be completed ) in Database Journal.
http://www.databasejournal.com/features/mssql/article.php/3441981
http://www.databasejournal.com/features/mssql/article.php/3454281
This months article will be on Advanced Search and Weighting
In essence this is what the IBO FTS provides, but with part of the
functionality build into triggers.
http://www.ibobjects.com/docs/FTS_ONLINE/FTS_index.html

MySQL has a built in Full Text Search implemented as an INDEX on defined
fields within the database. Just who this works in largely dependent on
which version of MySQL you are running, and is only available in MyISAM
tables. As of 4.1.1 many of the restrictions, such as there having to be
more than 4 characters in a search word ( may have been higher, I can't
find an old manual ;) ), have been removed, limited Unicode capability
has been added, and only exact match is supported.
The current MySQL manual can be found at
http://dev.mysql.com/doc/mysql/en/fulltext-search.html
with further notes on later pages.

I can't find any references to Full Text Search in the SQL2003 spec,
although I would probably expect it to form an additional part of the
spec, rather than integrated into the 'Foundation' part. There is a
large volume of definitions relating to character comparision within
character stings and how they are managed, Specifically the LIKE clause,
but STARTING WITH and CONTAINING clause are not standard SQL2003.

From a personal point of view, what I am looking for from 'FULL TEXT
SEARCH' is a way of scanning a large volume of text data for occurances
of surnames or places. The data is birth, marriage and death
certificates, wills, court proceeding, directories and the like. I need
EXACT and SOUNDEX matches, to identify the location of the match and
ideally then to look at the actual data. Where a volume of matches is
returned, then there should be an ability to search just the identified
results set for an additional text match.

The idea is that one should be able to search for 'CAINE' and match all
the 'CAIN', 'KANE', 'CANE', 'KINE' equivalents, and then search on
'BIRTH' or 'ISLE OF MAN' to further restrict the result set.

Hopefully this starts fleshing out the question - "What is Full Text
Search?"
I hope we are going down the route of a proper search engine, supported
by internal functions that enable that to work efficiently. At the very
least I see a need to create a FULL TEXT index that contains a
dictionary of words, and pointers to the location of each occurrence of
the word. Since the identification of records is a major part of this, I
don't see that a 'fully automatic' implementation is possible unless we
introduce a string of restrictions, as in the MySQL case. What I think
we are probably looking for is a set of ground rules against which other
functions can be implemented?

A simple starting point from my viewpoint is a convention on
numbering/identifying words in a document, be that a BLOB or an external
file, so that we have a standard pointer method. Doc27/Word32 - so that
if two words are being looked for we can match adjacent occurrences or
any occurrence of the words.

How does this fit in with other peoples thinking?

--
Lester Caine
-----------------------------
L.S.Caine Electronic Services