Subject Re: Full Text Search
Author davidrushby
Lester Caine:
> > Jakarta Lucene is one of the packages I have looked at, but
> > for me it has a major drawback - "written entirely in
> > Java" :)
>
Roman Rokytskyy:
> There is a C++ port, but I do not know the quality of it.

I do: low. I've spent most of my free time during the past year
trying to use the C++ port and being frustrated by its bugs. I've
personally fixed about 97 bugs so far; every time I try to ascend to
writing client code, I have to dive back into the internals to fix
another bug. Some aspects of the C++ port's code, such as
concurrency, need to be totally discarded and rewritten from the
ground up.

The good news is that in the process I've constructed a Python wrapper
and test suite, so the bugs are being gradually eradicated, and if
they return to those code paths covered by the test suite, will be
detected in short order.

I *need* the C++ port to be reliable, so eventually it will be.
Victory is inevitable; only the degree of "Pyrrhus-ism" is in
question.

> > WHAT it is doing is no different to what we want to do. The
> > problem is HOW it does it - by building a large number of
> > files with all the various indexes in, and I could not work
> > out just how many files it needed for a simple index ;)
>
> Each file including its format is described on Lucene website.
> In general, there are frequency files, word files, and actually
> index segments (i.e. nodes).

Lucene stores and retrieves each index via an abstract Directory
interface; a transactional implementation could be created (at least
one has been, based on bdb).

Lucene indexes consist of one or more segments, which, imprecisely
speaking, are "sub-indexes". A fully "optimized" index consolidates
everything into a single segment to lower the cost of searching, but
multiple segments can exist, typically as a result of the simultaneous
augmentation and searching of the index.

Most aspects of this process are parameterized so the client
programmer can choose an appropriate compromise between speed of
indexing and speed of searching (the options that favor speed of
indexing lengthen the pauses during which search operations are
blocked). The actual number of files depends on a variety of factors;
it varies depending on the complexity of the index (how many Fields
each Document has) and the number of segments. The simplest index in
older versions of Lucene required at least 8 files per segment, but
the latest versions introduced an index format designed to reduce the
number of file handles required to be open at any given time. The C++
port, which is approximately even with Lucene 1.2, doesn't implement
that latter format.

My original question in this thread was one of optimization rather
than possibility. Specifically, are the full-text search
infrastructures integrated into relational databases standard inverted
index designs, like Lucene, or are the designs heavily modified to
better fit into a relational environment?

> > The only difference between Lucene and an internal full text
> > search is the management of the index and data.
>
> That's not true. The biggest difference is that Lucene knows
> what it returns as an answer to a query - document IDs. And a
> document is a collection of fields. We still do not know what
> we want to return. Jim claims that we should return a collection
> of all possible records from different tables (e.g. one records
> from CUSTOMER table, two from ADDRESS table, and so on). I'm not
> comfortable with this, but so far nobody suggested any other
> solution.

In Lucene, a Document is a collection of Fields; a Field is a (string
name -> string value) pair. If I were indexing relational data with
Lucene, my first impulse would be to create a Document for every table
row, and a Field for every column that needed to be indexed. Lucene
document IDs are ephemeral and an internal implementation detail, so
it wouldn't be safe to equate them to primary keys from the DB;
instead, the primary key would be stored in a Field.

As for how many tables, rows, and columns an individual query should
cover, I'd say that should be left up to the end user, rather than
defined by the database engine.

> Lester, let's assume we have this full-text feature. How do you
> want to use it? Please sketch some queries.

Milan Babuskov:
> SELECT table_name, record_id, title, details
> FROM movie, actor, director
> WHERE any_field HAS TEXT SIMILAR TO 'usual suspects'
> ORDER BY relevance_factor DESC;

I don't have an opinion as to whether full text search functionality
even belongs in a relational database engine, because I've never used
a DB that featured it. But in any case, I don't see how it would be
possible for every current and future user of Firebird to agree on
what "SIMILAR TO" means, or to agree on a ranking algorithm, or a
query syntax (does 'usual suspects' mean (phrase "usual suspects"),
(term "usual") AND (term "suspects"), (term "usual") OR (term
"suspects"), ...).

Why not define a standard programmatic interface through which full
text index+search implementations (providers, in Starkeyspeak)
interact with the database engine? That way, domain-specific search
engines could be loaded by the server from dynamic libraries, and the
criteria of "similarity" and "relevance" would be malleable. Although
Lucene may not be an optimal basis for such an interface, Lucene's
design is exemplary and deserving of study.

Jim Starkey:
> I've found that maintaining the original binary format document
> (Word, PDF, etc.) separately from the "searchable" textual
> representation to be a good balance. It is almost always
> necessary to retain the original for download and revision. It
> also is very poor practice to mix textual and binary stuff in
> the same blob column.

I agree. The task of extracting indexable plain text from rich
formats such as MS Word ought to be separate from the tasks of
indexing and searching that text.