Subject Re: [Firebird-Architect] Google-like scoring in databases
Author Jim Starkey
Roman Rokytskyy wrote:

>Recently I indexed JDK 1.3.1 javadocs with Lucene. Searching for
>"(input stream)~5" (give all documents having "input" and "stream" and
>distance between them is not more than 5 words) scores
>javax.sound.sampled.AudioInputStream higher than java.io.InputStream.
>
>What do you think, does it make any sense to have Google-like scoring
>system in relational databases?
>
>Under "Google-like" scoring I mean scoring when not only lexical
>scoring of the document is relevant, but also number of references on
>that document. How would we define references in this case?
>
>Jim, some time ago you have described approach you implemented in
>Netfrastructure where multiple results sets are returned by search.
>Any comments from that point of view?
>
>
Bashful me? Comments?

I believe that the browser is the one true platform and life begins with
search. I keep the believers and send the doubters to Firebird.

The key elements of search are words and phrases that can be any
combination of required, optional, or proscribed, and a useful way to
return the results, which pretty much dictates a scoring mechanism.
St. Google has all the answers, but you've touched on some of the
scoring factors: number of elements that hit, distance from beginning
of the document of matched elements, distance between elements, number
of repetitions, etc. How the factors are weighted and combined to make
a final score is art, not science, and has everything to do with an
effective search.
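
To make the factors above concrete, here is a rough sketch of one way
they could be combined. It is not how Netfrastructure, Lucene, or Google
score anything: the Query class, the score function, the formula, and
the weights are all made up for illustration, and picking the weights is
exactly the "art" part.

```python
from dataclasses import dataclass, field


@dataclass
class Query:
    # Hypothetical query shape: three sets of lower-case terms.
    required: set = field(default_factory=set)
    optional: set = field(default_factory=set)
    proscribed: set = field(default_factory=set)


def score(text, query, weights=(10.0, 2.0, 5.0, 3.0)):
    """Return a relevance score, or None if the document is disqualified.

    The weights are arbitrary illustrative values, not tuned numbers.
    """
    w_hit, w_repeat, w_early, w_near = weights

    # Record every position at which each word occurs.
    positions = {}
    for i, word in enumerate(text.lower().split()):
        positions.setdefault(word, []).append(i)

    # A proscribed term or a missing required term rules the document out.
    if any(term in positions for term in query.proscribed):
        return None
    if any(term not in positions for term in query.required):
        return None

    matched = sorted(t for t in query.required | query.optional
                     if t in positions)
    if not matched:
        return None

    # Factor 1: number of query elements that hit.
    total = w_hit * len(matched)
    # Factor 2: number of repetitions beyond the first occurrence.
    total += w_repeat * sum(len(positions[t]) - 1 for t in matched)
    # Factor 3: distance from the beginning of the document (earlier is better).
    first = [positions[t][0] for t in matched]
    total += w_early * sum(1.0 / (1 + p) for p in first)
    # Factor 4: distance between elements (a tighter span is better).
    if len(first) > 1:
        total += w_near / (max(first) - min(first))
    return total


if __name__ == "__main__":
    q = Query(required={"input", "stream"})
    print(score("input stream classes read bytes from an input stream", q))
    print(score("an audio stream that wraps some other kind of input", q))
```

With these weights the first document outscores the second, because the
terms sit adjacent, near the start, and repeat. Change the weights and
the ordering can flip, which is the whole point.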

The primary design question of a search engine is whether the search
index is single field, single table, single schema, or database wide
(Netfrastructure is database wide, so for variety Firebird would
probably want to implement single field). But if you go past single
table, you need a results mechanism more powerful than a single row,
and a major extension to the API is necessary. JDBC is infinitely
extensible, but there is absolutely no way to extend ODBC, so usage
would be necessarily restricted to the native Firebird API. If the
search is multi-table, then you almost certainly want a mechanism to
restrict the search domain to a specific subset of tables, which also
affects the API.
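
As a toy illustration of why a multi-table search needs more than a
single row as its result, here is an in-memory sketch. Everything in it
(SearchIndex, Hit, the tables argument) is hypothetical and is not the
Netfrastructure or Firebird API; the score is just a count of matching
words. Each hit has to say which table, row, and field it came from,
and the search takes an optional set of tables to restrict the domain.

```python
from collections import defaultdict, namedtuple

# A hit must identify where it came from, not just carry column values.
Hit = namedtuple("Hit", "table key field score")


class SearchIndex:
    """Hypothetical database-wide word index held in memory."""

    def __init__(self):
        # word -> list of (table, key, field) locations containing it
        self._postings = defaultdict(list)

    def add(self, table, key, field, text):
        """Index one field of one row."""
        for word in set(text.lower().split()):
            self._postings[word].append((table, key, field))

    def search(self, words, tables=None):
        """Return hits ordered by score; tables=None searches everything."""
        counts = defaultdict(int)
        for word in words:
            for location in self._postings.get(word.lower(), ()):
                if tables is None or location[0] in tables:
                    counts[location] += 1
        hits = [Hit(t, k, f, n) for (t, k, f), n in counts.items()]
        # Highest score first; ties broken by table and key for stable output.
        hits.sort(key=lambda h: (-h.score, h.table, str(h.key)))
        return hits


if __name__ == "__main__":
    index = SearchIndex()
    index.add("articles", 1, "body", "input stream basics")
    index.add("articles", 2, "title", "audio input")
    index.add("comments", 7, "text", "stream of thoughts")
    for hit in index.search(["input", "stream"]):
        print(hit)
    print(index.search(["input", "stream"], tables={"comments"}))
```

The caller gets back a list of locations to go fetch, from whatever
tables matched, which is the part a plain single-table result set
cannot express.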

I'm not going to lay out a blueprint on how to implement search other
than to say that there is a scheme that once discovered is obviously
correct. If you have even the slightest doubt about a prospective
scheme, it's wrong and you need to think some more.

But the hardest problem is recognizing that search is of absolutely
critical importance in a database implementation. The rest is details.