Subject Re: Document Indexing/Querying experience with FB?
Author cprund
Here's an idea. Never done this, but should work.

First you brake the problem into two distinct and related problems.
The first is parsing and validating the user-entered query into
something you'll then use to generate SQL. This phase should leave
you with a "tree" view of the user-entered query. Think "reverse
polish format".

The second phase will take the reverse polish format tree generated
in phase one and turn it into a plain-text SQL query to be run
against the Firebird server.

Detailed description:

Your users have two basic operators (AND, OR), have the ability to
use pharanteses and can search for a single word or for a phrase.
While searching for a single word can be considered to be a special
case of the multi-word search, I'd treat it separately as you may
generate simpler SQL for the single-word case.

The "nodes" in the tree format you'd be generating will all contain
an single operator (AND or OR) and have exactly two sub-nodes. The
two sub-nodes may be full nodes (ie: again, one operator + two
subnodes) or the subnodes may be search words. All nodes should be
able to generate SQL code for themselvs, using the same ability of
the sub-nodes to generate more complex SQL. (think OOP, polymorphism).

The SQL generated for an "AND" or an "OR" should be really simple: it
just uses the code for it's two sub-nodes and puts the operator in
the midlle.

The SQL generated for a "final" node would be a bit more complicated,
depending on the kind of node (ie: single word search, multiple word
search).

Imagine you're starting with a "base" query of:

"select distinct DW.DOCID from DOCUMENTwORD DW where..."

The nodes in your tree representation of the query would generate the
"where" part of the query.

The single-word where clause would look like this:

"(DW.WORDID=7)"

The operator-based where clause would look like this:

"(A OP B)" - where A is the where cause generated by the "left" sub-
node, "OP" is "OR" or "AND", B is the "where" generated by the
"right" sub-node.

More complex example.

User enteres:
("Tom" and "Swayer" and "Buck") or "Firebird"

First stage, I find the wordid for each of my words. I get:
"Tom"=1
"Swayer"=2
"Buck"=3
"Firebird"4

The final tree representation would look like this:

+---------------------- OR -----+
+----- AND --------+ "Firebird/4"
AND "Swayer/1"
"Swayer/2" "Buck/3"

SQL generation starts at the first node (the "OR"). It then
recursivley goes to the the first "AND" on the left, and then to the
second "AND" on the left, and in the end the generated "where" clause
looks like this:

((((DW.WORDID = 2) AND (DW.WORDID = 3)) AND (DW.WORDID = 1)) OR
(DW.WORDID = 4))

And now I realise that my query would not give the expected result,
but the way to build it it's ok. It just requires different "where"
clauses.

As I've initially said, you're actually facing two distinct problems.
First problem is parsing user input. Unless you have experience doing
lexers/parsers, you're going to have a tough job doing this. The
second problem is generating SQL for your pre-parsed query. You'll
face SQL problems this time (ie: know your way around JOINS and
Indexes and SQL sintax).

I don't think going to a different database engine would make a
difference. You'll still need to do both parts of the problem
(parsing user input, generating SQL query). The first half of the
problem would remain unchanged. For the second half of the problem, a
database with built-in full text search engine might give you simpler
SQL operators, but not much simpler and, behind the sceens, the query
would work the same! Depending on what you want out of the engine,
the built-in full text search might not be effective enough (ex: I
really doubt you'll get "line number" and "word number" out of the
thing)

--
Cosmin Prund

--- In firebird-support@yahoogroups.com, "Tom Conlon" <tomconlon@...>
wrote:
>
> No-one?
>
> :(
>
> --- In firebird-support@yahoogroups.com, "Tom Conlon" <tomconlon@>
> wrote:
> >
> > Hi All,
> >
> > Does anyone have experience of the above and can offer some tips?
> >
> > Especially with regard to extracting results using bracketed
keywords
> > and phrases (nothing more fancy is required currently, but I have
> > found mapping this query to SQL to be much more challenging than
first
> > appears):
> >
> > ("TRAINEE POLICEMAN" OR "TEAM LEADER") AND
> > (MONETARY OR INSURANCE AND "PRIVATE BANK")
> >
> > I've got a certain amount working with one of the key tables
> > documented below:
> >
> > CREATE TABLE DOCUMENTwORD
> > (
> > DOCID INTEGER NOT NULL,
> > WORDID INTEGER NOT NULL,
> > WORDNO SMALLINT DEFAULT 0 NOT NULL,
> > LINENO SMALLINT DEFAULT 0 NOT NULL,
> > );
> >
> > The wordno column is used in locating phrases.
> >
> > For the sake of storing the lineno it allows for the
> > recreation of a plain text version of the document.
> >
> > Is FB the best db for this type of task (250,000+ documents). The
full
> > text features of some other db's didn't quite give me what I was
after.
> >
> > Any thoughts certainly appreciated.
> > Tom
> >
>