firebird-architect - Re: Index tales - part 2

Subject	Re: Index tales - part 2 - Keyword FTS
Author	m_theologos
Post date	2006-10-11T08:14:13Z

--- In Firebird-Architect@yahoogroups.com, Jim Starkey <jas@...>
wrote:

>
> m_theologos wrote:
> > ResultList
> >
> >> that can be iterated similar to a ResultSet, but the "value" of

> >> ResultList is a ResultSet rather than a scalar value.
> >>
> >
> > And so, you'll take out the result from the server's engine. The
> > entire processing must be done on the client side building the
> > appropiate functions from scratch. I rather prefer a more SQL
> > approach, using JOINS, Views aso.
> >
> Your conclusion on what is server and what is client is both
> unwarranted and wrong.
>
> Context free search is the the sine qua non of the web. The fact

that

> they can't be expressed in SQL is a deficiency of SQL, not the web.

Yes, but we must deal now with a SQL server. (I don't like very much
SQL language either). Do you mean to extend the language in order to
allow it to deal with both relational and non-relational data on
server? And/or make API calls in order to mix the relational data
with context free search results? The main problem is to 'join' these
results. That's why I stated that the 'data' must get out from the
server in a computationally-complete language to process it.

> ...
> > Also, please observe that, generally speaking, each column to be
> > indexed tend to has its own vocabulary.
> Why is this significant?

Let's take an example:

We have two tables:
1. SUPPLIERS
============
NAME
ADDRESS
EMAIL
WWW
NOTES
...

2. PRODUCTS
===========
NAME
NOTES
...

For the EMAIL and WWW fields the 'com', 'net', 'org' will be stop-
words (in order to not have index bloat).

For the PRODUCTS.NAME will be not. (Because we'll loose products like
'COM port adaptor', 'Fishing NET' aso.

> And in any case, the aggregate size of
> multiple indexes will alway be much greater than the sum of the

sizes of

> individual indexes. Search across multiple indexes will also be

much

> slower (and harder to implement) than a combined index.
>

Harder to implement? perhaps using the actual JOIN engine (the
KEYWORD/MATCHING predicate will get the streams).

> I think you need to understand how word indexes are searched before
> jumping to unwarranted conclusions.
> > I think that (IMHO) is better to implement step-by-step things

(no

> > stupid crap inside the engine of course), rather than leaving
> > unimplemented a feature because we cannot make it perfect from

the

> > beginning.
> >
> I'm sorry you think that search is "stupid crap". I hope most

people

> disagree with you.

No, I didn't meant that the search is "stupid crap". I wanted to
clearly state that perhpaps is better to do incrementally the
features, thing which doesn't imply that we need to do half-backed
ones. On contrary, I think that the search, and especially the
quality of searching, is one of the most important problems today.

That's why I propose a working feature.
(In fact, I'm amazed a little of MySQL's history which, despite of
'discutable things' from its architecture, is the most popular open-
source SQL server. I think that this is due that it has features
which work and the men use them. I don't like that, perhaps,
sometimes they can choose better design sollutions.)

As I stated again, if developement resources exists, then do a better/
more feature rich implementation... or if your sollution is better
than mine, so far so good. You know your code better than me. I did
these things before but only building dedicated, custom-made FTS
search engines which must search near real-time their data. I didn't
deal very much with the inners of Firebird.

> > ...but please take in consideration that behind the web page is

> > app which deals with concrete kinds of data orgainzed in tables,
> > "kinds" from _human_ point of view (I don't mean here 'data

types'),

> > so as you observed, 'There are good uses for a heavily

restrictive

> > search...' In conclusion, I think that a multi-table multi-field

FTS

> > index is good to have but having only this is a 'heavy' thing to

deal

> > with IMHO. (No SQL, lack of speed, difficult to refine aso.)
> >
> I haven't argued against a restricted search. Netfrastructure, in

fact,

> supports restriction to specific tables or fields in both the API

and

> SQL extensions. That isn't the issue. The issue is whether a

combined

> multi-table, multi-column index is better or worse than single

field

> index. Clearly the multi-table/field index can be used to search a
> single field efficiently. Do you claim the reverse is true -- that

> multi-table, multi-field search based on an indeterminate number of
> individual field indexes is efficient?
>

I think that this is the main difference between us. Also I think
that it comes from the fact that I work 'datatabase apps' (with a
presentation layer on Desktop and/or web - doesn't matter) in which a
page with a general 'Search' edit field is rarely used, and sometimes
dangerous. For example, issuing a Search for the word 'secret',
looking for a product the stream will return (where? in a table? in a
array?) something like:

Products.Name: "Lord's secret" (A book, for example)
Acc.Description: "My best kept secret account"

So, multi-table FTS searching will be rarely used and thus, perhaps,
such a global composite index will make things slower?
To be honest until the end, I have some app in which I do a cross-
table keyword search but there, IMHO, is much more appropiate to have
a keyword 'warehouse' from which to extract them in order to have the
keyword persistence, thing without the quality of any such cross-
table keyword search will be very discutable. (That's why in my fist
message from 'Index tales - part 1' I presented some Google results)
Ie. I have a struct like:

Table 1: Books(BookID, NAME, DIZ, <...other fields...>);
Table 2: Paintings(PaintingID, NAME, DIZ, <...>);
<...Other similar tables...>

Table: Keywords(KeyID, NAME, ParentID);

Link Tables:

BookKeys(BookID, KeyID);
PaintingKeys(PaintingID, KeyID);
<...etc...>

I was enforced to this structure in order to maintain the high
quality of search. One user (with one vocabulary) entered the books
data (using for example the word 'amazing') and another user entered
the paintings data (using the word 'marvellous' and, sometimes,
'marvelous'). There were many problems (especially because the
english and greek (the app's language) as many other languages aren't
'wyswyg' so the user's mistakes and word different prefferences are
very probable. Also, Building this table I saw that not so many
keywords are common to all of tables or a group of tables, even if
the tables are storing objects that rather are close between them.
The things are the same in other apps in which I use this (shared)
keyword engine.

On the other side, you're targeted to 'web pages' where as you stated
the 'Search the entire site' is sine qua non.

hth,

m. th.

>
> --
>
> Jim Starkey, Senior Software Architect
> MySQL AB, www.mysql.com
> 978 526-1376
>