firebird-architect - Re: Indexing tales - part 1

Subject	Re: Indexing tales - part 1 - Siblings
Author	m_theologos
Post date	2006-10-09T16:54:31Z

--- In Firebird-Architect@yahoogroups.com, Jim Starkey <jas@...>
wrote:

>
> m_theologos wrote:
> > - ...In optimizing the CONTAINING predicate. Perhaps is necessary

> > have a few words on this:
> >
> >
> I'm afraid you don't understand the "containing" predicate. It is
> nothing more than a case insensitive substring search. It is not

word

> based and can't be indexed.
>

I don't try to implement here the WORD BASED search. This variant of
full text search I discuss in the other thread (see 'Indexing tables
- part 2'). Here I try to get your opinion regarding the phenomenon
that a 'CONTAINING' predicate needs to scan the entire table, which
span many pages. Perhaps it can be optimized to scan a tree, (if, of
course the tree will have pointers to next/prev sibling node in order
to enhance the traversal scan) because a tree will have almost always
a smaller number of 'records' (ie. nodes) so fewer comparisions will
be made. Also because the index is not so 'wide' as the table a
smaller ammount of pages will be fetched, and/or most probablly the
index will be used in other circumstances as 'Order by' because most
probably the user will do something like: Select * from t where name
containing 'Jim' order by name.

Also, siblings implementation has other important (IMHO) implications
(ie. situations which implies a (at least) partial transversal scan
of tree). See my first message.

As about the dedicated indexing, I will index _letters_ NOT _words_.
This is the main difference. So, we'll have the index (imagine that
the letters are in a tree. Between [] are recnos)

a - [ 1][ 100][ 123]
b - [ 4][ 123][ 345]
...
i - [ 888][ 890][1000][1020]
j - [ 12][ 888][ 890][1020]
k - [ 10]
m - [ 888][1020][2134]
...

...doing very few comparisions (how many chars can be in such an
index?) we'll have recno streams which we can do a logical 'AND'
between them and obtain a stream which in the above example for the
letters 'i', 'j', 'm' will be [ 888][1020] (believe me, I did this
with Delphi many years ago on a 486/66MHz using Paradox as backend
and is _very_ fast). After this go at recs [ 888] and [1020] and do
the old plain substring search to see if really our string ie. 'Jim'
is in these records, and finally we'll see that only in rec [ 888]
will be. (This was realtime on the above machine returning the blobs
imediately when the user pressed the key(s)). But perhaps here are
different things...

> Full context text search is quite a different beast. It is
> straightforward (and quite efficient) to implement as a btree, but
> rather than encoding record number, it must carry <table, field,

record,

> word position> to enable cross table, phrase, and proximity

weighting.

>

Yes, is a quite different thing. That's why I speak about it in the
other thread. But I think that isn't quite necessary to be cross-
table. This can be achieved using a JOIN perhaps?

> Historically, database guys don't "get" search. You're never going

> get any where trying to convince them you have a better way to

handle

> the index (I don't think you do) until you convince them that there

is a

> difference between full context word search and "containing". Good

luck.

>

Thanks :). Let the God help me.

hth.

m. th.