Subject | Re: Indexing tales - part 1 - Siblings |
---|---|
Author | m_theologos |
Post date | 2006-10-09T16:54:31Z |
--- In Firebird-Architect@yahoogroups.com, Jim Starkey <jas@...>
wrote:
full text search I discuss in the other thread (see 'Indexing tables
- part 2'). Here I try to get your opinion regarding the phenomenon
that a 'CONTAINING' predicate needs to scan the entire table, which
span many pages. Perhaps it can be optimized to scan a tree, (if, of
course the tree will have pointers to next/prev sibling node in order
to enhance the traversal scan) because a tree will have almost always
a smaller number of 'records' (ie. nodes) so fewer comparisions will
be made. Also because the index is not so 'wide' as the table a
smaller ammount of pages will be fetched, and/or most probablly the
index will be used in other circumstances as 'Order by' because most
probably the user will do something like: Select * from t where name
containing 'Jim' order by name.
Also, siblings implementation has other important (IMHO) implications
(ie. situations which implies a (at least) partial transversal scan
of tree). See my first message.
As about the dedicated indexing, I will index _letters_ NOT _words_.
This is the main difference. So, we'll have the index (imagine that
the letters are in a tree. Between [] are recnos)
a - [ 1][ 100][ 123]
b - [ 4][ 123][ 345]
...
i - [ 888][ 890][1000][1020]
j - [ 12][ 888][ 890][1020]
k - [ 10]
m - [ 888][1020][2134]
...
...doing very few comparisions (how many chars can be in such an
index?) we'll have recno streams which we can do a logical 'AND'
between them and obtain a stream which in the above example for the
letters 'i', 'j', 'm' will be [ 888][1020] (believe me, I did this
with Delphi many years ago on a 486/66MHz using Paradox as backend
and is _very_ fast). After this go at recs [ 888] and [1020] and do
the old plain substring search to see if really our string ie. 'Jim'
is in these records, and finally we'll see that only in rec [ 888]
will be. (This was realtime on the above machine returning the blobs
imediately when the user pressed the key(s)). But perhaps here are
different things...
other thread. But I think that isn't quite necessary to be cross-
table. This can be achieved using a JOIN perhaps?
hth.
m. th.
wrote:
>to
> m_theologos wrote:
> > - ...In optimizing the CONTAINING predicate. Perhaps is necessary
> > have a few words on this:word
> >
> >
> I'm afraid you don't understand the "containing" predicate. It is
> nothing more than a case insensitive substring search. It is not
> based and can't be indexed.I don't try to implement here the WORD BASED search. This variant of
>
full text search I discuss in the other thread (see 'Indexing tables
- part 2'). Here I try to get your opinion regarding the phenomenon
that a 'CONTAINING' predicate needs to scan the entire table, which
span many pages. Perhaps it can be optimized to scan a tree, (if, of
course the tree will have pointers to next/prev sibling node in order
to enhance the traversal scan) because a tree will have almost always
a smaller number of 'records' (ie. nodes) so fewer comparisions will
be made. Also because the index is not so 'wide' as the table a
smaller ammount of pages will be fetched, and/or most probablly the
index will be used in other circumstances as 'Order by' because most
probably the user will do something like: Select * from t where name
containing 'Jim' order by name.
Also, siblings implementation has other important (IMHO) implications
(ie. situations which implies a (at least) partial transversal scan
of tree). See my first message.
As about the dedicated indexing, I will index _letters_ NOT _words_.
This is the main difference. So, we'll have the index (imagine that
the letters are in a tree. Between [] are recnos)
a - [ 1][ 100][ 123]
b - [ 4][ 123][ 345]
...
i - [ 888][ 890][1000][1020]
j - [ 12][ 888][ 890][1020]
k - [ 10]
m - [ 888][1020][2134]
...
...doing very few comparisions (how many chars can be in such an
index?) we'll have recno streams which we can do a logical 'AND'
between them and obtain a stream which in the above example for the
letters 'i', 'j', 'm' will be [ 888][1020] (believe me, I did this
with Delphi many years ago on a 486/66MHz using Paradox as backend
and is _very_ fast). After this go at recs [ 888] and [1020] and do
the old plain substring search to see if really our string ie. 'Jim'
is in these records, and finally we'll see that only in rec [ 888]
will be. (This was realtime on the above machine returning the blobs
imediately when the user pressed the key(s)). But perhaps here are
different things...
> Full context text search is quite a different beast. It isrecord,
> straightforward (and quite efficient) to implement as a btree, but
> rather than encoding record number, it must carry <table, field,
> word position> to enable cross table, phrase, and proximityweighting.
>Yes, is a quite different thing. That's why I speak about it in the
other thread. But I think that isn't quite necessary to be cross-
table. This can be achieved using a JOIN perhaps?
> Historically, database guys don't "get" search. You're never goingto
> get any where trying to convince them you have a better way tohandle
> the index (I don't think you do) until you convince them that thereis a
> difference between full context word search and "containing". Goodluck.
>Thanks :). Let the God help me.
hth.
m. th.