Subject | Indexing tales - part 1 - Siblings |
---|---|
Author | m_theologos |
Post date | 2006-10-09T09:05:55Z |
Hi,
Studying a little the FB indexes and some issuses about when the
indexes are 'good' or 'bad' I have some small ideeas to put in your
attention.
1. Siblings
Perhaps you can do in the node struct a pointer to the next/prev
sibling node (not page) ie. if we have the tree
4
/ \
2 6
/ \ / \
1 35 7
Then 1 will contain Prev=nil, Next=2
2 will contain Prev=1, Next=3
3 will contain Prev=2, Next=4
4 will contain Prev=3, Next=5
aso.
This, IMHO, (I'm not sure, I speak from my experience with other
environments...) will greatly enhance the tree performace in the
cases of scan, full scan or partial, as in situations of:
- Sort
- Join
- In optimizing the following predicates: >, <, >=, <=, Between,
Starting with
- Other situations which I don't know or remember, and...
- ...In optimizing the CONTAINING predicate. Perhaps is necessary to
have a few words on this:
<ranting>
"Today every user wants to do a "incremental, real-time partial match
anywhere" on his text data to see only the fields which contain his
keyword or sequence of chars. Full Text Search (the keyword approach)
and Really Full Text Seacrh (any sequence of chars inside of a text
field - ie. the CONTAINING predicate) are a very desired feature
today."
</ranting> (Sorry, Claudio...)
In the following lines I'll speak a little about the
A.) easy-way (IMHO) of optimizing the 'CONTAINING' predicate.
Current situation: 'Select * from T where TextField containing
'MySting'' performs a natural scan on table returning the recs.
But if we do 'Create Index Idx1 on T(TextField)' then this index, IF
it has a selectivity < 0.8 (let's say) _AND_ a pointer to the next
sibling node then perhaps is faster to do only 800.000 comparisions
and fetch only the record which are needed, rather than to do
1.000.000 on the main table. Also, because the index is smaller then,
perhaps, it fits on a smaller number of pages than the entire table
(so the scans are faster, IMHO) and, also, the index can be easely
cached.
B.) hard-way of optimizing the 'CONTAINING' predicate.
A new command looking something like 'CREATE BITMAP INDEX Idx1 ON
T(TextField) EXCLUDING ' .!@#$%^&*(){};,' which will do an index wich
will have as keys only the letters (excluding the common/non-
important ones) and then between of duplicates of each key will do a
logical 'AND'. (IMHO, it will greatly help here, but not only here,
some new memebers next to 'btr_all_recordnumber': The
'btr_first_recno' and 'btr_last_recno' in order to cut down some
'AND' operations (ie. intersections) very fast. Perhaps,
'btr_first_recno' isn't really needed because you have a pointer to
it...
This index structure will give us very fast the 'interestning'
records which will contain the records which have the letters
'M','y','s','t','r','i','n','g' in no particular order (a remark
here: the index will help also if we'll search substrings which has
duplicate letters - the comparision will take place only one time).
After this we'll check the real data from each record of the new
formed stream to see if really the text 'MyString' (using the
existing 'Containing' alg.) is there.
I discussed this first because I was (in a way) frustated that I
remembered an interview with Dmitry and I didn't found it on Google.
(also, because I had many problems with the users due to human
errors). Google facts:
Dmitry Yemanov - 14,900 hits
Dmitry Emanov - 77 hits
Dimitry Yemanov - 39 hits
Dmitri Yemanov - 6 hits
Dimitry Emanov - 1 hit
Dimitri Yemanov - 1 hit
Any comments?
hth,
m. th.
Studying a little the FB indexes and some issuses about when the
indexes are 'good' or 'bad' I have some small ideeas to put in your
attention.
1. Siblings
Perhaps you can do in the node struct a pointer to the next/prev
sibling node (not page) ie. if we have the tree
4
/ \
2 6
/ \ / \
1 35 7
Then 1 will contain Prev=nil, Next=2
2 will contain Prev=1, Next=3
3 will contain Prev=2, Next=4
4 will contain Prev=3, Next=5
aso.
This, IMHO, (I'm not sure, I speak from my experience with other
environments...) will greatly enhance the tree performace in the
cases of scan, full scan or partial, as in situations of:
- Sort
- Join
- In optimizing the following predicates: >, <, >=, <=, Between,
Starting with
- Other situations which I don't know or remember, and...
- ...In optimizing the CONTAINING predicate. Perhaps is necessary to
have a few words on this:
<ranting>
"Today every user wants to do a "incremental, real-time partial match
anywhere" on his text data to see only the fields which contain his
keyword or sequence of chars. Full Text Search (the keyword approach)
and Really Full Text Seacrh (any sequence of chars inside of a text
field - ie. the CONTAINING predicate) are a very desired feature
today."
</ranting> (Sorry, Claudio...)
In the following lines I'll speak a little about the
A.) easy-way (IMHO) of optimizing the 'CONTAINING' predicate.
Current situation: 'Select * from T where TextField containing
'MySting'' performs a natural scan on table returning the recs.
But if we do 'Create Index Idx1 on T(TextField)' then this index, IF
it has a selectivity < 0.8 (let's say) _AND_ a pointer to the next
sibling node then perhaps is faster to do only 800.000 comparisions
and fetch only the record which are needed, rather than to do
1.000.000 on the main table. Also, because the index is smaller then,
perhaps, it fits on a smaller number of pages than the entire table
(so the scans are faster, IMHO) and, also, the index can be easely
cached.
B.) hard-way of optimizing the 'CONTAINING' predicate.
A new command looking something like 'CREATE BITMAP INDEX Idx1 ON
T(TextField) EXCLUDING ' .!@#$%^&*(){};,' which will do an index wich
will have as keys only the letters (excluding the common/non-
important ones) and then between of duplicates of each key will do a
logical 'AND'. (IMHO, it will greatly help here, but not only here,
some new memebers next to 'btr_all_recordnumber': The
'btr_first_recno' and 'btr_last_recno' in order to cut down some
'AND' operations (ie. intersections) very fast. Perhaps,
'btr_first_recno' isn't really needed because you have a pointer to
it...
This index structure will give us very fast the 'interestning'
records which will contain the records which have the letters
'M','y','s','t','r','i','n','g' in no particular order (a remark
here: the index will help also if we'll search substrings which has
duplicate letters - the comparision will take place only one time).
After this we'll check the real data from each record of the new
formed stream to see if really the text 'MyString' (using the
existing 'Containing' alg.) is there.
I discussed this first because I was (in a way) frustated that I
remembered an interview with Dmitry and I didn't found it on Google.
(also, because I had many problems with the users due to human
errors). Google facts:
Dmitry Yemanov - 14,900 hits
Dmitry Emanov - 77 hits
Dimitry Yemanov - 39 hits
Dmitri Yemanov - 6 hits
Dimitry Emanov - 1 hit
Dimitri Yemanov - 1 hit
Any comments?
hth,
m. th.