| Subject | Re: [Firebird-Architect] Re: Index structures |
| --- | --- |
| Author | Jim Starkey |
| Post date | 2003-06-10T16:43:24Z |
At 05:05 PM 6/10/03 +0200, Pavel Cisar wrote:
>Well, because the db engine is always disk-bound, it may seem
>straightforward, but it's not, because many factors come into play here.
>The main ones are:
>
>1) The actual fruitfulness of key compression. This is affected by key
>complexity (number of key segments, segment types, data mangling) and
>data distribution. While some indices may have a great benefit from
>prefix compression (especially with a lot of duplicates), others may not
>and compression may turn into an obstacle. Actually, I'd like to collect
>and see some real-world gstat data for indices and try to find out if
>there are some identifiable non-trivial conditions when prefix compression
>would (not) help.
Key compression is one of the few algorithms that get more fruitful as
the cardinality goes up. Keep in mind that it is only possible to have
255 index entries before index compression kicks in. And let's not forget
the benefits of tail compression as well.
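To make that concrete, here is a rough sketch of how prefix and tail
compression might work when a key is stored after its predecessor. It is
only an illustration of the idea; the node layout and the zero-byte tail
rule are invented for the example, not Firebird's or Netfrastructure's
actual on-page format.

```cpp
// Illustrative sketch only -- not an actual on-page node format.
// Each stored node keeps: the number of leading bytes shared with the
// previous key (prefix compression) and the remaining bytes with
// trailing pad bytes trimmed off (tail compression).
#include <cstdio>
#include <string>
#include <vector>

struct CompressedNode {
    unsigned prefix;     // bytes shared with the preceding key
    std::string suffix;  // remaining bytes, tail-trimmed
};

static CompressedNode compress(const std::string& prev, const std::string& key)
{
    // Prefix compression: count leading bytes identical to the previous key.
    size_t p = 0;
    while (p < prev.size() && p < key.size() && prev[p] == key[p])
        ++p;

    // Tail compression: drop trailing zero pad bytes (illustrative rule).
    size_t end = key.size();
    while (end > p && key[end - 1] == '\0')
        --end;

    return { static_cast<unsigned>(p), key.substr(p, end - p) };
}

int main()
{
    // Sequential keys share long prefixes, so most of each key vanishes.
    std::vector<std::string> keys = { "00021997", "00021998", "00021999", "00022000" };
    std::string prev;
    size_t raw = 0, stored = 0;
    for (const std::string& k : keys) {
        CompressedNode n = compress(prev, k);
        raw += k.size();
        stored += 1 + n.suffix.size();  // one byte for the prefix count
        std::printf("prefix=%u suffix=\"%s\"\n", n.prefix, n.suffix.c_str());
        prev = k;
    }
    std::printf("%zu bytes raw, %zu bytes stored\n", raw, stored);
    return 0;
}
```

The four sequential keys above shrink from 32 raw bytes to 18 stored
bytes, and the effect only grows as more keys pile up with the same
leading bytes.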
But don't take my word for it -- measure it. I checked it from time to
time (possibly the measurement code is still lurking in there someplace).
Here are some numbers that I just ground out of Netfrastructure, which
uses the same scheme. The index is the sequentially numbered primary key
of a 22,000-record table, page size 2K. The key is a 4-byte int. The
index is two levels:
Index page 3771, level 1, 141 nodes, 266 bytes compressed, 484 uncompressed
Index page 25543, level 0, 142 nodes, 166 bytes compressed, 501 uncompressed
Note that the deeper you go into the index, the more effective the compression.
Here is some data from the large word inversion:
Index page 38837, level 3, 2 nodes, 11 bytes compressed, 11 bytes uncompressed
Index page 3300, level 2, 89 nodes, 1028 bytes compressed, 1152 bytes uncompressed
Index page 31083, level 1, 83 nodes, 543 bytes compressed, 948 bytes uncompressed
Inversion page 46363, 274 nodes, 1291 compressed, 3840 uncompressed
It ranges from no compression at the root to two-thirds compression at the leaf.
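Just to spell out the arithmetic behind those figures (the numbers below
are exactly the ones quoted above, nothing more), the savings climb from
zero at the root to roughly two-thirds at the leaf:

```cpp
// Arithmetic check of the per-level compression figures quoted above.
#include <cstdio>

int main()
{
    struct Level { const char* page; int compressed; int uncompressed; };
    const Level levels[] = {
        { "PK index, level 1",  266,  484  },
        { "PK index, level 0",  166,  501  },
        { "inversion, level 3", 11,   11   },
        { "inversion, level 2", 1028, 1152 },
        { "inversion, level 1", 543,  948  },
        { "inversion leaf",     1291, 3840 },
    };
    for (const Level& l : levels)
        std::printf("%-20s %4d / %4d bytes -> %2.0f%% saved\n",
                    l.page, l.compressed, l.uncompressed,
                    100.0 * (l.uncompressed - l.compressed) / l.uncompressed);
    return 0;
}
```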
>So, I guess that we *could* implement Arno's design as an additional
>mechanism for indices, and leave the decision to the user/admin. At the
>same time, we *should* provide tools/mechanisms to compute (try-build)
>characteristics for both methods for a given key on current data (probably
>as part of gstat?).
It's not my project, so I'm not going to try to tell you what to do. But
I will say why I wouldn't implement it in Netfrastructure.
Simply stated, I don't like tuning parameters. Maybe that's a reaction to
VMS, which had 257 parameters to tweak so the VMS group could blame any
performance problem on the user's failure to tune the system properly, but
it also comes from a gut belief that people shouldn't have to tell
computers how to behave efficiently.
The entire philosophy of my index design is that it doesn't require
people to make painful tradeoffs. The alternative to bitmap-driven index
retrieval is placement control, which is essentially impossible to get
right. And if it's impossible to get right, its only purpose is to inflict
pain and suffering on mankind and to win rigged benchmarks.
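For anyone who hasn't run into the term, this is roughly what I mean by
bitmap-driven index retrieval: an index scan deposits matching record
numbers into a bitmap, bitmaps from several indexes can be ANDed or ORed
together, and the surviving records are then fetched in storage order.
The sketch below is just the shape of the idea -- the types and function
names are invented for illustration, not engine code.

```cpp
// Sketch of bitmap-driven index retrieval (idea only, not engine code).
// An index scan emits record numbers into a bitmap; bitmaps from several
// indexes can be combined; records are then fetched in ascending record
// number, i.e. roughly storage order, regardless of index order.
#include <cstdint>
#include <set>
#include <vector>

using RecordBitmap = std::set<uint32_t>;  // sparse stand-in for a real bitmap

// Hypothetical index scan: record numbers matching some predicate.
RecordBitmap scanIndex(const std::vector<uint32_t>& matches)
{
    return RecordBitmap(matches.begin(), matches.end());
}

// Combine two retrievals (the OR case is the analogous set union).
RecordBitmap bitmapAnd(const RecordBitmap& a, const RecordBitmap& b)
{
    RecordBitmap out;
    for (uint32_t rec : a)
        if (b.count(rec))
            out.insert(rec);
    return out;
}

std::vector<uint32_t> fetchInStorageOrder(const RecordBitmap& bitmap)
{
    // std::set iterates in ascending record number, so the data pages are
    // visited in order -- no placement control or clustering hints needed.
    return std::vector<uint32_t>(bitmap.begin(), bitmap.end());
}

int main()
{
    RecordBitmap byCity = scanIndex({ 12, 7, 90, 41, 300 });
    RecordBitmap byYear = scanIndex({ 41, 7, 8, 300, 512 });

    std::vector<uint32_t> rows = fetchInStorageOrder(bitmapAnd(byCity, byYear));
    (void)rows;  // 7, 41, 300
    return 0;
}
```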
So I found a way around the tuning parameters that doesn't require a
full-time DBA to make it perform decently.
The proposed index scheme performs well on small databases and stinks on
big ones. By adding it, you are requiring the user to understand the two
schemes and their respective tradeoffs, and to make an informed and
difficult value judgement about which might be right for his application.
And if he gets it wrong, it's his own damn fault. I don't like that. Pick
a scheme that works for all cases, or at least works very well for the
important, hard cases and works OK for the trivial cases. Please don't
make the poor user understand that an alternate scheme works well if the
active part of his database fits completely in cache but won't as his
database grows, and that the bigger it gets, the more he loses. Or worse,
that his performance measurements on a prototype don't scale to a
production system. It won't make you friends in the long run, but it may
qualify you for a job at Oracle, if that's your ambition.
There are a lot of things that can be done to improve index density (Ann
reminds me regularly over drinks). Anything you can do to increase index
fill level pays back royally. Put your effort there. Or add variable-length
number encoding to remove the 255-byte key limit and reduce the average
size of the record number field. That's a straightforward processor/disk
tradeoff that the processor guys are making more attractive every day.
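A variable-length number encoding of the sort I mean could look something
like the common seven-bits-per-byte scheme sketched below -- an
illustration only, not a proposal for an actual on-disk format. Small
counts and record numbers cost a single byte, and larger values simply use
more bytes, so a one-byte length field no longer imposes a 255 ceiling.

```cpp
// Sketch of variable-length unsigned integer encoding: seven data bits per
// byte, high bit set when more bytes follow. Illustrative only.
#include <cstdint>
#include <vector>

void encodeVarint(uint64_t value, std::vector<uint8_t>& out)
{
    while (value >= 0x80) {
        out.push_back(static_cast<uint8_t>((value & 0x7F) | 0x80));
        value >>= 7;
    }
    out.push_back(static_cast<uint8_t>(value));
}

uint64_t decodeVarint(const uint8_t*& p)
{
    uint64_t value = 0;
    int shift = 0;
    while (*p & 0x80) {
        value |= static_cast<uint64_t>(*p++ & 0x7F) << shift;
        shift += 7;
    }
    value |= static_cast<uint64_t>(*p++) << shift;
    return value;
}

int main()
{
    std::vector<uint8_t> buf;
    encodeVarint(42, buf);     // 1 byte -- typical small prefix/length count
    encodeVarint(300, buf);    // 2 bytes -- a length over 255 is no problem
    encodeVarint(22000, buf);  // 3 bytes -- a larger record number

    const uint8_t* p = buf.data();
    uint64_t a = decodeVarint(p);  // 42
    uint64_t b = decodeVarint(p);  // 300
    uint64_t c = decodeVarint(p);  // 22000
    (void)a; (void)b; (void)c;
    return 0;
}
```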
Jim Starkey