Subject: Re: [firebird-support] Re: Unable to create a table with primary key field as VARCHAR of length 255
Author: Ann Harrison
Post date: 2012-03-21T18:07:04Z

On Wed, Mar 21, 2012 at 8:55 AM, mpp12310 <mpp12310@...> wrote:
> Yes, I am using UTF8 as the character set; I need to ensure that the data
> supports UNICODE and UTF8 fits best with Delphi 7.
>
> I did not know about the restrictions, based upon character length.

Not exactly character length.
Here's a longer explanation of the maximum key size.
The relationship of key size to page size is important because a key that's
larger than 1/3 of the page size will cause the index to degenerate from a
tree to a list - every new entry causes splits all the way up the tree.
Not good. With bad luck, a single insert can turn into an infinite split.
Very bad for performance and definitely something to avoid.
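
To put rough numbers on that - these are from memory, so check the
release notes for your version - Firebird 2.x caps the key at about a
quarter of the page size, comfortably below the one-third danger point:

    page size  4096  ->  max key roughly 1020 bytes
    page size  8192  ->  max key roughly 2044 bytes
    page size 16384  ->  max key roughly 4092 bytes
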
Character keys are often bigger than they appear to be, first because of
multi-byte character sets and then because of collations. Only the
simplest collations generate keys that are the same size as the original
data; see http://unicode.org/reports/tr10/#Multi_Level_Comparison for
background on multi-level comparison.
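
Here's a sketch of what the original poster probably ran into, with
made-up table and column names, assuming a database with the default 4K
pages (the exact error text varies by version):

    /* Fails with something like "key size exceeds implementation
       restriction": 255 characters * 4 bytes per UTF8 character is
       already over the roughly 1020-byte key limit on a 4096-byte
       page. */
    CREATE TABLE ACCOUNT (
        ACCT_KEY VARCHAR(255) CHARACTER SET UTF8 NOT NULL PRIMARY KEY
    );

    /* Works: shorten the column a little ... */
    CREATE TABLE ACCOUNT (
        ACCT_KEY VARCHAR(250) CHARACTER SET UTF8 NOT NULL PRIMARY KEY
    );

    /* ... or rebuild the database with larger pages, e.g.
       CREATE DATABASE 'account.fdb' PAGE_SIZE 8192;
       at 8K the limit is roughly 2044 bytes, which leaves room for
       255 UTF8 characters plus some collation overhead. */
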
Several decades ago, the InterBase group decided that the maximum allowable
key size would be set at definition time, based on the ratio of the largest
possible representation of a key to the page size. In fact, index keys are
compressed in two ways: trailing spaces in strings and trailing zeros in
numbers are eliminated before concatenation in compound keys, and prefix
compression on the whole key drops any bytes at its start that duplicate
the previous key. So, generally, keys are
smaller than their maximum size. That's even more true now than it was in
1985 when almost all data was ASCII, using a binary collation. The most
heavily used characters in UTF8 are one or two bytes, not three or four.
Most characters do sort simply on their base identity.
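
As a schematic illustration of the prefix compression - the actual
on-page layout has more bookkeeping than this - each entry records how
many leading bytes it shares with the entry before it:

    SMITH      stored as  0,SMITH   (nothing shared with the prior key)
    SMITHSON   stored as  5,SON     (first 5 bytes match SMITH)
    SMYTHE     stored as  2,YTHE    (only "SM" shared this time)
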
On the other hand, the enforced limits are not very restrictive, and an
index where every key was actually at the limit would be very deep and
inefficient. So the fact that gstat shows you're getting 20 entries
per page when the key calculation said you'd get five is really a good
thing.
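
For anyone following along, those per-index numbers come from gstat's
index analysis; the output below is abbreviated and from memory, so
treat the exact layout as approximate:

    gstat -i mydb.fdb

    Index RDB$PRIMARY1 (0)
        Depth: 2, leaf buckets: 34, nodes: 5100
        Average data length: 12.00, total dup: 0, max dup: 0

A depth of 2 or 3 is normal; a depth that keeps growing with modest
amounts of data is the symptom of the oversized-key problem described
above.
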
Returning to the question at hand, it's not the character length that
determines the size of an index key, exactly, but the maximum number of
bytes required to represent any character in the character set, plus the
maximum number of bytes required to represent the levels in the collation,
with the sum of those two multiplied by the number of characters.
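
Worked through for the case in this thread (the per-character collation
overhead is approximate; check the documentation for your exact
version):

    bytes per key character = max bytes per character
                            + collation weight bytes per character

    UTF8, default binary collation:  4 + 0 = 4 bytes
        -> about 1020 / 4 = 255, in practice ~253 characters on a 4K page
    UTF8 with the UNICODE collation: roughly 6 bytes
        -> about 1020 / 6 = 170, in practice ~169 characters on a 4K page
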
Good luck,
Ann