Subject Re: [firebird-support] Do collation affects index size?
Author Ann Harrison
On Mon, Sep 3, 2012 at 6:09 AM, Fabiano <fabiano@...> wrote:

>
> create domain d1 varchar(100) character set utf8;
> create table t1 (c1 d1, c2 d1);
> alter table t1 add unique (c1, c2);
>
> create domain d2 varchar(100) character set utf8 collate unicode_ci_ai;
> create table t2 (c1 d2, c2 d2);
> alter table t2 add unique (c1, c2);
> -- unsuccessful metadata update. key size exceeds implementation
> restriction for index "RDB$2".
>
> Is this error expected?
>
>
Yes, and the solution is to increase the page size. Here's what's going
on. Without a specific collation, Firebird orders strings by the binary
value of the character, leading to the well-known ugliness that all upper
case characters sort before all lower case characters and accented
characters appear even later.

For more elaborate collations, the key is broken down into a base character
(e.g. All 'A' characters (A, a, and each with accents), and another byte is
added to the end of the string to hold the more subtle effects. If I were
guessing, I would have said that unicode_ci_ai could get by with just the
base character, but the code Firebird uses for key generation obviously
thinks otherwise.

You've already got a key that's right on the edge - two 100 character UTF8
fields. Compound keys are potentially larger that single field keys in an
effort to make them actually smaller. What Firebird is doing is
suppressing trailing blanks on the first column and using a marker byte
every four bytes to indicate which field the value comes from. Assuming the
worst case (which is pretty much necessary), you might have two 100
character fields each character of which is three bytes (or four?). That's
600 or 800 bytes. Add in 120 or 160 marker bytes. Then add an extra byte
per character for secondary characteristics...

The actual keys will be much smaller because you'll never have a full 100
characters in the upper planes of Unicode. Back in early ancient history
(InterBase V1.0) we allowed any key definition and threw a runtime error
if the key was too bit. Users didn't like that very much and the problem
was much more common back in the days before multi-byte character sets.

Good luck,

Ann


[Non-text portions of this message have been removed]