| Subject | Re: [firebird-support] UTF8 and UNICODE_FSS |
|---|---|
| Author | Helen Borrie |
| Post date | 2008-01-11T07:18:20Z |
At 04:11 PM 11/01/2008, you wrote:
>Hello,
>
>A couple of questions regarding the UTF8 and UNICODE_FSS character
>sets in Firebird 2.0.
>
>
>(1) Do either of these support unicode characters outside the basic
>multilingual plane?
UNICODE_FSS is obsolete, so forget it.

For UTF8, think in terms of collations. If you can find a collation that supports the characters and mappings that you want to use, either "in the box" or in the ICU databases, then it's "supported".
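For illustration, a character outside the basic multilingual plane encodes to 4 bytes in UTF-8, which is more than the fixed 3-byte slots UNICODE_FSS works with. A minimal Python sketch (not Firebird-specific) shows the byte lengths involved:

```python
# Byte lengths of UTF-8 encodings: BMP vs. supplementary-plane characters.
bmp_char = "\u20AC"         # EURO SIGN, inside the BMP -> 3 bytes in UTF-8
astral_char = "\U00010348"  # GOTHIC LETTER HWAIR, outside the BMP -> 4 bytes in UTF-8

print(len(bmp_char.encode("utf-8")))     # 3
print(len(astral_char.encode("utf-8")))  # 4 -- cannot fit a 3-byte UNICODE_FSS slot
```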
>(2) There seems to be a big difference between the maximum field
>length of CHAR and VARCHAR fields for these character sets:
>
> UNICODE_FSS: 10921
> UTF8: 8191

UNICODE_FSS stores all characters as exactly 3 bytes, even those that are shorter. It can't store characters of 4 or more bytes. Max length of a varchar is 32,765 bytes: 32765/3 = 10921.66..., i.e. a max of 10921 UNICODE_FSS *characters*. You'll get an overflow if the string is longer than 10921 characters. (For CHAR type you get one more character.)

The max length of a UTF8 varchar is determined by the number of bytes in the *largest* character addressed by the collation, as declared in the manifest and to the database. In this example you have a collation which allows for a largest character of 4 bytes. Since validation happens on byte length, it's possible for a string longer than the declared char_length to pass validation if the value doesn't actually include many characters that large.
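The arithmetic behind those two maxima can be checked directly. A small Python sketch, assuming the 32,765-byte varchar limit quoted above and the usual 32,767-byte limit for CHAR, with a worst-case character of 3 bytes for UNICODE_FSS and 4 bytes for the UTF8 collation described:

```python
# Worked arithmetic for the maximum character counts quoted above.
MAX_VARCHAR_BYTES = 32765   # varchar payload limit in bytes
MAX_CHAR_BYTES = 32767      # assumed CHAR limit, two bytes larger than varchar

# UNICODE_FSS: every character occupies exactly 3 bytes.
print(MAX_VARCHAR_BYTES // 3)  # 10921
print(MAX_CHAR_BYTES // 3)     # 10922 -- "one more character" for CHAR

# UTF8 with a collation whose largest character is 4 bytes.
print(MAX_VARCHAR_BYTES // 4)  # 8191
```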
>
>Given that they are both essentially utf-8 implementations, what is
>the reason for this difference?
UNICODE_FSS was a longstanding, idiosyncratic workaround, kind of "not much good, but better than not having unicode at all". It is basically all of the 7-bit US ascii characters plus whatever else you choose to store, split up into 3-byte sequences. No collation as such, no well-formedness checks, just static upper/lowercase mappings for those ascii-like characters and case-blind to all others, with ordering by binary encoding.
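To make "static upper/lowercase mappings for those ascii-like characters" and "ordering by binary encoding" concrete, here is a minimal Python sketch of that kind of behaviour (an illustration of the idea only, not Firebird's actual code):

```python
# Sketch of UNICODE_FSS-style behaviour: case mapping only for 7-bit ASCII,
# and ordering that follows the raw binary encoding rather than a collation.

def ascii_only_upper(s: str) -> str:
    # Static upper-case mapping for a..z; all other characters pass through unchanged.
    return "".join(chr(ord(c) - 32) if "a" <= c <= "z" else c for c in s)

words = ["zebra", "Ähre", "apple", "Öl"]

# "Ordering by binary encoding": sort by the raw UTF-8 bytes.
binary_order = sorted(words, key=lambda w: w.encode("utf-8"))

print(ascii_only_upper("straße"))  # 'STRAßE' -- the non-ASCII ß is left alone
print(binary_order)                # non-ASCII letters sort after all ASCII ones
```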
./heLen