| Subject | Re: [firebird-support] UTF8 and UNICODE_FSS |
|---|---|
| Author | Helen Borrie |
| Post date | 2008-01-11T07:18:20Z |
At 04:11 PM 11/01/2008, you wrote:
>Hello,
>
>A couple of questions regarding the UTF8 and UNICODE_FSS character
>sets in Firebird 2.0.
>
>
>(1) Do either of these support unicode characters outside the basic
>multilingual plane?
UNICODE_FSS is obsolete, so forget it.

For UTF8, think in terms of collations. If you can find a collation that supports the characters and mappings that you want to use, either "in the box" or in the ICU databases, then it's "supported".
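For illustration, a character outside the basic multilingual plane encodes to 4 bytes in UTF-8, which is more than the fixed 3-byte slots UNICODE_FSS works with. A minimal Python sketch (not Firebird-specific) shows the byte lengths involved:

```python
# Byte lengths of UTF-8 encodings: BMP vs. supplementary-plane characters.
bmp_char = "\u20AC"         # EURO SIGN, inside the BMP -> 3 bytes in UTF-8
astral_char = "\U00010348"  # GOTHIC LETTER HWAIR, outside the BMP -> 4 bytes in UTF-8

print(len(bmp_char.encode("utf-8")))     # 3
print(len(astral_char.encode("utf-8")))  # 4 -- cannot fit a 3-byte UNICODE_FSS slot
```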
>(2) There seems to be a big difference between the maximum field
>length of CHAR and VARCHAR fields for these character sets:
>
> UNICODE_FSS: 10921
> UTF8: 8191

UNICODE_FSS stores all characters as exactly 3 bytes, even those that are shorter. It can't store characters of 4 or more bytes. Max length of a varchar is 32,765 bytes: 32765/3 = 10921.66..., i.e. a max of 10921 UNICODE_FSS *characters*. You'll get an overflow if the string is longer than 10921 characters. (For CHAR type you get one more character.)

The max length of a UTF8 varchar is determined by the number of bytes in the *largest* character addressed by the collation, as declared in the manifest and to the database. In this example you have a collation which allows for a largest character of 4 bytes. Since validation happens on byte length, it's possible for a string longer than the declared char_length to pass validation if the value doesn't actually include many characters that large.
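The arithmetic behind those two maxima can be checked directly. A small Python sketch, assuming the 32,765-byte varchar limit quoted above and the usual 32,767-byte limit for CHAR, with a worst-case character of 3 bytes for UNICODE_FSS and 4 bytes for the UTF8 collation described:

```python
# Worked arithmetic for the maximum character counts quoted above.
MAX_VARCHAR_BYTES = 32765   # varchar payload limit in bytes
MAX_CHAR_BYTES = 32767      # assumed CHAR limit, two bytes larger than varchar

# UNICODE_FSS: every character occupies exactly 3 bytes.
print(MAX_VARCHAR_BYTES // 3)  # 10921
print(MAX_CHAR_BYTES // 3)     # 10922 -- "one more character" for CHAR

# UTF8 with a collation whose largest character is 4 bytes.
print(MAX_VARCHAR_BYTES // 4)  # 8191
```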
>
>Given that they are both essentially utf-8 implementations, what is
>the reason for this difference?
UNICODE_FSS was a longstanding, idiosyncratic workaround, kind of "not much good, but better than not having unicode at all". It is basically all of the 7-bit US ascii characters plus whatever else you choose to store, split up into 3-byte sequences. No collation as such, no well-formedness checks, just static upper/lowercase mappings for those ascii-like characters and case-blind to all others, with ordering by binary encoding.
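To make "static upper/lowercase mappings for those ascii-like characters" and "ordering by binary encoding" concrete, here is a minimal Python sketch of that kind of behaviour (an illustration of the idea only, not Firebird's actual code):

```python
# Sketch of UNICODE_FSS-style behaviour: case mapping only for 7-bit ASCII,
# and ordering that follows the raw binary encoding rather than a collation.

def ascii_only_upper(s: str) -> str:
    # Static upper-case mapping for a..z; all other characters pass through unchanged.
    return "".join(chr(ord(c) - 32) if "a" <= c <= "z" else c for c in s)

words = ["zebra", "Ähre", "apple", "Öl"]

# "Ordering by binary encoding": sort by the raw UTF-8 bytes.
binary_order = sorted(words, key=lambda w: w.encode("utf-8"))

print(ascii_only_upper("straße"))  # 'STRAßE' -- the non-ASCII ß is left alone
print(binary_order)                # non-ASCII letters sort after all ASCII ones
```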
./heLen