Subject Re: [firebird-support] UTF8 in firebird ?
Author Mark Rotteveel
On Fri, 06 Jan 2012 10:03:16 +0000, Lester Caine <lester@...> wrote:
> Mark Rotteveel wrote:
>> That is not what I said. UTF-8 encoding was originally devised to allow
>> for encoding 2^31 - 1 characters using variable length encoding of 1 to 6
>> bytes (which is afaik the entire range of unicode codepoints), but because
>> UTF16 only encodes 2^16-1 characters and uses surrogate pairs for higher
>> order codepoints, the decision was made by the standards committee to only
>> use UTF-8 encoding upto 4 bytes, so the same range of characters as UTF16
>> could be encoded to make coding between UTF16 and UTF8 easier.
>
> Just to correct this ...
> Unicode is 2^24-1
> 6 HEX digits
> 16 planes from 0x000000 to 0x10FFFF are currently defined.
> So all unicode characters can be defined in 3 BYTES.

The initial draft of the Universal Character Set (ISO 10646) defined up to
2^31 code points:
"128 groups of 256 planes of 256 rows of 256 cells, for an apparent total
of 2,147,483,648 characters, but actually the standard could code only
679,477,248 characters..."
(http://en.wikipedia.org/wiki/Universal_Character_Set)
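
The two totals in that quote can be checked with a little arithmetic. This
is only a sketch: the first figure follows directly from the quote, while
the second assumes the restriction described in the same Wikipedia article,
namely that the C0 and C1 control byte values (0x00-0x1F and 0x80-0x9F)
were not allowed in any of the four octets. The variable names are my own.

```python
# Apparent size: 128 groups x 256 planes x 256 rows x 256 cells.
apparent_total = 128 * 256 * 256 * 256

# Assumption: each octet loses the 32 C0 and 32 C1 control values.
# The group octet only spans 0x00-0x7F, so it loses just the 32 C0 values.
usable_group_values = 128 - 32        # 96
usable_other_values = 256 - 32 - 32   # 192 for plane, row and cell

codable_total = usable_group_values * usable_other_values ** 3

print(apparent_total)  # 2147483648, i.e. 2^31
print(codable_total)   # 679477248
```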

The current standard says they will not go further than 0x10FFFF (partly
because of UTF-16). I assume that with '3 bytes' you mean the raw code point
value rather than UTF-8, because in UTF-8 the range U+010000 to U+10FFFF
takes 4 bytes: 11110www 10zzzzzz 10yyyyyy 10xxxxxx for code point
000wwwzz zzzzyyyy yyxxxxxx. "UTF-16 limits Unicode to 10FFFFhex; therefore
UTF-8 is not defined beyond that value, even if it could easily be defined
to reach 7FFFFFFFhex." (from http://en.wikipedia.org/wiki/UTF-8).
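
To make the bit layout concrete, here is a minimal sketch in Python that
packs a code point from U+010000 to U+10FFFF into the four-byte pattern
above and compares the result with Python's built-in encoder. The function
name encode_utf8_4byte is my own, for illustration only.

```python
def encode_utf8_4byte(cp: int) -> bytes:
    """Encode a code point in U+10000..U+10FFFF as 4 UTF-8 bytes."""
    if not 0x10000 <= cp <= 0x10FFFF:
        raise ValueError("code point outside the 4-byte UTF-8 range")
    w = (cp >> 18) & 0x07  # top 3 bits    -> 11110www
    z = (cp >> 12) & 0x3F  # next 6 bits   -> 10zzzzzz
    y = (cp >> 6) & 0x3F   # next 6 bits   -> 10yyyyyy
    x = cp & 0x3F          # lowest 6 bits -> 10xxxxxx
    return bytes([0xF0 | w, 0x80 | z, 0x80 | y, 0x80 | x])


# The highest code point needs 21 bits, so it fits in 3 bytes as a raw
# value, but its UTF-8 form still takes 4 bytes.
highest = 0x10FFFF
print(highest.bit_length())              # 21
print(encode_utf8_4byte(highest).hex())  # f48fbfbf
print(encode_utf8_4byte(highest) == chr(highest).encode("utf-8"))  # True
```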

Mark