Subject Re: [firebird-support] UTF8 in firebird ?
Author Mark Rotteveel
On Fri, 06 Jan 2012 10:03:16 +0000, Lester Caine <lester@...>
> Mark Rotteveel wrote:
>> That is not what I said. UTF-8 encoding was originally devised to allow
>> for encoding 2^31 - 1 characters using variable length encoding of 1 to 6
>> bytes (which is afaik the entire range of unicode codepoints), but because
>> UTF16 only encodes 2^16-1 characters and uses surrogate pairs for higher
>> order codepoints, the decision was made by the standards committee to only
>> use UTF-8 encoding up to 4 bytes, so the same range of characters as UTF16
>> could be encoded to make coding between UTF16 and UTF8 easier.
> Just to correct this ...
> Unicode is 2^24-1
> 6 HEX digits
> 16 planes from 0x000000 to 0x10FFFF are currently defined.
> So all unicode characters can be defined in 3 BYTES.

The initial Unicode draft defined up to 2^31-1 codepoints:
" 128 groups of
256 planes of
256 rows of
256 cells,
for an apparent total of 2,147,483,648 characters, but actually the
standard could code only 679,477,248 characters..."

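As a quick check of the numbers in this exchange, here is a minimal Python sketch. The helper name utf8_encode_4byte and the variable names are made up for illustration; only the bit layout 11110www 10zzzzzz 10yyyyyy 10xxxxxx and the figures come from this message. It shows that 0x10FFFF fits in three raw bytes as a plain integer, while its UTF-8 form takes four bytes, and it compares the hand-rolled result with Python's built-in encoder.

```python
def utf8_encode_4byte(codepoint: int) -> bytes:
    """Encode a codepoint in U+10000..U+10FFFF using the four-byte
    pattern 11110www 10zzzzzz 10yyyyyy 10xxxxxx (illustrative helper)."""
    if not 0x10000 <= codepoint <= 0x10FFFF:
        raise ValueError("four-byte UTF-8 form covers U+10000..U+10FFFF only")
    return bytes([
        0xF0 | (codepoint >> 18),            # 11110www: top 3 bits
        0x80 | ((codepoint >> 12) & 0x3F),   # 10zzzzzz: next 6 bits
        0x80 | ((codepoint >> 6) & 0x3F),    # 10yyyyyy: next 6 bits
        0x80 | (codepoint & 0x3F),           # 10xxxxxx: low 6 bits
    ])


# Apparent size of the draft code space: 128 groups x 256 planes x 256 rows x 256 cells.
draft_code_space = 128 * 256 * 256 * 256
assert draft_code_space == 2**31

# The current upper limit needs 21 bits, so a raw integer fits in 3 bytes...
max_codepoint = 0x10FFFF
raw_bytes_needed = (max_codepoint.bit_length() + 7) // 8

# ...but the UTF-8 encoding of the same codepoint is 4 bytes long.
encoded = utf8_encode_4byte(max_codepoint)
assert encoded == chr(max_codepoint).encode("utf-8")

print(raw_bytes_needed, len(encoded), encoded.hex())
```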
The current standard says it will not go beyond 0x10FFFF (partly
because of UTF-16). I assume that by '3 bytes' you mean an encoding other
than UTF-8, because in UTF-8 the range U+010000 to U+10FFFF is encoded in
four bytes, as 11110www 10zzzzzz 10yyyyyy 10xxxxxx for codepoint
000wwwzz zzzzyyyy yyxxxxxx. "UTF-16 limits Unicode
to 10FFFFhex; therefore UTF-8 is not defined beyond that value, even if it
could easily be defined to reach 7FFFFFFFhex." (from