Subject | Re: [firebird-support] UTF8 in firebird ? |
---|---|
Author | Mark Rotteveel |
Post date | 2012-01-06T10:19:10Z |
On Fri, 06 Jan 2012 10:03:16 +0000, Lester Caine <lester@...>
wrote:
> Mark Rotteveel wrote:
>> That is not what I said. UTF-8 encoding was originally devised to allow
>> for encoding 2^31 - 1 characters using variable length encoding of 1 to 6
>> bytes (which is afaik the entire range of unicode codepoints), but because
>> UTF16 only encodes 2^16-1 characters and uses surrogate pairs for higher
>> order codepoints, the decision was made by the standards committee to only
>> use UTF-8 encoding upto 4 bytes, so the same range of characters as UTF16
>> could be encoded to make coding between UTF16 and UTF8 easier.
>
> Just to correct this ...
> Unicode is 2^24-1
> 6 HEX digits
> 16 planes from 0x000000 to 0x10FFFF are currently defined.
> So all unicode characters can be defined in 3 BYTES.
The initial Unicode draft defined up to 2^31-1 codepoints:
" 128 groups of
256 planes of
256 rows of
256 cells,
for an apparent total of 2,147,483,648 characters, but actually the
standard could code only 679,477,248 characters..."
(http://en.wikipedia.org/wiki/Universal_Character_Set)
The current standard states that codepoints will not go beyond 0x10FFFF
(partly because of UTF-16). I assume that by '3 bytes' you mean when not
using UTF-8, because in UTF-8 the range U+010000 to U+10FFFF takes four
bytes: 11110www 10zzzzzz 10yyyyyy 10xxxxxx for codepoint
000wwwzz zzzzyyyy yyxxxxxx. "UTF-16 limits Unicode
to 10FFFFhex; therefore UTF-8 is not defined beyond that value, even if it
could easily be defined to reach 7FFFFFFFhex." (from
http://en.wikipedia.org/wiki/UTF-8).
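
To make the four-byte bit layout concrete, here is a minimal Python sketch. The function name utf8_encode_supplementary is made up for this illustration, and it only handles the U+010000 to U+10FFFF range discussed above; it is not meant as a general encoder.

```python
def utf8_encode_supplementary(cp: int) -> bytes:
    """Encode a codepoint in U+010000..U+10FFFF as four UTF-8 bytes.

    Hypothetical helper for illustration only; real code should simply
    use str.encode("utf-8").
    """
    if not 0x10000 <= cp <= 0x10FFFF:
        raise ValueError("expected a codepoint in U+010000..U+10FFFF")
    www = (cp >> 18) & 0x07     # top 3 bits    -> 11110www
    zzzzzz = (cp >> 12) & 0x3F  # next 6 bits   -> 10zzzzzz
    yyyyyy = (cp >> 6) & 0x3F   # next 6 bits   -> 10yyyyyy
    xxxxxx = cp & 0x3F          # lowest 6 bits -> 10xxxxxx
    return bytes([0xF0 | www, 0x80 | zzzzzz, 0x80 | yyyyyy, 0x80 | xxxxxx])


# Cross-check against Python's built-in encoder at both ends of the range.
for cp in (0x10000, 0x10FFFF):
    assert utf8_encode_supplementary(cp) == chr(cp).encode("utf-8")
```

The payload is 3 + 6 + 6 + 6 = 21 bits, which is why these codepoints do not fit in three UTF-8 bytes even though their raw value fits in three plain bytes.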
Mark