| Subject | Re: [firebird-support] UTF8 in firebird ? |
|---|---|
| Author | Mark Rotteveel |
| Post date | 2012-01-07T21:30:53Z |
On 7-1-2012 14:36, Michael Ludwig wrote:
> Isn't it rather that UTF-8 just follows the *Unicode* standard which
> doesn't make any provisions for codepoints above 1114111 (0x10FFFF) and
> hence doesn't require UTF-8 to use more than four bytes for encoding?
>
> Okay, I took a look at the Unicode 6.0 standard doc, section 3.9, p.93:
>
> UTF-8 encoding form: The Unicode encoding form that assigns each
> Unicode scalar value to an unsigned byte sequence of one to four
> bytes in length, as specified in Table 3-6 and Table 3-7.
>
> And those tables also show four bytes only.
>
> So it's a max of four bytes per character for UTF-8 as of Unicode 6.0,
> period.

I was talking about the historical perspective. Originally (in the 80s / 90s, when they started on the first Unicode specs), they were thinking about allowing up to 2^31-1 characters, and the UTF-8 encoding was devised to cover that entire range. Later it was decided that 0x00 to 0x10FFFF would be enough for all characters. Consequently, UTF-8 does not allow more than four bytes per character, even though more are technically possible.
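To make the above concrete, here is a small Python sketch (my illustration, not from the original thread) showing the 1-to-4-byte UTF-8 lengths for code points across the defined ranges, and that code points beyond U+10FFFF are simply not valid Unicode scalar values:

```python
# UTF-8 encodes each Unicode scalar value in 1 to 4 bytes; the highest
# code point Unicode defines is U+10FFFF.
for cp in (0x41, 0xE9, 0x20AC, 0x10FFFF):
    encoded = chr(cp).encode("utf-8")
    print(f"U+{cp:06X} -> {len(encoded)} byte(s): {encoded.hex(' ')}")

# Code points above U+10FFFF are rejected outright, even though the
# original UTF-8 design could have represented values up to 2^31 - 1
# using sequences of up to 6 bytes.
try:
    chr(0x110000)
except ValueError as e:
    print("0x110000 rejected:", e)
```

Running this prints 1, 2, 3, and 4 bytes for the four sample code points, and a `ValueError` for 0x110000, matching the four-byte ceiling described above.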
Mark
--
Mark Rotteveel