Subject Re: [firebird-support] UTF8 in firebird ?
Author Mark Rotteveel
On 7-1-2012 14:36, Michael Ludwig wrote:
> Isn't it rather that UTF-8 just follows the *Unicode* standard which
> doesn't make any provisions for codepoints above 1114111 (0x10FFFF) and
> hence doesn't require UTF-8 to use more than four bytes for encoding?
>
> Okay, I took a look at the Unicode 6.0 standard doc, section 3.9, p.93:
>
> UTF-8 encoding form: The Unicode encoding form that assigns each
> Unicode scalar value to an unsigned byte sequence of one to four
> bytes in length, as specified in Table 3-6 and Table 3-7.
>
> And those tables also show four bytes only.
>
> So it's a max of four bytes per character for UTF-8 as of Unicode 6.0,
> period.

I was talking about the historical perspective. Originally (in the 80s /
90s, when they started on the first Unicode specs), they were thinking
about allowing up to 2^31-1 characters, and the UTF-8 encoding was
devised to cover that entire range. Later it was decided that 0x00 to
0x10FFFF would be enough for all characters. Consequently, UTF-8 does
not allow more than 4 bytes, even though longer sequences are
technically possible.
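
To make the byte counts concrete, here is a rough sketch in Python. The
helper name utf8_length and the constant MAX_UNICODE are my own, made up
for illustration; the range limits are those of the original UTF-8
design, which used up to 6 bytes to reach 2^31-1.

```python
# Sketch only: utf8_length is a hypothetical helper, not a library function.
# It returns how many bytes the ORIGINAL UTF-8 design (1 to 6 bytes)
# would need for a given code point in the 31-bit range.

MAX_UNICODE = 0x10FFFF  # upper limit of the Unicode code space today


def utf8_length(code_point: int) -> int:
    if code_point < 0 or code_point > 0x7FFFFFFF:
        raise ValueError("outside the original 31-bit range")
    # Largest value representable with 1, 2, 3, 4, 5 and 6 bytes.
    limits = [0x7F, 0x7FF, 0xFFFF, 0x1FFFFF, 0x3FFFFFF, 0x7FFFFFFF]
    for n, limit in enumerate(limits, start=1):
        if code_point <= limit:
            return n
    raise AssertionError("unreachable")


# Everything up to the current Unicode maximum fits in 4 bytes,
# and Python's own encoder agrees.
assert utf8_length(MAX_UNICODE) == 4
assert len(chr(MAX_UNICODE).encode("utf-8")) == 4

# Only values beyond today's limit would need the 5 and 6 byte forms.
assert utf8_length(0x200000) == 5
assert utf8_length(0x7FFFFFFF) == 6
```

Note that Python itself refuses chr(0x110000) with a ValueError, which
is the "not allowed" part: the bit pattern could be extended, but the
standard no longer permits it.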

Mark

--
Mark Rotteveel