Subject Re: [firebird-support] UTF8 in firebird ?
Author Geoff Worboys
Michael Ludwig wrote:
> Isn't it rather that UTF-8 just follows the *Unicode*
> standard which doesn't make any provisions for codepoints
> above 1114111 (0x10FFFF) and hence doesn't require UTF-8
> to use more than four bytes for encoding?
...
> So it's a max of four bytes per character for UTF-8 as of
> Unicode 6.0, period.

Apparently it's more definite than simply "as of". I found
the following quote at this link:
http://www.unicode.org/faq//utf_bom.html

- - -
Q: Will UTF-16 ever be extended to more than a million
characters?

A: No. Both Unicode and ISO 10646 have policies in place that
formally limit future code assignment to the integer range that
can be expressed with current UTF-16 (0 to 1,114,111). Even if
other encoding forms (i.e. other UTFs) can represent larger
integers, these policies mean that all encoding forms will
always represent the same set of characters. Over a million
possible codes is far more than enough for the goal of Unicode
of encoding characters, not glyphs. Unicode is not designed to
encode arbitrary data. If you wanted, for example, to give each
“instance of a character on paper throughout history” its own
code, you might need trillions or quadrillions of such codes;
noble as this effort might be, you would not use Unicode for
such an encoding.
- - -

So the current limit of just over a million code points
(0 to 0x10FFFF) seems pretty well fixed.
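
For anyone who wants to check the arithmetic, here is a rough Python
sketch (the helper names are my own invention, not from the FAQ or
from Firebird). It assumes Python 3's built-in codecs, and shows that
the largest UTF-16 surrogate pair lands exactly on 0x10FFFF and that
this code point takes four bytes in UTF-8:

```python
# Sketch only: the helper names below are made up for illustration.
MAX_CODE_POINT = 0x10FFFF  # 1,114,111, the upper limit quoted above


def utf8_length(code_point):
    """Number of bytes Python's UTF-8 codec emits for one code point."""
    return len(chr(code_point).encode("utf-8"))


def decode_surrogate_pair(high, low):
    """Combine a UTF-16 high/low surrogate pair into a code point."""
    return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)


# The largest possible surrogate pair gives the largest code point.
assert decode_surrogate_pair(0xDBFF, 0xDFFF) == MAX_CODE_POINT

# Top of each UTF-8 length class, ending at the Unicode maximum.
for cp in (0x7F, 0x7FF, 0xFFFF, MAX_CODE_POINT):
    print(f"U+{cp:04X}: {utf8_length(cp)} byte(s) in UTF-8")
```

Python itself refuses chr(0x110000), which is the same limit showing
up in practice.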

--
Geoff Worboys
Telesis Computing Pty Ltd