firebird-architect - Re: UTF-8 vs UTF-16

Subject	Re: UTF-8 vs UTF-16
Author	peter_jacobi.rm
Post date	2003-08-16T09:32:09Z

Hi Nickolay,

One last (?) remark on the more esoteric sides
of UNICODE before someone shouts OFF TOPIC:

--- In Firebird-Architect@yahoogroups.com, Nickolay Samofatov wrote:
> --- In Firebird-Architect@yahoogroups.com, Peter Jacobi wrote:
>> I propose UTF16BE encoding of the UNICODE subset
>> U+0000..U+FFFF, i.e. forget the astral planes, they
>> would only give us trouble.
> Various OSes and products are moving to full Unicode standard conformance
> and thus starting to support unusual codepoints. But the Unicode standard
> doesn't require implementing support for all defined codepoints
> anyway.

From UNICODE 4.0:
<cite>
The Unicode Standard provides 1,114,112 code points, most of which are
available for encoding of characters. The majority of the common
characters used in the major languages of the world are encoded in the
first 65,536 code points, also known as the Basic Multilingual Plane
(BMP). The overall capacity for more than a million characters is more
than sufficient for all known character encoding requirements,
including full coverage of all minority and historic scripts of the world.
</cite>

There it is! In 4.0, UNICODE spells out what was ambiguous
and implicit in 3.0: the promise that 65536 characters are enough
for everybody is void and broken.
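For reference, the figures in the quote follow from the plane layout: 17 planes of 65,536 code points each. A minimal sketch of that arithmetic (the constant names are made up for illustration):

```python
# Illustrative constants; the names are not taken from any library.
PLANES = 17                      # plane 0 (the BMP) plus 16 supplementary planes
CODE_POINTS_PER_PLANE = 0x10000  # 65,536 code points per plane

total_code_points = PLANES * CODE_POINTS_PER_PLANE
max_code_point = total_code_points - 1

print(total_code_points)    # 1114112
print(hex(max_code_point))  # 0x10ffff
```

So a 16-bit integer covers exactly one of the 17 planes.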

But the idea that 16 bit integers are big enough to represent
any UNICODE character is deeply entrenched in applications
(like Firebird) and entire computer languages (like Java).
Years ago, those choosing 32 bit for wide characters were
considered somewhat crazy.

Not only will many apps, OSes and computer languages
have no idea what
U+10091 LINEAR B IDEOGRAM B123 SPICE
is, but most of them will completely miss the fact that it
is a single character, especially when they see it as the
UTF-16 sequence 0xD800 0xDC91.
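How U+10091 turns into that two-unit sequence can be sketched as follows. The helper names are hypothetical; the arithmetic is the standard UTF-16 surrogate mapping (subtract 0x10000, then split the remaining 20 bits into two 10-bit halves).

```python
def to_surrogate_pair(code_point):
    """Split a supplementary-plane code point into (high, low) surrogates."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("not a supplementary-plane code point")
    offset = code_point - 0x10000        # 20-bit value
    high = 0xD800 + (offset >> 10)       # top 10 bits
    low = 0xDC00 + (offset & 0x3FF)      # bottom 10 bits
    return high, low


def from_surrogate_pair(high, low):
    """Recombine a (high, low) surrogate pair into one code point."""
    return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)


high, low = to_surrogate_pair(0x10091)
print(f"0x{high:04X} 0x{low:04X}")       # 0xD800 0xDC91
```

Code that treats each 16-bit unit as a character sees two "characters" here, neither of which is a valid character on its own.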

But does it matter?

I don't think that we should complicate matters and
increase all buffer sizes for supporting LINEAR B.
Anyway, a "weak support" viewing 0xD800 as a base character
and 0xDC91 as a combining mark (which is a UNICODE heresy)
will support even such exotic cases to some extent.
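A rough illustration of what such "weak support" amounts to, assuming the storage layer simply keeps UTF-16BE code units and never interprets surrogates (the function names are hypothetical):

```python
import struct


def utf16be_units(text):
    """Return the 16-bit UTF-16BE code units of text as a list of ints."""
    data = text.encode("utf-16-be")
    return list(struct.unpack(f">{len(data) // 2}H", data))


def units_to_text(units):
    """Rebuild text from 16-bit code units without inspecting them."""
    data = struct.pack(f">{len(units)}H", *units)
    return data.decode("utf-16-be")


sample = "A\U00010091B"          # 3 characters, one outside the BMP
units = utf16be_units(sample)

print(len(sample))               # 3 code points
print(len(units))                # 4 code units: 0x0041 0xD800 0xDC91 0x0042
print(units_to_text(units) == sample)   # True: the pair survives untouched
```

Storing and returning the units unchanged round-trips the data. The weak spots are operations that count or cut by unit: length reports 4 instead of 3, and truncating after the second unit leaves a lone 0xD800.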

Regards,
Peter Jacobi