| Subject | RE: [Firebird-Architect] Re: UTF-8 vs UTF-16 |
|---|---|
| Author | David Schnepper |
| Post date | 2003-08-24T18:48:31Z |
I think I agree with Peter --
> But the idea that 16 bit integers are big enough to represent
> any UNICODE character is deeply entrenched in applications
> (like Firebird) and entire computer languages (like Java).
> Years ago, those choosing 32 bit for wide characters were
> considered somewhat crazy.
>
> But not only will many apps, OSes and computer languages
> have no idea what
> U+10091 A LINEAR B IDEOGRAM B123 SPICE
> is, but most of them will completely miss the fact that it
> is a single character, especially when they see it as a
> UTF-16 sequence 0xD800 0xDC91.
>
> But does it matter?
>
> I don't think that we should complicate matters and
> increase all buffer sizes for supporting LINEAR B.
> Anyway, a "weak support" viewing 0xD800 as base character
> and 0xDC91 as combining mark (which is a UNICODE heresy)
> will even support such exotic cases to some extent.
>
> Regards,
> Peter Jacobi
>
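For reference, the surrogate pair Peter cites falls directly out of the standard UTF-16 encoding rule for supplementary characters. A minimal sketch in C (illustration only, not Firebird code):

```c
#include <stdio.h>
#include <stdint.h>

/* Encode a supplementary code point (U+10000..U+10FFFF) as a
   UTF-16 surrogate pair, per the Unicode standard. */
static void utf16_encode(uint32_t cp, uint16_t *hi, uint16_t *lo)
{
    uint32_t v = cp - 0x10000;               /* 20-bit value   */
    *hi = (uint16_t)(0xD800 | (v >> 10));    /* high surrogate */
    *lo = (uint16_t)(0xDC00 | (v & 0x3FF));  /* low surrogate  */
}

int main(void)
{
    uint16_t hi, lo;
    utf16_encode(0x10091, &hi, &lo);  /* LINEAR B IDEOGRAM B123 SPICE */
    printf("U+10091 -> 0x%04X 0x%04X\n", hi, lo);  /* 0xD800 0xDC91 */
    return 0;
}
```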
The issues in multi-character-set support, as I saw them years ago, were:
a) What is a CHARACTER? Specifically, what does it mean to declare a field of CHARACTER(1)?
b) Character fidelity.
c) What you put in is what you get out.
UNICODE_FSS allocates 3 bytes when you define CHARACTER(1), which is what you need to guarantee storage of any valid character in the BMP.
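A quick sketch of why 3 bytes is the ceiling for a BMP character in UTF-8 (illustration only; the real UNICODE_FSS logic lives in Firebird's intl code):

```c
#include <stdint.h>

/* UTF-8 length of a BMP code point (U+0000..U+FFFF): it never
   exceeds 3 bytes, which is why a CHARACTER(1) column in
   UNICODE_FSS can reserve exactly 3 bytes. */
static int utf8_len_bmp(uint32_t cp)
{
    if (cp < 0x80)   return 1;  /* 0xxxxxxx                   */
    if (cp < 0x800)  return 2;  /* 110xxxxx 10xxxxxx          */
    return 3;                   /* 1110xxxx 10xxxxxx 10xxxxxx */
}
```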
I think 32-bit Unicode support just isn't going to happen,
at least not for a decade.
I think, for Firebird, that UNICODE-16 support should be put
in, and if people want to store supplementary characters
in it, well: it works for point c above; it should work
for point b (as I doubt there are any other character
sets that encode those characters other than the way Unicode's
supplementary encoding does); and it doesn't work for point a.
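To make the point (a) failure concrete: anything that counts 16-bit units sees two "characters" where a supplementary character stores one code point. A hypothetical counting sketch (mine, not Firebird code):

```c
#include <stddef.h>
#include <stdint.h>

/* Count code points in a UTF-16 buffer. A supplementary character
   occupies two 16-bit units but one code point, so unit count and
   code point count disagree -- the CHARACTER(1) ambiguity above. */
static size_t utf16_codepoints(const uint16_t *s, size_t units)
{
    size_t n = 0;
    for (size_t i = 0; i < units; i++) {
        if (s[i] >= 0xD800 && s[i] <= 0xDBFF &&     /* high surrogate */
            i + 1 < units &&
            s[i + 1] >= 0xDC00 && s[i + 1] <= 0xDFFF)
            i++;                                    /* skip its pair  */
        n++;
    }
    return n;
}
/* { 0xD800, 0xDC91 } is 2 units, but utf16_codepoints() returns 1. */
```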
Incidentally, this is similar to what happens with
CHARSET NONE. You can store Latin1, Unicode-8, Unicode-16,
or even Unicode-32 there, but the database can't
promise to transform it into the appropriate format
when you move it around; it's up to your application
to make sense of the sequence of bytes.
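As a toy illustration of that ambiguity (my example bytes, not anything Firebird does): the same two stored bytes count as two characters under a Latin1 reading and one under a UTF-8 reading.

```c
#include <stdio.h>
#include <stddef.h>

/* Under Latin1, every byte is a character; under UTF-8, bytes of
   the form 10xxxxxx are continuations and don't start one. */
static size_t utf8_chars(const unsigned char *s, size_t len)
{
    size_t n = 0;
    for (size_t i = 0; i < len; i++)
        if ((s[i] & 0xC0) != 0x80)
            n++;
    return n;
}

int main(void)
{
    const unsigned char bytes[] = { 0xC3, 0xA9 };  /* U+00E9 in UTF-8 */
    printf("Latin1: %zu chars, UTF-8: %zu char\n",
           sizeof bytes, utf8_chars(bytes, sizeof bytes));
    return 0;
}
```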
Here are my thoughts on the project:
- Define a UNICODE_BE and a UNICODE_LE character set.
- Define a character set alias UNICODE that maps to the proper character set on a platform-specific basis.
- The UNICODE101 internal character set becomes one of the above, also on a platform basis (a compile-time issue only).
- As a "public" character set, UNICODE101 disappears; columns are publicly either UNICODE_BE or UNICODE_LE.
- The wire format wouldn't need modification: the client would request UNICODE_BE format as part of dpb_lc_ctype, and the server would transliterate _LE into _BE for the client (a byte-swap sketch follows).
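That _LE to _BE transliteration is just a byte swap of each 16-bit code unit; surrogate halves swap independently, so no special casing is needed. A hypothetical sketch (unicode_le_to_be is my name, not an existing Firebird routine):

```c
#include <stddef.h>
#include <stdint.h>

/* Swap each 16-bit unit from little-endian to big-endian.
   Surrogate pairs need no special handling because each half
   is swapped on its own. */
static void unicode_le_to_be(const uint8_t *src, uint8_t *dst,
                             size_t units)
{
    for (size_t i = 0; i < units; i++) {
        dst[2 * i]     = src[2 * i + 1];
        dst[2 * i + 1] = src[2 * i];
    }
}
```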
Dave