firebird-support - Re: [firebird-support] Re: Firebird and Unicode queries

Subject	Re: [firebird-support] Re: Firebird and Unicode queries
Author	Olivier Mascia
Post date	2005-02-11T11:01:06Z

On 11 Feb 2005 at 09:18, Lester Caine wrote:

> (Helen 'The Book' is wrong - UNICODE_FSS can not store full UTF-8 As
> THAT can be up to 6 bytes. UTF-32 is a four byte truncation of UTF-8 so
> can only be stored if the upper ( or is that the lower endien ;) ) byte
> is 00)

The Book is not wrong.

The Unicode codespace extends from 0 to 10FFFF (hex), which fits in 21
bits and makes a bit more than 1.1 million code points. Think of it as
17 "planes" of 64K code points each.

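The arithmetic behind those figures can be checked in a few lines. This
is only a sketch; the constant names are mine, not from the standard.

```python
# Size of the Unicode codespace, computed from the plane model.
PLANES = 17
CODE_POINTS_PER_PLANE = 0x10000   # 65,536, i.e. "64K"
MAX_CODE_POINT = 0x10FFFF

total = PLANES * CODE_POINTS_PER_PLANE
print(total)                         # 1114112
print(total == MAX_CODE_POINT + 1)   # True
print(MAX_CODE_POINT.bit_length())   # 21
```
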
"The UTF-8 encoding form maintains transparency for all of the ASCII
code points (0x00..0x7F). That means Unicode code points U+0000..U+007F
are converted to single bytes 0x00..0x7F in UTF-8, and are thus
indistinguishable from ASCII itself. Furthermore, the values 0x00..0x7F
do not appear in any byte for the representation of any other Unicode
code point, so that there can be no ambiguity. Beyond the ASCII range
of Unicode, many of the non-ideographic scripts are represented by two
bytes per code point in UTF-8; all non-surrogate code points between
U+0800 and U+FFFF are represented by three bytes; and supplementary
code points above U+FFFF require four bytes."
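
The byte counts in that quote are easy to observe with any UTF-8
encoder. A small sketch using Python's built-in codec; the sample
characters and the helper name are my own choice:

```python
def utf8_length(ch: str) -> int:
    """Number of bytes UTF-8 uses for a single character."""
    return len(ch.encode("utf-8"))

# U+0041 (ASCII), U+00E9, U+20AC, U+1F600 (beyond U+FFFF)
samples = ["A", "\u00e9", "\u20ac", "\U0001F600"]
lengths = [utf8_length(c) for c in samples]
print(lengths)  # [1, 2, 3, 4]

# ASCII transparency: no byte of a non-ASCII character falls in 0x00..0x7F.
non_ascii_bytes = b"".join(c.encode("utf-8") for c in samples[1:])
print(all(b >= 0x80 for b in non_ascii_bytes))  # True
```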

Representing the full 31-bit range that the UTF-8 encoding principle
was originally designed for would require up to 6 bytes. In practice,
4 bytes is the maximum needed, because the Unicode codespace does NOT
span that whole range: it stops at 10FFFF (hex), which fits in 21 bits.
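
To see where the 6 and the 4 come from, count payload bits. In the
original UTF-8 scheme, as I understand it, an n-byte sequence (n >= 2)
has a lead byte with n one-bits and a zero bit, leaving 7 - n payload
bits, plus 6 payload bits in each continuation byte; the 6-byte form
therefore carries 31 bits. The function names below are made up for
this illustration.

```python
def payload_bits(n_bytes: int) -> int:
    """Payload bits carried by an n-byte sequence in the original UTF-8 scheme."""
    if n_bytes == 1:
        return 7                      # 0xxxxxxx
    # Lead byte keeps 7 - n bits; each continuation byte (10xxxxxx) keeps 6.
    return (7 - n_bytes) + 6 * (n_bytes - 1)

def bytes_needed(bits: int) -> int:
    """Smallest sequence length (1..6) able to carry a value of that many bits."""
    for n in range(1, 7):
        if payload_bits(n) >= bits:
            return n
    raise ValueError("more than 31 bits does not fit this scheme")

print([payload_bits(n) for n in range(1, 7)])  # [7, 11, 16, 21, 26, 31]
print(bytes_needed(21))  # 4
print(bytes_needed(31))  # 6
```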

What's more, UTF-32 is NOT a "four byte truncation of UTF-8".
Absolutely NOT. Here, you're wrong, Lester.

UTF-8, UTF-16 and UTF-32 are 3 official, well-defined representations
of the whole Unicode codespace. UTF-32 is just the simplest of them:
each code point is stored directly in one 32-bit value, which obviously
takes 4 bytes.

"As for all of the Unicode encoding forms, UTF-32 is restricted to
representation of code points in the range 0..10FFFF₁₆ - that is, the
Unicode codespace. This guarantees interoperability with the UTF-16 and
UTF-8 encoding forms."
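
That interoperability is easy to demonstrate: the same text converts
losslessly between the three forms. A minimal sketch with Python's
built-in codecs (the sample string is arbitrary; the "-be" variants are
used only to avoid a byte order mark in the output):

```python
text = "A\u00e9\u20ac\U0001F600"   # 1-, 2-, 3- and 4-byte UTF-8 characters

utf8 = text.encode("utf-8")
utf16 = text.encode("utf-16-be")
utf32 = text.encode("utf-32-be")
print(len(utf8), len(utf16), len(utf32))  # 10 10 16

# Each form decodes back to the same sequence of code points.
print(utf8.decode("utf-8") == text)       # True
print(utf16.decode("utf-16-be") == text)  # True
print(utf32.decode("utf-32-be") == text)  # True

# Converting from one form to another goes through the code points.
print(utf32.decode("utf-32-be").encode("utf-8") == utf8)  # True
```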

UNICODE_FSS seems to use 3 bytes, so 24 bits. So I assume that this
thing called 'UNICODE_FSS' is just like UTF-32 where the most
significant byte, which is always zero, is not stored. If that is the
case, then, YES, UNICODE_FSS can store the entire Unicode codespace
and a clear, lossless, bidirectional conversion is possible between
any of these 4 representations: UTF-8, UTF-16, UTF-32, UNICODE_FSS.
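
To make that assumption concrete, here is a sketch of such a 3-byte
scheme. It is purely hypothetical: it illustrates the guess made above
(UTF-32 minus its always-zero top byte) and is NOT taken from Firebird's
documentation or source, so the real UNICODE_FSS layout may well differ.
The names pack3 and unpack3 are invented for this example.

```python
def pack3(text: str) -> bytes:
    """Hypothetical layout: each code point as 3 big-endian bytes."""
    return b"".join(ord(ch).to_bytes(3, "big") for ch in text)

def unpack3(data: bytes) -> str:
    """Inverse of pack3."""
    if len(data) % 3:
        raise ValueError("length must be a multiple of 3")
    return "".join(
        chr(int.from_bytes(data[i:i + 3], "big"))
        for i in range(0, len(data), 3)
    )

text = "A\u00e9\u20ac\U0001F600"
packed = pack3(text)
print(len(packed))              # 12
print(unpack3(packed) == text)  # True

# Same bytes as UTF-32 (big-endian) with the leading zero byte dropped.
utf32 = text.encode("utf-32-be")
stripped = b"".join(utf32[i + 1:i + 4] for i in range(0, len(utf32), 4))
print(stripped == packed)       # True
```

Since the largest code point, 10FFFF (hex), needs only 21 bits, 24 bits
per character is enough for such a layout to round-trip any text.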

Any language-specific encoding, be it single-byte or multi-byte, can
be recoded in any of these 4 alternatives. So UNICODE_FSS should be
able to correctly store any string from any language-specific code
page. The reverse is not true, of course: not every UNICODE_FSS string
can be mapped to ISO_8859_1 (for instance).
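
The asymmetry shows up immediately with any converter. A small sketch
using Python's "iso-8859-1" codec as a stand-in for ISO_8859_1 (the
helper name fits_latin1 is mine):

```python
def fits_latin1(text: str) -> bool:
    """True if every character of text exists in ISO 8859-1."""
    try:
        text.encode("iso-8859-1")
        return True
    except UnicodeEncodeError:
        return False

print(fits_latin1("caf\u00e9"))  # True: U+00E9 exists in ISO 8859-1
print(fits_latin1("\u20ac"))     # False: the euro sign U+20AC does not

# The forward direction always works: every ISO 8859-1 byte has a
# Unicode code point, so a trip through UTF-8 is lossless.
raw = bytes(range(256))
via_utf8 = raw.decode("iso-8859-1").encode("utf-8")
print(via_utf8.decode("utf-8").encode("iso-8859-1") == raw)  # True
```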

Reading this:

http://www.unicode.org/versions/Unicode4.0.0/ch02.pdf
(especially section 2.5)

will shed much more light on the subject for any interested reader.

--
Olivier Mascia