Subject | Re: [Firebird-Architect] Re: [firebird-support] Writing UTF16 to the database
---|---
Author | Olivier Mascia
Post date | 2005-02-28T13:57:15Z
On 28-Feb-05, at 11:19, Lester Caine wrote:
> Since what is stored internally can be compressed, what is stored does
> not matter, it's how it is managed that is the problem. What we ideally
> need is a fixed length record that we can do all of the string
> operations on, in which case 3 bytes per character covers every
> eventuality, but 4 bytes may be even more practical IN THE STRING
> CLASS?
> This would be in addition to the current single byte processing, and I
> think we have eliminated the need for any variable multibyte mechanism
> INTERNALLY. Just convert UTF8, UTF16 and UTF32 to the internal three
> byte string class and work all of the processing on fixed length
> strings. Don't need Unicode, then the multibyte character code is not
> needed, and we just get single byte character strings without any
> overheads?
Doesn't it sound more complex? I don't easily buy this. I'll keep
trying though.
> Most of the current problems being highlighted come about because we do
> not have a fixed character length to work to?

No. At the user level the problems come from the UNICODE_FSS Firebird
definition, which specifies a storage of 3 bytes per character and
has discrepancies regarding the number of characters it will accept to
store in a column. At the engine level the problems come from the fact
that Firebird must consider each and every string as qualified by a
specific charset. The database engine would gain by using a single
'charset' internally (in memory and on disk). For now, and for some
years to come at least, that 'ultimate' charset is Unicode, which is
precisely designed for this. A Unicode character is represented by a
number, the code point, which is basically a 32-bit number, though
today and for some years coming (Klingon and Asgard phonemes excluded)
fewer than 24 bits are actually used. All other national charsets can
be mapped to Unicode. The reverse is not always true: you can map a
GB18030 string to Unicode, then back. But you can't map a GB18030
string to Unicode and then to ISO8859-1, of course. So for user I/O,
the engine can talk ISO8859_1 to some client or for some column
(expecting iso8859_1 and returning iso8859_1) while the internal
processing and storage is based on strings of 24-bit quantities.
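
To make that one-way mapping concrete, here is a minimal C++ sketch (my
own illustration, not Firebird code; the helper names are hypothetical).
ISO8859-1 happens to be the identity mapping onto U+0000..U+00FF, so
mapping into Unicode always succeeds while mapping back is partial:

```cpp
#include <cassert>
#include <cstdint>
#include <optional>

// A Unicode code point is just a number in U+0000..U+10FFFF,
// which fits in 21 bits -- comfortably under 24.
using CodePoint = std::uint32_t;
constexpr CodePoint MAX_CODE_POINT = 0x10FFFF;

// Into Unicode: total, always succeeds.
CodePoint fromIso8859_1(unsigned char b) { return b; }

// Back out of Unicode: partial, most code points have no
// ISO8859-1 equivalent -- hence the one-way street above.
std::optional<unsigned char> toIso8859_1(CodePoint cp)
{
    if (cp > 0xFF)
        return std::nullopt;
    return static_cast<unsigned char>(cp);
}

int main()
{
    assert(MAX_CODE_POINT < (1u << 24));      // fits in 24 bits
    assert(toIso8859_1(fromIso8859_1(0xE9))); // 'é' round-trips
    assert(!toIso8859_1(0x4E2D));             // CJK '中' does not
}
```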
How to handle and store those strings of 24-bit characters is then an
implementation detail.
Trivial choices are the following (a sketch contrasting all three
follows the list):
a) Opt for a 24-bit internal representation. Each character is a
triplet of bytes, both in memory (a string class to be designed to work
with such characters) and on disk. Some kind of compression can be
researched for the on-disk storage. This design consumes 3 bytes
(barring disk compression) for each and every character. Advantage: it
is incredibly simple to code. Disadvantage: playing with 24-bit
quantities is not alignment-friendly and would have a negative
performance impact on some processors.
b) Opt for a 32-bit internal representation. Even simpler to
implement: each character is a 32-bit integer. Sign issues won't even
come into play, as the upper 8 bits (and more) will never be used
(always zero) anyway. Good compression, taking the unused 4th byte into
account, is required to keep the storage size requirements from
exploding, but internal handling of strings of 32-bit characters can be
streamlined on most processors. As far as I have understood INTL (but
is it the new one or the former one?), it does map to such a 32-bit
representation while doing its stuff.
c) Opt for an 8-bit internal representation, making use of UTF-8 as the
encoding. This moves us to the world where byte-len != char-len, but
this is already true for other MBCS character sets, and UTF-8 is
simpler to program and to optimize than other MBCS encodings.
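
To make the trade-offs concrete, a minimal self-contained C++ sketch
(again my own illustration, not engine code) of how indexing and
character counting look under each of the three options:

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

// Option (a): 24-bit characters packed as byte triplets. Indexing is
// still O(1), but every access assembles the value from unaligned bytes.
std::uint32_t charAt24(const std::vector<std::uint8_t>& s, std::size_t i)
{
    const std::uint8_t* p = &s[i * 3];
    return (std::uint32_t(p[0]) << 16) | (std::uint32_t(p[1]) << 8) | p[2];
}

// Option (b): fixed-width 32-bit characters. Element count equals
// character count, and the i-th character is a plain aligned access.
using Utf32String = std::vector<std::uint32_t>;

// Option (c): UTF-8 bytes. byte-len != char-len; counting characters
// means walking the string, skipping continuation bytes (10xxxxxx).
std::size_t utf8CharLength(const std::string& s)
{
    std::size_t n = 0;
    for (unsigned char b : s)
        if ((b & 0xC0) != 0x80)   // count only non-continuation bytes
            ++n;
    return n;
}

int main()
{
    // "Hé中" -- code points U+0048, U+00E9, U+4E2D.
    std::vector<std::uint8_t> s24 = { 0,0,0x48, 0,0,0xE9, 0,0x4E,0x2D };
    assert(charAt24(s24, 2) == 0x4E2D);            // O(1), but unaligned

    Utf32String s32 = { 0x48, 0xE9, 0x4E2D };
    assert(s32.size() == 3 && s32[2] == 0x4E2D);   // O(1) and aligned

    std::string s8 = "\x48\xC3\xA9\xE4\xB8\xAD";   // same text in UTF-8
    assert(s8.size() == 6);                        // 6 bytes...
    assert(utf8CharLength(s8) == 3);               // ...but 3 characters
}
```

In this sketch, options a) and b) keep O(1) indexing at the cost of 3
or 4 bytes per character, while c) trades fixed-width indexing for
compactness.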
The one thing which looks important to me, architecturally at least,
is to have a single lingua inside the engine: have all characters
mapped to a single 'character set' and stored as such, even if the
on-disk storage itself uses some additional mangling or compression.
Choosing the right in-memory representation for Unicode (24 bits, 32
bits, UTF-8, UTF-16, and so on) is then a technical choice, but a
technical choice that can have a wide impact. It looks to me (but who
am I in the field of Firebird DB?) that UTF-8 is appealing for the job.
Maybe a plain 32-bit storage would have more advantages than
disadvantages. There will undoubtedly be a lot of discussion around
this.
But the important thing for now is: do people recognize the advantage
of having a single internal representation or not? Do people prefer to
capitalize on the existing multi-charset solution? If people want to
have a single unified charset (internally), then I think it will be
hard to argue for anything other than full Unicode. If that point is
understood and accepted, then only the choice of the internal coding of
Unicode is left to discuss. Note that there are of course other
solutions than the above three choices, but they don't look very
appealing. A 16-bit based storage (UTF-16) would require handling cases
where a character needs two 16-bit quantities, a surrogate pair.
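
For completeness, a minimal sketch (my illustration) of that extra
case: in UTF-16, code points above U+FFFF are encoded as a surrogate
pair of two 16-bit units, which every string routine would then have to
handle:

```cpp
#include <cassert>
#include <cstdint>

// Combine a UTF-16 surrogate pair back into a single code point.
std::uint32_t decodeSurrogatePair(std::uint16_t high, std::uint16_t low)
{
    assert(high >= 0xD800 && high <= 0xDBFF);  // high (lead) surrogate
    assert(low  >= 0xDC00 && low  <= 0xDFFF);  // low (trail) surrogate
    return 0x10000 + ((std::uint32_t(high) - 0xD800) << 10)
                   + (std::uint32_t(low) - 0xDC00);
}

int main()
{
    // U+1D11E (musical G clef) encodes as D834 DD1E in UTF-16.
    assert(decodeSurrogatePair(0xD834, 0xDD1E) == 0x1D11E);
}
```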
I'm sorry to jump into this topic as deeply as I do, but I feel very
concerned by this eventual "Unicodization" of, let's say, Firebird 3.
My only effective contribution might well be no more than expressing
my concerns, ideas, and suggestions, but I felt the need to at least
write about it.
--
Olivier Mascia