Subject Re: [Firebird-Architect] Re: [firebird-support] Writing UTF16 to the database
Author Lester Caine
Olivier Mascia wrote:

> a) Opt for a 24 bits internal representation. Each character is a
> triplet of bytes. Both in memory (string class to be designed to work
> with such characters) and on disk. Some kind of compression can be
> researched for the on-disk storage. This design consumes 3 bytes
> (unless disk compression) for each and every character. Advantage: it
> is incredibly simple to code. Disadvantage: playing with 24 bits
> quantities is not alignment friendly and would have a negative
> performance impact on some processors.

Hence my suggestion that while we only need three bytes in the record,
four may be better for alignment and character counts.

> b) Opt for a 32 bits internal representation. Even simpler to
> implement. Each char is a 32 bits integer. Sign issues won't even come
> into play as more than the 8 upper bits will never be used (zero)
> anyway. Good compression taking into account the unused 4th byte is
> required to not explode the storage size requirements. But internal
> handling of strings of 32 bits characters can be streamlined on most
> processors. As far as I have understood INTL (but is it the new one or
> the former one ?) does map to such a 32 bits representation while doing
> its stuff.

We are probably into OS and hardware considerations as to what is
faster. 32bit working sounds nice for 32 and 64 bit processors ;)
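
To put rough numbers on options (a) and (b), here is a minimal sketch in
Python (chosen only for illustration; none of this is Firebird code, and
pack24, unpack24, char_at24 and pack32 are hypothetical helper names). It
stores the same string at 3 and at 4 bytes per character and compares
both with UTF-8. The byte order used is an arbitrary assumption.

```python
def pack24(text):
    """Option (a): store each code point in 3 big-endian bytes."""
    return b"".join(ord(ch).to_bytes(3, "big") for ch in text)


def unpack24(data):
    """Inverse of pack24: read the buffer back 3 bytes at a time."""
    return "".join(
        chr(int.from_bytes(data[i:i + 3], "big"))
        for i in range(0, len(data), 3)
    )


def char_at24(data, index):
    """Fixed width means character N always sits at byte offset 3 * N."""
    return chr(int.from_bytes(data[3 * index:3 * index + 3], "big"))


def pack32(text):
    """Option (b): one 32-bit unit per character (UTF-32, no BOM)."""
    return text.encode("utf-32-be")


# 12 characters: 9 ASCII, one accented Latin letter, one CJK ideograph,
# and one character outside the 16-bit range.
sample = "Firebird \u00e9\u4e2d\U0001F600"

print(len(sample), "characters")
print(len(pack24(sample)), "bytes at 24 bits per character")
print(len(pack32(sample)), "bytes at 32 bits per character")
print(len(sample.encode("utf-8")), "bytes as UTF-8")
```

Both fixed-width forms keep the byte length an exact multiple of the
character count, which is what makes indexing and character counts cheap.
The fourth byte buys alignment only, at the price of a third more space.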

> c) Opt for a 8 bits internal representation, making use of utf-8 as the
> encoding pattern. This moves us to the world where byte-len !=
> char-len, but this is already true for other MBCS character sets and is
> simpler to program and to optimize than other MBCS.

That is the one I am trying to avoid. 'run length coded' strings do not
sound user friendly at all.
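
For what byte-len != char-len means in practice with option (c), a small
Python sketch follows (illustrative only; utf8_char_count and
utf8_char_offset are hypothetical names, not an existing engine API, and
the input is assumed to be well-formed UTF-8). Counting characters or
finding the Nth one means scanning the bytes, because continuation bytes
of the form 10xxxxxx have to be skipped.

```python
def utf8_char_count(data):
    """Count characters by counting bytes that are not continuation bytes."""
    return sum(1 for b in data if (b & 0xC0) != 0x80)


def utf8_char_offset(data, index):
    """Return the byte offset where character number `index` starts.

    Unlike a fixed-width encoding, this needs a scan from the start.
    """
    seen = 0
    for pos, b in enumerate(data):
        if (b & 0xC0) != 0x80:  # a lead byte or a plain ASCII byte
            if seen == index:
                return pos
            seen += 1
    raise IndexError("character index out of range")


# 'n' takes 1 byte, e-acute takes 2, the CJK ideograph takes 3.
encoded = "n\u00e9\u4e2d".encode("utf-8")

print(len(encoded), "bytes")
print(utf8_char_count(encoded), "characters")
print(utf8_char_offset(encoded, 2), "is the byte offset of the third character")
```

The scan is simple and needs no lookup table, but a declared column
length in characters no longer maps to a fixed number of bytes.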

> The one thing which looks important to me, architecturally wise at
> least, is to have a single lingua inside the engine. Have all
> characters mapped to a single 'character set' and stored as such, even
> if oxyde storage itself uses some additional mangling or compression.
> Choosing the right in-memory representation for Unicode (24 bits, 32
> bits, UTF-8, UTF-16, and so on) is then a technical choice. But a
> technical choice that can have a wide impact. It looks to me, but who
> am I in the field of Firebird DB, that utf-8 is appealing to the job.
> Maybe a plain 32 bits storage would have more advantages than
> disadvantages. There will undoubtedly be a lot of discussion around this.

PERSONALLY I think we are looking for two modes, but the discussions on
system tables seem to be muddying the water. The bulk of users have no
problem with 8bit character data, and will only ever use that for all of
their systems, so there should be no 'overhead' from unicode in those
situations, and the 'character set' is set as currently. When multiple
character sets are required ( my data archive crosses several languages
already ) then the multi-byte mode should be enabled as an option, and
work with one internal representation, which requires more than 16 bit
data characters.
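
As an illustration of why 16 bits per character is not enough, this
Python sketch (to_surrogates and utf16_units are hypothetical helper
names, used only for this example) shows a character above U+FFFF being
split into two 16-bit units when UTF-16 is used.

```python
def to_surrogates(code_point):
    """Split a code point above U+FFFF into a UTF-16 surrogate pair."""
    if code_point <= 0xFFFF:
        raise ValueError("code point fits in a single 16-bit unit")
    value = code_point - 0x10000
    high = 0xD800 + (value >> 10)    # top 10 bits
    low = 0xDC00 + (value & 0x3FF)   # bottom 10 bits
    return high, low


def utf16_units(text):
    """Return the 16-bit units Python's own UTF-16 encoder produces."""
    data = text.encode("utf-16-be")
    return [int.from_bytes(data[i:i + 2], "big") for i in range(0, len(data), 2)]


print([hex(u) for u in to_surrogates(0x1F600)])
print([hex(u) for u in utf16_units("A\U0001F600")])
```

With a 16-bit store, unit count and character count diverge as soon as
one such character appears, so the same length bookkeeping problem as
with UTF-8 comes back.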

> But the important thing for now is : do people recognize the advantage
> of having a single internal representation or not ? Do people prefer to
> capitalize on the existing multi-charset solution ? If people want to
> have a single unified charset (internally), then I think it will be
> hard to argue for anything else than full Unicode. If that is a point
> understood and accepted, then only the choice of internal coding of
> Unicode is left to discuss. Note that there are other solutions than
> the above three choices of course. But they don't look much appealing.
> A 16 bits based storage would require to handle cases where a character
> requires 2 16 bits quantities.
>
> I'm sorry to jump as widely in this topic as much as I do, but I feel
> myself very concerned by this eventual "Unicodization" of, let's say,
> Firebird 3. My only effective contribution might well be no more than
> expressing my concerns, ideas, and suggestions, but I felt like needing
> to at least write about it.

It was only when I started playing with UNICODE_FSS I realised that we
had a half way house that was not working well, and now I'm on hold
until I can get clean Unicode data into the data archive. Which is why I
started asking questions ;)

--
Lester Caine
-----------------------------
L.S.Caine Electronic Services