firebird-architect - Re: UTF-8 vs UTF-16

Subject	Re: UTF-8 vs UTF-16
Author	peter_jacobi.rm
Post date	2003-08-15T18:25:20Z

Hi Nickolay,

First, I should say that I don't want to outsmart
the guys in the trenches; my original message should have
been worded more modestly.

I particularly agree with your counter-argument:

> [fixing work with UNICODE_FSS] is valuable for all MBCS
> charsets.

This is true beyond doubt!

On the other hand, I'm a bit skeptical about:

> [...] but fixing work with UNICODE_FSS is
> also good and mostly trivial task.

I assume UNICODE_FSS (and its problems) have been in IB/FB
for eons, and no effort has been made to fix them. So perhaps
nobody uses it for production work, which makes the
pressure to fix it even smaller. Sorry, I'm speculating
again.

Also, from the discussion in devel, I was under the
impression that it is unclear whether it is possible
to fix the wire-protocol part without breaking apps.

The efficiency points, which depend on whether you
prefer CHAR or VARCHAR, can be argued both ways, I assume:

> > on-the-wire and index data pays 3 bytes for every character
> ^^^^ False in almost all cases. True only for CHAR(N) fields.

[...]

> > P2) Uses network bandwidth and memory 33% more efficiently
> No, exactly opposite, if you request data as SQL_VARYING and your
> data is mostly ASCII.
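
To make both positions concrete, here is a minimal Python sketch that
just counts bytes. The helper names are made up for illustration, and
the 3-bytes-per-character reservation for CHAR(N) is taken from the
quoted text above, not measured against a server.

```python
def utf8_size(text: str) -> int:
    """Bytes needed for the text as UTF-8 (the VARCHAR-style case)."""
    return len(text.encode("utf-8"))


def utf16_size(text: str) -> int:
    """Bytes needed for the text as UTF-16BE without a byte-order mark."""
    return len(text.encode("utf-16-be"))


def fss_char_size(n_chars: int) -> int:
    """Assumption from the thread: CHAR(N) in UNICODE_FSS pays 3 bytes per character."""
    return 3 * n_chars


if __name__ == "__main__":
    for sample in ("Firebird", "日本語"):
        print(sample,
              "utf8:", utf8_size(sample),
              "utf16:", utf16_size(sample),
              "CHAR(N) in FSS:", fss_char_size(len(sample)))
```

For mostly-ASCII data requested as SQL_VARYING, UTF-8 needs about half
the bytes of UTF-16; for CHAR(N) columns or CJK-heavy data the
comparison turns around.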

On a last point:

> > (C2 can be avoided and C3 can be weakened, by not storing
> > the 16bit UNICODE value directly but a transformed one)
>
> I do not think that such transformations are good idea. What
> exactly are you going to transform ?

It's a hackish solution, but it has some plusses.
Implementation would be:

a) map U+D800..U+DFFF to U+FFFD
we don't support astral planes anyway

b) map U+0000..U+00FF to U+2000..U+20FF
maps space to U+2020 (bytes 20 20) for easy compression,
and makes ISQL connecting with charset NONE display
ASCII (and ISO-8859-1) somewhat readably - connecting with
charset NONE is nevertheless silly, IMHO

c) map U+2000..U+20FF to U+D800..U+D8FF
needed to make room for b)

d) map U+XX00 to U+D9XX (XX = 01..FF)
will eliminate the other source of NUL bytes
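
Rules a) to d) can be sketched in a few lines of Python. This is only
an illustration of the mapping as stated above, not a proposal for the
real fbintl code: the names transform and encode are hypothetical, the
order in which overlapping rules apply (a, then b, c, d) is my
assumption, and so is sending astral code points to U+FFFD.

```python
def transform(cp: int) -> int:
    """Map one BMP code point to the 16-bit value that would be stored."""
    if not 0 <= cp <= 0xFFFF:
        raise ValueError("only BMP code points are handled")
    if 0xD800 <= cp <= 0xDFFF:       # a) surrogates become U+FFFD
        return 0xFFFD
    if cp <= 0x00FF:                 # b) Latin-1 range moves to U+20XX
        return 0x2000 + cp
    if 0x2000 <= cp <= 0x20FF:       # c) displaced block moves to U+D8XX
        return 0xD800 + (cp - 0x2000)
    if cp & 0xFF == 0:               # d) U+XX00 moves to U+D9XX
        return 0xD900 + (cp >> 8)
    return cp                        # everything else is stored unchanged


def encode(text: str) -> bytes:
    """Encode text as big-endian 16-bit units after the transformation."""
    out = bytearray()
    for ch in text:
        cp = ord(ch)
        if cp > 0xFFFF:              # assumption: astral planes become U+FFFD
            cp = 0xFFFD
        out += transform(cp).to_bytes(2, "big")
    return bytes(out)


if __name__ == "__main__":
    print(encode("A b"))             # ASCII shows up with a space before each character
    print(encode("\u4e00").hex())    # U+4E00 has a zero low byte, so rule d) applies
```

Taken literally, U+0000 (stored as U+2000) and U+2000 (stored as
U+D800) would still produce a zero low byte, so those two would need
an extra rule.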

OK, this hack aside, let me restate that I would
be the first to welcome a correct UNICODE_FSS:

> I think effort should be first directed to fixing UNICODE_FSS
> implementation bugs namely:
> 1) incorrect padding of CHAR(N) values
> 2) lack of control for character string overfilling

3) prohibiting invalid UTF-8 sequences in UNICODE_FSS cols
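
As a rough sketch of what checks 2) and 3) amount to, here they are in
Python, leaning on Python's strict UTF-8 decoder. The function name
check_fss_value is hypothetical, and the rejection of code points
above U+FFFF assumes the 3-byte limit of UNICODE_FSS; the engine
itself would of course do this in its own code.

```python
def check_fss_value(raw: bytes, max_chars: int) -> str:
    """Return the decoded text, or raise ValueError if the value must be rejected."""
    try:
        # The strict decoder rejects truncated sequences, overlong forms
        # and encoded surrogates.
        text = raw.decode("utf-8")
    except UnicodeDecodeError as exc:
        raise ValueError(f"invalid UTF-8 sequence at byte {exc.start}") from exc
    if any(ord(ch) > 0xFFFF for ch in text):
        # Assumption: UNICODE_FSS holds at most 3 bytes per character.
        raise ValueError("character outside the BMP")
    if len(text) > max_chars:
        # Point 2): count characters, not bytes, against the declared length.
        raise ValueError(f"{len(text)} characters exceed declared length {max_chars}")
    return text


if __name__ == "__main__":
    print(check_fss_value("日本語".encode("utf-8"), 3))
    for bad in (b"\xc0\x80", b"\xe6\x97"):
        try:
            check_fss_value(bad, 10)
        except ValueError as err:
            print("rejected:", err)
```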

But whereas fixing UNICODE_FSS costs valuable server
developer time, a UTF16BE charset would only need a change
in fbintl.dll (and can be tested in fbintl2.dll).

Regards,
Peter Jacobi