Subject | Re: UTF-8 vs UTF-16 |
---|---|
Author | peter_jacobi.rm |
Post date | 2003-08-15T18:25:20Z |
Hi Nickolay,
First I have to state that I don't want to outsmart
the guys in the trenches and my original message should have
been more modest in formulation.
I particularly agree with your counter-argument:
> [fixing work with UNICODE_FSS] is valuable for all MBCS
> charsets.
This is true beyond doubt!
On the other hand, I'm a bit skeptical about:
> [...] but fixing work with UNICODE_FSS is
> also good and mostly trivial task.
I assume UNICODE_FSS (and its problems) has been in IB/FB
for eons, and no effort has been made to fix it. So perhaps
nobody uses it for production work, which makes the
pressure to fix it even smaller. Sorry, I'm speculating
again.
Also, from the discussion in devel, I was under the
impression that it is unclear whether it is possible
to fix the wire-protocol part without breaking apps.
The efficiency points, which depend on whether you
prefer CHAR or VARCHAR, can be seen both ways, I assume:
> > on-the-wire and index data pays 3 bytes for every character
> ^^^^ False in almost all cases. True only for CHAR(N) fields.
[...]
> > P2) Uses network bandwidth and memory 33% more efficently
> No, exactly opposite, if you request data as SQL_VARYING and your
> data is mostly ASCII.
On a last point:
> > (C2 can be avoided and C3 can be weakened, by not storing
> > the 16bit UNICODE value directly but a transformed one)
>
> I do not think that such transformantions are good idea. What
> exactly are you going to transform ?
It's a hackish solution, but it has some plusses.
Implementation would be:
a) map U+D800..U+DFFF to U+FFFD
we don't support astral planes anyway
b) map U+0000..U+00FF to U+2000..U+20FF
will map space to U+2020 for easy compression,
will make ISQL connecting with charset NONE display
ASCII (and ISO-8859-1) somewhat readable - connecting with
charset NONE is nevertheless silly, IMHO
c) map U+2000..U+20FF to U+D800..U+D8FF
must make the place for b)
d) map U+XX00 to U+D9XX (XX = 01..FF)
will eliminate the other source of NUL bytes
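For illustration, mappings a) to d) could be sketched in Python roughly as follows. This is only a sketch under assumptions that are not spelled out above: the function names `transform`, `restore` and `encode` are made up for the example, code points beyond the BMP are treated like surrogates, and rule d) is checked before rule c) so that U+2000 becomes U+D920 rather than U+D800 (which would itself carry a 0x00 low byte).

```python
REPLACEMENT = 0xFFFD  # U+FFFD REPLACEMENT CHARACTER


def transform(cp: int) -> int:
    """Map a code point to the 16-bit value that would be stored."""
    if cp > 0xFFFF or 0xD800 <= cp <= 0xDFFF:
        return REPLACEMENT            # a) surrogates (astral planes too, by assumption)
    if cp <= 0x00FF:
        return 0x2000 + cp            # b) U+0000..U+00FF -> U+2000..U+20FF
    if cp & 0xFF == 0:
        return 0xD900 | (cp >> 8)     # d) U+XX00 -> U+D9XX, checked before c)
    if 0x2000 <= cp <= 0x20FF:
        return 0xD800 | (cp & 0xFF)   # c) make room for b)
    return cp                         # everything else is stored unchanged


def restore(stored: int) -> int:
    """Inverse of transform() for every value it can produce."""
    if 0x2000 <= stored <= 0x20FF:
        return stored - 0x2000            # undo b)
    if 0xD800 <= stored <= 0xD8FF:
        return 0x2000 | (stored & 0xFF)   # undo c)
    if 0xD900 <= stored <= 0xD9FF:
        return (stored & 0xFF) << 8       # undo d)
    return stored


def encode(text: str) -> bytes:
    """Transform each character, then serialise as big-endian 16-bit units."""
    return b"".join(transform(ord(ch)).to_bytes(2, "big") for ch in text)
```

With this ordering, `encode("A B")` gives the bytes `20 41 20 20 20 42`, so a client reading them as single-byte characters sees the original ASCII letters interleaved with spaces. Note that U+0000 itself still maps to U+2000 under b) as written, so that one character keeps a 0x00 byte.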
OK, this hack aside, let me restate that I would
be the first to welcome a correct UNICODE_FSS:
> I think effort should be first directed to fixing UNICODE_FSS
> implementation bugs namely:
> 1) incorrect padding of CHAR(N) values
> 2) lack of control for character string overfilling
3) prohibiting invalid UTF-8 sequences in UNICODE_FSS cols
But whereas the UNICODE_FSS fixing costs valuable server
developer time, a UTF16BE charset would only need a change
in fbintl.dll (and can be tested in fbintl2.dll).
Regards,
Peter Jacobi