firebird-architect - Re: [Firebird-Architect] UTF-8 vs UTF-16

Subject	Re: [Firebird-Architect] UTF-8 vs UTF-16
Author	Nickolay Samofatov
Post date	2003-08-15T17:08:13Z

Hello, Peter!

> Given that:

> a) the implementation of UTF-8 (UNICODE_FSS) is not without problems,
> to say the least.
> b) one of the goals of UTF-8 is space saving for near-ASCII data, but
> in Firebird,

> only on-disk rows save,

^^^^ True.

> whereas in-memory,

^^^^ True, UNICODE_FSS consumes 3 bytes of memory per character.

> on-the-wire and index data pays 3 bytes for every character

^^^^ False in almost all cases. True only for CHAR(N) fields.
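
To make the CHAR(N) versus VARCHAR distinction concrete, here is a rough sketch of the arithmetic. It is not engine code: the function names are made up for illustration, the 2-byte length prefix for a VARYING value is an assumption, and Python's "utf-8" codec stands in for UNICODE_FSS.

```python
# Illustrative arithmetic only, not Firebird code.
MAX_BYTES_PER_CHAR = 3  # per the post: UNICODE_FSS takes up to 3 bytes per character
LENGTH_PREFIX = 2       # assumption: 2-byte length word in front of a VARYING value


def char_n_size(n):
    """Fixed-width CHAR(N): the full N * 3 byte buffer is always carried."""
    return n * MAX_BYTES_PER_CHAR


def varying_size(value):
    """Variable-length value: length prefix plus the bytes actually used."""
    return LENGTH_PREFIX + len(value.encode("utf-8"))


print(char_n_size(20))           # 60
print(varying_size("Firebird"))  # 10
```

Under these assumptions an 8-character ASCII value costs 10 bytes as a VARYING value, but 60 bytes in a CHAR(20) column.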

> I am tempted to propose adding UTF-16 (representing each [this is not
> quite true, ask if you dare] character with two bytes) support to
> Firebird - and actively promoting its use, obsoleting UNICODE_FSS.

UTF-16 is a nice idea, but fixing the handling of UNICODE_FSS is
also a good and mostly trivial task. It would benefit all MBCS
charsets.
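
On the bracketed caveat in the quoted proposal: UTF-16 is itself a variable-length encoding, since characters outside the Basic Multilingual Plane take two 16-bit units (a surrogate pair). A small check, using only Python's standard codecs (the helper name is hypothetical):

```python
def utf16_units(ch):
    """Number of 16-bit code units needed to encode one character in UTF-16."""
    return len(ch.encode("utf-16-le")) // 2


print(utf16_units("A"))           # 1
print(utf16_units("\u20ac"))      # 1  (EURO SIGN, inside the BMP)
print(utf16_units("\U0001D11E"))  # 2  (MUSICAL SYMBOL G CLEF, outside the BMP)
```

So a UTF-16 charset would still have to cope with characters of differing lengths, only less often.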

> PROs:
> P1) Strange behaviour of UNICODE_FSS due to variable byte length of
> characters can be avoided

These bugs can be easily fixed in the engine.

> P2) Uses network bandwidth and memory 33% more efficiently

No, exactly the opposite, if you request the data as SQL_VARYING and
it is mostly ASCII.
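
A quick way to see both sides of P2 is to compare encoded lengths directly. This is only a sketch: Python's "utf-8" codec stands in for UNICODE_FSS, the helper name is made up, and the sample strings are arbitrary.

```python
def encoded_sizes(text):
    """Return (UTF-8 bytes, UTF-16 bytes) for the same text."""
    return len(text.encode("utf-8")), len(text.encode("utf-16-le"))


print(encoded_sizes("Firebird"))            # (8, 16)  pure ASCII
print(encoded_sizes("Gr\u00fc\u00dfe"))     # (7, 10)  mostly ASCII
print(encoded_sizes("\u65e5\u672c\u8a9e"))  # (9, 6)   three CJK characters
```

The 33% saving appears only when every character needs 3 bytes in UTF-8, as in the CJK sample. For ASCII text sent at its actual length, UTF-16 doubles the size.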

> CONs:
> C1) Uses disk bandwidth less efficiently for near-ASCII data.

True.

> C2) Trailing spaces compress worse, also bad for disk bandwidth
> C3) Looks strange in tools connecting with character set none,
> as there are 0 bytes embedded in the data
> (C2 can be avoided and C3 can be weakened, by not storing
> the 16bit UNICODE value directly but a transformed one)

I do not think that such transformations are a good idea. What
exactly are you going to transform?
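
For reference, points C2 and C3 in the quoted list can be reproduced with a few lines. The sketch assumes a byte-oriented run-length compressor (an assumption about how record compression would see the data) and little-endian UTF-16; longest_run is a made-up helper, not an engine routine.

```python
def longest_run(data):
    """Length of the longest run of identical consecutive bytes."""
    best = run = 0
    prev = None
    for b in data:
        run = run + 1 if b == prev else 1
        best = max(best, run)
        prev = b
    return best


value = "ab" + " " * 8            # a short value padded with trailing spaces
as_utf8 = value.encode("utf-8")
as_utf16 = value.encode("utf-16-le")

print(longest_run(as_utf8))       # 8   the spaces form one compressible run
print(longest_run(as_utf16))      # 1   0x20 and 0x00 alternate, no byte runs
print(as_utf16.count(0))          # 10  zero bytes visible to a charset NONE client
```

A compressor that worked on 16-bit units instead of bytes would see the run of spaces again, which is one way to address C2 without transforming the stored values.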

I think effort should first be directed at fixing the UNICODE_FSS
implementation bugs, namely:
1) incorrect padding of CHAR(N) values
2) lack of checks against overfilling character strings
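
As a rough illustration of what fixing both points could mean, the sketch below counts characters rather than bytes when padding and when checking length. This is one possible reading of the two bugs, not the engine's actual code: store_char_n is a hypothetical name, Python's "utf-8" codec stands in for UNICODE_FSS, and only BMP characters (at most 3 bytes each) are assumed.

```python
MAX_BYTES_PER_CHAR = 3  # per the post: up to 3 bytes per UNICODE_FSS character


def store_char_n(value, n):
    """Encode value for a CHAR(n) column, padding and checking by characters."""
    # Point 2: reject values that have more characters than the column allows.
    if len(value) > n:
        raise ValueError(f"string of {len(value)} characters does not fit CHAR({n})")
    # Point 1: pad with spaces up to n characters, not up to n * 3 bytes.
    padded = value + " " * (n - len(value))
    encoded = padded.encode("utf-8")
    # Assumption: BMP-only input, so the result always fits the byte buffer.
    if len(encoded) > n * MAX_BYTES_PER_CHAR:
        raise ValueError("character needs more than 3 bytes")
    return encoded


stored = store_char_n("a\u00f1", 4)
print(len(stored.decode("utf-8")))  # 4  characters after padding
print(len(stored))                  # 5  bytes: 1 + 2 + two spaces
```

The value comes back as exactly 4 characters, while the byte length varies with the content and stays within the 12-byte buffer of a CHAR(4) column.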

> Peter Jacobi

Nickolay Samofatov