firebird-architect - UTF-8 vs UTF-16

Subject	UTF-8 vs UTF-16
Author	peter_jacobi.rm
Post date	2003-08-15T16:56:02Z

Given that:

a) the implementation of UTF-8 (UNICODE_FSS) is not without problems,
to say the least.

b) one of the goals of UTF-8 is space saving for near-ASCII data, but
in Firebird only on-disk rows benefit, whereas in-memory, on-the-wire
and index data pay 3 bytes for every character
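
To put numbers on point b), here is a small sketch in Python (not
Firebird code). The function name and sample strings are made up for
illustration; the "reserved" figure simply assumes 3 bytes are set
aside per character, as described above, and the sample text is
assumed to stay within the Basic Multilingual Plane.

```python
def sizes(text):
    """Return (utf8_bytes, utf16_bytes, reserved_at_3_per_char) for text."""
    utf8_bytes = len(text.encode("utf-8"))       # variable width: 1..3 bytes for BMP text
    utf16_bytes = len(text.encode("utf-16-be"))  # 2 bytes per BMP character, no BOM
    reserved = 3 * len(text)                     # assumption: fixed 3 bytes per character
    return utf8_bytes, utf16_bytes, reserved


if __name__ == "__main__":
    for sample in ("Firebird", "Grüße"):
        print(sample, sizes(sample))
```

For pure ASCII such as "Firebird" the UTF-8 encoding itself is the
smallest, but any buffer sized at 3 bytes per character is larger
than the UTF-16 form.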

I am tempted to propose adding UTF-16 (representing each [this is not
quite true, ask if you dare] character with two bytes) support to
Firebird - and actively promoting its use, obsoleting UNICODE_FSS.

Note that UTF-16 can be implemented right now, even while there is no
direct support for Wide Characters in FB i18n architecture: just
define the beast as a rather funny MBCS having minbytesperchar =
maxbytesperchar = 2.
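
As a rough illustration of what "minbytesperchar = maxbytesperchar
= 2" buys, here is a sketch in Python. It is not the Firebird i18n
interface; char_length and substring are hypothetical helpers, and
big-endian byte order is an arbitrary assumption. With a fixed width,
length and substring become plain byte arithmetic.

```python
BYTES_PER_CHAR = 2  # minbytesperchar = maxbytesperchar = 2


def char_length(buf: bytes) -> int:
    """Character count of a fixed-width buffer: just a division."""
    return len(buf) // BYTES_PER_CHAR


def substring(buf: bytes, start: int, count: int) -> bytes:
    """Slice by character position without scanning the data."""
    return buf[start * BYTES_PER_CHAR:(start + count) * BYTES_PER_CHAR]


if __name__ == "__main__":
    data = "Firebird".encode("utf-16-be")
    print(char_length(data))                         # 8 two-byte units
    print(substring(data, 0, 4).decode("utf-16-be"))

    # The "not quite true" caveat: a character outside the BMP takes
    # two 16-bit units and is therefore counted as two "characters".
    clef = "a\U0001D11E".encode("utf-16-be")
    print(char_length(clef))
```

The last case shows the price of the simplification: the charset
treats 16-bit code units, not full Unicode characters, as its
"characters".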

PROs and CONs of using UTF-16 instead of UTF-8

PROs:
P1) Strange behaviour of UNICODE_FSS due to variable byte length of
characters can be avoided
P2) Uses network bandwidth and memory about 33% more efficiently
(2 bytes per character instead of 3)

CONs:
C1) Uses disk bandwidth less efficiently for near-ASCII data.
C2) Trailing spaces compress worse, also bad for disk bandwidth
C3) Looks strange in tools connecting with character set none,
as there are 0 bytes embedded in the data

(C2 can be avoided and C3 can be weakened by not storing
the 16-bit Unicode value directly, but a transformed one)
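
One possible transformation, purely as an assumption for illustration
(no specific transform is fixed here): XOR the high byte of every
16-bit unit with 0x20. A space (0x0020) then becomes the two bytes
0x20 0x20, so trailing spaces form one long run of identical bytes,
and ASCII text no longer contains 0 bytes. The sketch below is Python,
assumes big-endian storage and a compression scheme that rewards runs
of identical bytes; transform and count_runs are hypothetical names.

```python
HIGH_BYTE_MASK = 0x20  # assumed mask; maps U+0020 to bytes 0x20 0x20


def transform(utf16_be: bytes) -> bytes:
    """XOR the high byte of each 16-bit unit; applying it twice restores the input."""
    out = bytearray(utf16_be)
    for i in range(0, len(out), 2):
        out[i] ^= HIGH_BYTE_MASK
    return bytes(out)


def count_runs(data: bytes) -> int:
    """Number of runs of identical bytes; fewer runs suit run-length compression."""
    runs = 0
    prev = None
    for b in data:
        if b != prev:
            runs += 1
            prev = b
    return runs


if __name__ == "__main__":
    raw = ("ab" + " " * 6).encode("utf-16-be")  # value padded with trailing spaces
    stored = transform(raw)
    print(count_runs(raw), count_runs(stored))
    print(b"\x00" in raw, b"\x00" in stored)
    print(transform(stored) == raw)
```

Characters in the range U+2000..U+20FF would still produce a 0 high
byte under this mask, so C3 is only weakened, not removed.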

Regards,
Peter Jacobi
Hamburg, Germany