Subject | UTF-8 vs UTF-16 |
---|---|
Author | peter_jacobi.rm |
Post date | 2003-08-15T16:56:02Z |
Given that:
a) the implementation of UTF-8 (UNICODE_FSS) is not without problems,
to say the least, and
b) one of the goals of UTF-8 is saving space for near-ASCII data, but
in Firebird only on-disk rows benefit, whereas in-memory, on-the-wire
and index data pay 3 bytes for every character,
I am tempted to propose adding UTF-16 (representing each [this is not
quite true, ask if you dare] character with two bytes) support to
Firebird - and actively promoting its use, obsoleting UNICODE_FSS.
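On the bracketed caveat: UTF-16 uses a single 16-bit code unit only for characters in the Basic Multilingual Plane; anything above U+FFFF needs a surrogate pair, i.e. four bytes. A small Python check, purely as an illustration (nothing here is Firebird-specific, and the helper name `utf16_len` is made up for this sketch):

```python
def utf16_len(ch: str) -> int:
    """Bytes one character occupies in UTF-16 (little-endian, no BOM)."""
    return len(ch.encode("utf-16-le"))


samples = {
    "A": "LATIN CAPITAL LETTER A",
    "\u20ac": "EURO SIGN",
    "\U0001D11E": "MUSICAL SYMBOL G CLEF (outside the BMP)",
}

for ch, name in samples.items():
    utf8_len = len(ch.encode("utf-8"))
    print(f"U+{ord(ch):04X} {name}: UTF-16 {utf16_len(ch)} bytes, UTF-8 {utf8_len} bytes")
```

So a charset declared with minbytesperchar = maxbytesperchar = 2 would treat each half of a surrogate pair as its own character.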
Note that UTF-16 can be implemented right now, even though there is no
direct support for wide characters in the FB i18n architecture: just
define the beast as a rather funny MBCS having minbytesperchar =
maxbytesperchar = 2.
PROs and CONs of using UTF-16 instead of UTF-8
PROs:
P1) Strange behaviour of UNICODE_FSS due to variable byte length of
characters can be avoided
P2) Uses network bandwidth and memory 33% more efficiently
(2 bytes per character instead of 3)
CONs:
C1) Uses disk bandwidth less efficiently for near-ASCII data.
C2) Trailing spaces compress worse, also bad for disk bandwidth
C3) Looks strange in tools connecting with character set NONE,
as there are zero bytes (0x00) embedded in the data
(C2 can be avoided and C3 can be weakened by not storing
the 16-bit Unicode value directly, but a transformed one)
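The numbers behind P2, C1, C2 and C3 can be sketched in a few lines of Python. This is only a rough model under stated assumptions: buffers are taken as 3 bytes per declared character for UNICODE_FSS and 2 for a fixed-width UTF-16 charset (as described above), and C2 is modelled as a byte-wise run-length scheme, which the post does not spell out. The names `sizes` and `longest_run` are hypothetical helpers, not Firebird functions.

```python
def sizes(text: str, declared_chars: int) -> dict:
    """Byte counts for one space-padded CHAR(declared_chars) value."""
    padded = text.ljust(declared_chars)
    return {
        # what UTF-8 really needs for this (near-ASCII) value
        "utf8_actual": len(padded.encode("utf-8")),
        # assumed buffer size: 3 bytes per declared character
        "fss_buffer": 3 * declared_chars,
        # assumed buffer size: 2 bytes per declared character
        "utf16_buffer": 2 * declared_chars,
        # C3: zero bytes a CHARACTER SET NONE client would see
        "utf16_zero_bytes": padded.encode("utf-16-le").count(0),
    }


def longest_run(data: bytes) -> int:
    """Length of the longest run of identical bytes (what byte-wise RLE exploits)."""
    best = run = 0
    prev = None
    for b in data:
        run = run + 1 if b == prev else 1
        prev = b
        best = max(best, run)
    return best


s = sizes("Hamburg", 20)
saving = 1 - s["utf16_buffer"] / s["fss_buffer"]
print(s)
print(f"P2: buffer saving of UTF-16 over UNICODE_FSS: {saving:.0%}")

padded = "Hamburg".ljust(20)
print("C2: longest byte run, UTF-8: ", longest_run(padded.encode("utf-8")))
print("C2: longest byte run, UTF-16:", longest_run(padded.encode("utf-16-le")))
```

For this ASCII-only value the UTF-16 form is twice the size of the UTF-8 one (C1), and the trailing spaces alternate 0x20 and 0x00, so a byte-oriented run-length compressor finds no runs at all (C2).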
Regards,
Peter Jacobi
Hamburg, Germany