Subject | Re: [firebird-support] Writing UTF16 to the database
---|---
Author | Olivier Mascia
Post date | 2005-02-19T10:54:04Z
Ann,
On 18-Feb-05 at 20:50, Ann W. Harrison wrote:
> I am reasonably certain (90% confident) that the memory and on-disk
> storage are the same, that the format uses a variable number of bytes
> per character, and that it follows UTF-8 rules:
> for single byte characters, the first bit is zero.
> for multi-byte characters the lead byte has first n bits set to
> one where n is the length of the character in bytes. Those n bits
> are followed by a bit of zero. Subsequent bytes have first two
> bits set to 10.
So, to put it differently: for what Firebird names UNICODE_FSS, storage
is allocated at a fixed size of 3 bytes per character, while what is
actually stored in it is a variable-length multi-byte stream "à la"
UTF-8. Characters that take n bytes in UTF-8 use n bytes in a
UNICODE_FSS buffer, with the buffers dimensioned on the assumption of 3
bytes per character. There are exceptions for system tables, where the
character set is declared UNICODE_FSS yet the storage is based on n
bytes for n characters, because it is assumed that nothing other than
plain 7-bit ASCII will ever be stored in system table strings.
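The lead-byte rules Ann quotes, and the 3-bytes-per-character budgeting described above, can be sketched in Python. This is not Firebird code; `utf8_char_len` is an illustrative helper of my own, checking the bit patterns against what Python's own UTF-8 encoder produces:

```python
def utf8_char_len(lead: int) -> int:
    """Byte length of a UTF-8 sequence, judged from its lead byte alone."""
    if lead & 0b1000_0000 == 0:            # 0xxxxxxx: single byte (ASCII)
        return 1
    if lead & 0b1110_0000 == 0b1100_0000:  # 110xxxxx: 2-byte sequence
        return 2
    if lead & 0b1111_0000 == 0b1110_0000:  # 1110xxxx: 3-byte sequence
        return 3
    if lead & 0b1111_1000 == 0b1111_0000:  # 11110xxx: 4-byte sequence
        return 4
    raise ValueError("not a lead byte (continuation bytes start with 10)")

# UNICODE_FSS-style budgeting: a CHAR(n) buffer reserves n * 3 bytes,
# but the stream stored inside it is ordinary variable-length UTF-8.
for ch in "aé€":
    encoded = ch.encode("utf-8")
    assert utf8_char_len(encoded[0]) == len(encoded)
    print(ch, len(encoded))  # 'a' -> 1, 'é' -> 2, '€' -> 3
```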
Uuh. By any definition, isn't this a mess? ;-)
I wonder what happens if UTF-8 characters using 4 bytes are stored in a
UNICODE_FSS column. Will it work as long as the physical length of
3 × n bytes is not exceeded?
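What Firebird itself would do here is exactly the open question; the snippet below only illustrates the arithmetic behind it. A character outside the Basic Multilingual Plane needs 4 bytes in UTF-8, which already overflows a 3-bytes-per-character budget at n = 1 (`fits_fss_budget` is a hypothetical helper of mine, not anything in Firebird):

```python
# MUSICAL SYMBOL G CLEF, U+1D11E: one character, four UTF-8 bytes.
g_clef = "\U0001D11E"
print(len(g_clef.encode("utf-8")))  # 4

def fits_fss_budget(s: str, declared_chars: int) -> bool:
    """Would s fit a CHAR(declared_chars) buffer sized at 3 bytes/char?"""
    return len(s.encode("utf-8")) <= 3 * declared_chars

print(fits_fss_budget(g_clef, 1))        # False: 4 bytes > 3 reserved
print(fits_fss_budget(g_clef + "a", 2))  # True:  5 bytes <= 6 reserved
```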
Now, I know it's time for me to turn to the ultimate documentation (the
code itself).
Firebird would really benefit from having real UTF-8 storage. One
where declaring a column as:
LASTNAME CHAR(30) CHARACTER SET UTF8
would actually mean 30 characters (not bytes) and would not imply 90
bytes. Now that the code base is starting to look more and more like
real C++ (well, okay, not so much yet), internal handling of strings of
variable byte length should not be an issue. Record storage is already
variable size and should not be a problem.
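The character-versus-byte distinction argued for above can be made concrete in Python (the name and value are purely illustrative):

```python
# A CHAR(30) UTF8 column should be judged by characters, not bytes.
lastname = "Müller-Lüdenscheidt"

chars = len(lastname)                      # code points
bytes_utf8 = len(lastname.encode("utf-8"))

print(chars)       # 19 characters -> fits CHAR(30) by character count
print(bytes_utf8)  # 21 bytes -> yet a fixed 3-byte rule reserves 90
```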
One could even think of a single internal character set, whose storage
encoding would be UTF-8. All character set declarations would then only
indicate the transcoding meant to happen on I/O with connected clients.
LASTNAME CHAR(30) CHARACTER SET ISO8859_1 would then only mean that I
want to write and read that column in ISO8859_1 encoding. The actual
storage and internal handling could simply be Unicode without my
knowing. Strings entering the system would be converted from ISO8859_1
to the internal Unicode storage, and on output the reverse conversion
would happen.
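The proposed transcode-on-I/O model amounts to a round trip like this sketch (Python's codecs stand in for the conversions a server would do; nothing here is actual Firebird behavior):

```python
# A client declaring ISO8859_1 sends bytes in that encoding.
client_value = "café".encode("iso8859_1")

# On input: transcode client charset -> internal Unicode storage (UTF-8).
stored = client_value.decode("iso8859_1").encode("utf-8")

# On output: the reverse conversion, back to the client's declared charset.
returned = stored.decode("utf-8").encode("iso8859_1")

assert returned == client_value  # lossless round trip for Latin-1 data
```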
--
Olivier Mascia