Subject Re: [firebird-support] Writing UTF16 to the database
Author Scott Morgan
Olivier Mascia wrote:

>So, to put it differently, regarding what FB names 'UNICODE_FSS', the
>storage would be allocated at a fixed size of 3 bytes per character,
>though actually storing a multi-byte stream "à la" UTF-8 in it.
>
>Characters which in UTF-8 take n bytes use n bytes in a UNICODE_FSS
>buffer, with the buffers dimensioned on the assumption of 3 bytes per
>character.
>

Looking at the source, FSS appears to be declared with a minimum byte
count of 1 and a maximum of 3 in src/intl/cs_unicode_fss.c[0], which may
explain the 3 bytes per char allocation. cv_unicode_fss.c seems to
handle the full range of char lengths, so I would guess that the
3 bytes/char figure is just an 'averaged' best guess of what a typical
char length could be.
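
As a rough illustration of the numbers involved, here is a small sketch in plain Python using its standard UTF-8 encoder. Nothing in it is taken from the Firebird source; the helper names `utf8_len` and `worst_case_buffer` are made up for this example. It shows how many bytes a few characters take in UTF-8 and what a buffer sized at a flat 3 bytes per character would come to. Note that every code point up to U+FFFF fits in at most 3 UTF-8 bytes, which may or may not be where the maximum of 3 comes from.

```python
# Illustrative only: standard UTF-8 byte counts, not Firebird's own code.
samples = {
    "A": "ASCII letter",
    "\u00e9": "e with acute accent",
    "\u20ac": "euro sign (still in the BMP)",
    "\U0001F600": "emoji (outside the BMP)",
}


def utf8_len(ch: str) -> int:
    """Number of bytes the single character ch takes in UTF-8."""
    return len(ch.encode("utf-8"))


def worst_case_buffer(n_chars: int, max_bytes_per_char: int = 3) -> int:
    """Buffer size when every character is budgeted at the maximum width."""
    return n_chars * max_bytes_per_char


for ch, label in samples.items():
    print(f"U+{ord(ch):04X} {label}: {utf8_len(ch)} byte(s)")

# A 30-character column budgeted at 3 bytes per character.
print(worst_case_buffer(30))
```

The last line corresponds to the 90 bytes mentioned for a 30-character column further down.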

>Firebird would really benefit from having a real UTF-8 storage. One
>where declaring some column as:
>
>LASTNAME CHAR(30) CHARACTER SET UTF8
>
>would actually mean 30 characters (not bytes) and would not imply 90
>bytes. Now that the code base starts to look more and more like real C++
>(well, okay, not so much yet), internal string handling (of variable
>byte length) should not be an issue. Record storage is already variable
>size and should not be a problem.
>
>

But what about the PK constraint that seems to be limited to a fixed 253
bytes?
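
To put a number on that concern, the arithmetic can be sketched as below. The 253-byte limit is simply the figure quoted above, taken here as an assumption, and the sketch ignores any per-key overhead the engine might add. `max_key_chars` is a hypothetical helper, not a Firebird function.

```python
KEY_LIMIT_BYTES = 253  # figure quoted in this thread, assumed rather than verified


def max_key_chars(bytes_per_char: int, limit: int = KEY_LIMIT_BYTES) -> int:
    """Whole characters that fit in the key if each is budgeted bytes_per_char."""
    return limit // bytes_per_char


# Compare a single-byte charset, a 2-byte encoding and a 3-byte budget.
for width in (1, 2, 3):
    print(f"{width} byte(s)/char -> {max_key_chars(width)} chars")
```

Under these assumptions a key column budgeted at 3 bytes per character tops out at 84 characters, against 253 for a single-byte character set.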

>One could even think of a single, unique internal character set, whose
>storage encoding would be using UTF-8. And all character set
>declarations would only indicate the transcoding meant to happen on I/O
>with the clients connected.
>
>

Interestingly, there is a UCS2 definition in src/intl[0] too, which
would make sense for an internal representation.
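
A minimal sketch of that idea, assuming a fixed 2-byte internal form with transcoding only at the client boundary. It uses Python's `utf-16-le` codec, which matches UCS-2 for text restricted to the BMP. The little-endian byte order and the function names `to_internal` and `to_client` are choices made for this example and say nothing about how Firebird actually does it.

```python
def to_internal(text: str) -> bytes:
    """Encode BMP-only text as 2 bytes per character (UCS-2 style)."""
    if any(ord(c) > 0xFFFF for c in text):
        raise ValueError("character outside the BMP cannot be stored as UCS-2")
    return text.encode("utf-16-le")


def to_client(internal: bytes, client_charset: str) -> bytes:
    """Transcode the internal form to whatever charset the client asked for."""
    return internal.decode("utf-16-le").encode(client_charset)


name = "M\u00fcller"
stored = to_internal(name)

print(len(name), len(stored))             # characters vs. stored bytes
print(to_client(stored, "iso-8859-1"))    # one byte per character
print(to_client(stored, "utf-8"))         # variable width
```

The point of the fixed width is that a column's storage depends only on its declared character count, while the client-side character set only affects the bytes that cross the wire.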

Scott

[0] There's a src/intlcpp as well, but that seems to be the same as
src/intl, only with all the c files renamed to cpp.