Subject | Re: [firebird-support] Writing UTF16 to the database |
---|---|
Author | Scott Morgan |
Post date | 2005-02-19T14:02:30Z |
Olivier Mascia wrote:
>So, to say differently, regarding what FB names 'UNICODE_FSS', the
>storage would be allocated on a fixed size of 3 bytes per character,
>though actually storing a multi-byte stream "à la" UTF-8 in it.
>
>Characters which in UTF-8 take n bytes, use n bytes in an UNICODE_FSS
>buffer, with the buffers dimensioned on the assumption of 3 bytes per
>character.
>
Looking at the source it seems FSS has been specified with a min byte
count of 1 and a max of 3 in src/intl/cs_unicode_fss.c[0], which may
explain the 3 byte per char allocation. cv_unicode_fss.c seems to show
that it understands the full range of char lengths so I would guess that
the 3 byte/char figure is just an 'averaged' best guess of what a typical
char length could be.
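
To make the byte counts under discussion concrete, here is a minimal sketch. It uses Python's built-in UTF-8 codec as a stand-in for the FSS encoding and is not Firebird code; the helper names `utf8_len` and `reserved_bytes` are made up for illustration. In standard UTF-8 every code point up to U+FFFF encodes to at most 3 bytes, which may be where the maximum of 3 comes from.

```python
# Illustration only: Python's UTF-8 codec stands in for UNICODE_FSS here.
# utf8_len and reserved_bytes are hypothetical helpers, not Firebird APIs.

def utf8_len(ch: str) -> int:
    """Number of bytes one character occupies once encoded as UTF-8."""
    return len(ch.encode("utf-8"))

def reserved_bytes(declared_chars: int, max_bytes_per_char: int = 3) -> int:
    """Buffer size if every declared character is given the maximum width."""
    return declared_chars * max_bytes_per_char

# One sample character for each of the 1-, 2- and 3-byte encodings.
for ch in ["A", "é", "€"]:
    print(f"U+{ord(ch):04X} takes {utf8_len(ch)} byte(s)")

# A value stored in a CHAR(30) column: bytes actually needed vs. reserved.
name = "Müller"
print(len(name), "characters,", len(name.encode("utf-8")), "bytes needed,",
      reserved_bytes(30), "bytes reserved")
```

The gap between the bytes needed and the bytes reserved is the cost of dimensioning the buffer for the widest character.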
>Firebird would really benefit from having a real UTF-8 storage. One
>where declaring some column as:
>
>LASTNAME CHAR(30) CHARACTER SET UTF8
>
>would actually mean 30 characters (not bytes) and would not imply 90
>bytes. Now that the code base starts to look more and more real C++
>(well, okay, not so much yet), internal string handling (of variable
>byte length) should not be an issue. Record storage is already variable
>size and should not be a problem.
>
But what about the PK constraint that seems to be limited to a fixed 253
bytes?
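
A rough sketch of the arithmetic behind that question, taking the 253-byte figure at face value (it is not verified here, and the real limit may depend on the Firebird version and index layout). The names `max_key_chars` and `fits_in_key` are hypothetical.

```python
# Assumption: a flat 253-byte key limit, as quoted in the message above.
KEY_LIMIT_BYTES = 253

def max_key_chars(bytes_per_char: int, limit: int = KEY_LIMIT_BYTES) -> int:
    """Longest key, in characters, if each character reserves a fixed width."""
    return limit // bytes_per_char

def fits_in_key(declared_chars: int, bytes_per_char: int,
                limit: int = KEY_LIMIT_BYTES) -> bool:
    """True if a column of the declared length fits within the key limit."""
    return declared_chars * bytes_per_char <= limit

print(max_key_chars(3))      # characters available at 3 bytes per character
print(max_key_chars(1))      # characters available for a single-byte set
print(fits_in_key(30, 3))    # the CHAR(30) example: 90 bytes
print(fits_in_key(100, 3))   # a CHAR(100) column: 300 bytes
```

Under these assumptions, a fixed 3 bytes per character cuts the usable key length to roughly a third of what a single-byte character set allows.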
>One could even think of a single, unique internal character set, whose
>storage encoding would be using UTF-8. And all character set
>declarations would only indicate the transcoding meant to happen on I/O
>with the clients connected.
>
Interestingly, there is a UCS2 definition in src/intl[0] too, that would
make sense for internal representation.
Scott
[0] There's a src/intlcpp as well, but that seems to be the same as
src/intl but with all the c files renamed to cpp