Subject:   Re: [firebird-support] Writing UTF16 to the database
Author:    Lester Caine
Post date: 2005-02-19T16:39:14Z
Scott Morgan wrote:
>>So, to say differently, regarding what FB names 'UNICODE_FSS', the
>>storage would be allocated on a fixed size of 3 bytes per character,
>>though actually storing a multi-byte stream "à la" UTF-8 in it.
>>
>>Characters which in UTF-8 take n bytes use n bytes in a UNICODE_FSS
>>buffer, with the buffers dimensioned on the assumption of 3 bytes per
>>character.
>>
> Looking at the source, it seems FSS has been specified with a min byte
> count of 1 and a max of 3 in src/intl/cs_unicode_fss.c[0], which may
> explain the 3-bytes-per-char allocation. cv_unicode_fss.c seems to show
> that it understands the full range of char lengths, so I would guess that
> the 3 bytes/char figure is just an 'averaged' best guess at what a typical
> char length could be.
OK from discussions elsewhere ...

UTF-8 can take up to 6 bytes to store a character; UTF-16 needs 2 or 4
bytes to store the SAME data. The difference is that the bytes used to
flag the need for an EXTRA byte are rather wasteful in UTF-8.
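(A minimal sketch, not Firebird code: the encoded sizes per code point,
with the 5- and 6-byte UTF-8 cases coming from the original FSS-UTF
definition that predates the 4-byte limit. The function names are mine.)

#include <stdint.h>

/* Bytes needed to encode one code point in UTF-8
   (original definition, allowing up to 6 bytes). */
static int utf8_len(uint32_t cp)
{
    if (cp < 0x80)      return 1;
    if (cp < 0x800)     return 2;
    if (cp < 0x10000)   return 3;
    if (cp < 0x200000)  return 4;
    if (cp < 0x4000000) return 5;
    return 6;
}

/* Bytes needed in UTF-16: 2, or 4 via a surrogate pair. */
static int utf16_len(uint32_t cp)
{
    return cp < 0x10000 ? 2 : 4;
}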
The RAW data is only 21 bits (code points up to U+10FFFF), so it CAN be
stored in three bytes. UTF-32 uses four bytes, but the top byte is
always zero, so internal storage as three bytes loses nothing.
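(To illustrate the three-byte idea - pack24/unpack24 are hypothetical
names here, nothing from the Firebird source:)

#include <stdint.h>

/* Any Unicode code point (max U+10FFFF, 21 bits) fits in 3 bytes. */
static void pack24(uint32_t cp, uint8_t out[3])
{
    out[0] = (uint8_t)(cp >> 16);  /* at most 5 bits ever used here */
    out[1] = (uint8_t)(cp >> 8);
    out[2] = (uint8_t)cp;
}

static uint32_t unpack24(const uint8_t in[3])
{
    return ((uint32_t)in[0] << 16) | ((uint32_t)in[1] << 8) | in[2];
}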
The FSS comes from File System Safe, and just means that no '00' byte
ever appears inside a multi-byte UTF-8 sequence, so '00' can safely be
used as the terminating byte of the string.
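(The safety property falls out of the bit patterns: every byte of a
multi-byte sequence has its high bit set, so a '00' byte can only be
the one-byte encoding of U+0000. A sketch limited to the 1-3 byte
cases that UNICODE_FSS allocates for; utf8_encode is my name:)

#include <stdint.h>

static int utf8_encode(uint32_t cp, uint8_t *out)
{
    if (cp < 0x80) {
        out[0] = (uint8_t)cp;                     /* 0xxxxxxx */
        return 1;
    }
    if (cp < 0x800) {
        out[0] = 0xC0 | (uint8_t)(cp >> 6);       /* 110xxxxx */
        out[1] = 0x80 | (uint8_t)(cp & 0x3F);     /* 10xxxxxx */
        return 2;
    }
    out[0] = 0xE0 | (uint8_t)(cp >> 12);          /* 1110xxxx */
    out[1] = 0x80 | (uint8_t)((cp >> 6) & 0x3F);  /* 10xxxxxx */
    out[2] = 0x80 | (uint8_t)(cp & 0x3F);         /* 10xxxxxx */
    return 3;
}

Note every byte after the first is 0x80 | something, so '00' never
appears inside a sequence and C-style string termination stays safe.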
So what SHOULD happen with UNICODE_FSS is that UTF-8, UTF-16 and UTF-32
strings can be written into the internal buffer without any loss of
content, and UNICODE_FSS characters can then be converted back to
UTF-8, UTF-16 and UTF-32 as required. So what is perhaps missing is the
client interfaces handling the string conversions.
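(Why the conversion is lossless: the three encodings are just different
spellings of the same code point. A sketch of the UTF-16 side, assuming
well-formed input; utf16_decode is my name:)

#include <stdint.h>

/* Returns the number of 16-bit units consumed (1 or 2). */
static int utf16_decode(const uint16_t *in, uint32_t *cp)
{
    if (in[0] >= 0xD800 && in[0] <= 0xDBFF) {      /* high surrogate */
        *cp = 0x10000
            + (((uint32_t)(in[0] - 0xD800)) << 10)
            + (uint32_t)(in[1] - 0xDC00);          /* low surrogate */
        return 2;
    }
    *cp = in[0];
    return 1;
}

Feed the resulting code point to a UTF-8 encoder and you get exactly
the bytes UNICODE_FSS stores, and back again.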
Since all Unicode has thereby been reduced to its simplest form,
sorting and collations should be easy, but as yet only the simplest
sorting is available. THAT is the area the INTL developers are
working on.
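('Simplest sorting' presumably means binary comparison of the stored
bytes. A handy property of UTF-8 is that byte-wise order matches
code-point order, so even the naive sort is at least consistent - it
just isn't a linguistic collation. A sketch, again with my own names:)

#include <string.h>

static int fss_compare(const char *a, size_t alen,
                       const char *b, size_t blen)
{
    size_t n = alen < blen ? alen : blen;
    int r = memcmp(a, b, n);    /* memcmp order == code-point order */
    if (r != 0)
        return r;
    return (alen > blen) - (alen < blen);
}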
--
Lester Caine
-----------------------------
L.S.Caine Electronic Services