Subject Re: [firebird-support] Writing UTF16 to the database
Author David Johnson
I see we have a fellow Java person on the list. :o)

I guess we first need to let (or help) the INTL team finish cleaning up
the MBCS code. UTF8 is an obvious and useful extension/test case for
the multi-byte character set code base.

In this scenario, the "CHAR" field would actually be implemented as a
VARCHAR, as Oracle did before version 9.



On Sat, 2005-02-19 at 04:54, Olivier Mascia wrote:
>
>
> Ann,
>
> On 18 Feb 2005, at 20:50, Ann W. Harrison wrote:
>
> > I am reasonably certain (90% confident) that the memory and on-disk
> > storage are the same, that the format uses a variable number of bytes
> > per character, and that it follows UTF-8 rules:
> > for single-byte characters, the first bit is zero;
> > for multi-byte characters, the lead byte has its first n bits set
> > to one, where n is the length of the character in bytes. Those n
> > bits are followed by a zero bit. Subsequent bytes have their first
> > two bits set to 10.
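
Just to make those bit rules concrete, here is a quick Java sketch of my
own (nothing from the engine, the class name is made up) that classifies
a lead byte the way Ann describes:

    // Quick sketch (not engine code): classify a UTF-8 lead byte by
    // looking at its leading one-bits, as described above.
    public class Utf8Rules {

        // Returns the byte length of the sequence this lead byte starts,
        // or -1 if it is a continuation byte (10xxxxxx).
        static int sequenceLength(byte lead) {
            int b = lead & 0xFF;
            if ((b & 0x80) == 0x00) return 1;  // 0xxxxxxx: single byte
            if ((b & 0xE0) == 0xC0) return 2;  // 110xxxxx: two bytes
            if ((b & 0xF0) == 0xE0) return 3;  // 1110xxxx: three bytes
            if ((b & 0xF8) == 0xF0) return 4;  // 11110xxx: four bytes
            return -1;                         // 10xxxxxx: continuation
        }

        public static void main(String[] args) throws Exception {
            byte[] eAcute = "\u00e9".getBytes("UTF-8"); // U+00E9, 2 bytes
            System.out.println(sequenceLength(eAcute[0])); // prints 2
        }
    }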
>
> So, to put it differently: for what FB names 'UNICODE_FSS', the
> storage would be allocated at a fixed size of 3 bytes per character,
> though what is actually stored in it is a multi-byte stream "à la"
> UTF-8.
>
> Characters which take n bytes in UTF-8 also use n bytes in a
> UNICODE_FSS buffer, with the buffers dimensioned on the assumption of
> 3 bytes per character. There are some exceptions for system tables,
> where the character set is declared UNICODE_FSS yet the storage is
> based on n bytes for n characters, because it is assumed that nothing
> other than plain 7-bit ASCII will ever be stored in system table
> strings.
>
> Uuh. By any and all definitions, isn't this called a mess? ;-)
>
> I wonder what will happen if some UTF-8 characters using 4 bytes are
> stored in a UNICODE_FSS column. Will it work as long as the physical
> length of 3 x n is not exceeded?
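
A throw-away Java illustration of that sizing question (my own sketch,
not Firebird code; it just applies the 3-bytes-per-character assumption
to a hypothetical CHAR(30) column):

    // Rough sketch, not engine code: a CHAR(n) UNICODE_FSS buffer is
    // sized at 3 bytes per declared character, while the stored value
    // needs as many bytes as its UTF-8 encoding happens to use.
    public class FssSizing {
        public static void main(String[] args) throws Exception {
            int declaredChars = 30;
            int bufferBytes = declaredChars * 3;   // 90 bytes reserved

            String value = "Stra\u00dfe";          // 6 characters, 7 UTF-8 bytes
            System.out.println(value.getBytes("UTF-8").length
                    + " of " + bufferBytes + " bytes used");

            // A character outside the BMP, e.g. U+1D11E (musical G clef),
            // takes 4 UTF-8 bytes, more than the 3 assumed per character.
            String clef = new String(Character.toChars(0x1D11E));
            System.out.println(clef.getBytes("UTF-8").length); // prints 4
        }
    }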
>
> Now, I know it's time for me to turn to the ultimate documentation (the
> code itself).
>
> Firebird would really benefit from having real UTF-8 storage. One
> where declaring some column as:
>
> LASTNAME CHAR(30) CHARACTER SET UTF8
>
> would actually mean 30 characters (not bytes) and would not imply 90
> bytes. Now that the code base starts to look more and more like real
> C++ (well, okay, not so much yet), internal string handling (of
> variable byte length) should not be an issue. Record storage is
> already variable-size and should not be a problem.
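
To illustrate the character-versus-byte distinction, a plain Java
snippet (nothing Firebird-specific, the sample name is made up):

    // Character count versus UTF-8 byte count for the same value.
    public class CharVsBytes {
        public static void main(String[] args) throws Exception {
            String lastname = "M\u00fcller-L\u00fcdenscheidt";
            int chars = lastname.codePointCount(0, lastname.length());
            int bytes = lastname.getBytes("UTF-8").length;
            // The two umlauts take 2 bytes each, so bytes > chars here.
            System.out.println(chars + " characters, " + bytes + " bytes");
        }
    }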
>
> One could even think of a single, unique internal character set whose
> storage encoding would be UTF-8. All character set declarations would
> then only indicate the transcoding meant to happen on I/O with the
> connected clients.
>
> LASTNAME CHAR(30) CHARACTER SET ISO8859_1 would only mean that I want
> to store and read that column in ISO8859_1 encoding. The actual
> storage and internal handling could simply be Unicode without me
> knowing. Strings entering the system would be converted from
> ISO8859_1 to the internal Unicode storage, and on output the reverse
> conversion would happen.
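
In Java terms the principle is nothing more exotic than this (a sketch
of the idea only, not of any actual Firebird interface):

    // Sketch of the transcode-on-I/O idea (not an actual Firebird API):
    // the client declares ISO8859_1, the hypothetical internal storage
    // is always UTF-8.
    public class TranscodeSketch {

        // Client -> engine: decode the client's ISO-8859-1 bytes and
        // re-encode them as UTF-8 for storage.
        static byte[] toInternal(byte[] clientBytes) throws Exception {
            return new String(clientBytes, "ISO-8859-1").getBytes("UTF-8");
        }

        // Engine -> client: decode the stored UTF-8 and hand back
        // ISO-8859-1 bytes.
        static byte[] toClient(byte[] internalBytes) throws Exception {
            return new String(internalBytes, "UTF-8").getBytes("ISO-8859-1");
        }

        public static void main(String[] args) throws Exception {
            byte[] fromClient = "Lef\u00e8vre".getBytes("ISO-8859-1");
            byte[] stored = toInternal(fromClient);
            System.out.println(new String(toClient(stored), "ISO-8859-1"));
        }
    }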
>
> --
> Olivier Mascia