Subject | Re: [firebird-support] Writing UTF16 to the database |
---|---|
Author | Scott Morgan |
Post date | 2005-02-19T18:00:10Z |
Lester Caine wrote:
>OK from discussions elsewhere ...
>UTF-8 can take up to 6 bytes to store data, UTF-16 needs 2 or 4 bytes to
>store the SAME data. The difference being that the bytes used to flag
>the need for an EXTRA byte are rather wasteful in UTF-8.
>The RAW data is only 21/22 bits so CAN be stored in three bytes. UTF-32
>uses four bytes, but the fourth byte is always empty, so internal
>storage as three bytes does not cause any problems.
>The FSS comes from File System Safe and just means that '00' bytes that
>would form part of a UTF-8 sequence are removed so that '00' can be used
>as the final byte of the string.
>
UNICODE_FSS _is_ UTF-8, it's just an old name for it. We've been over
this several times.
http://www.unicode.org/glossary/#FSS_UTF
The File System Safe bit of the name means that there isn't an endian
issue with it like the 16 and 32 bit encodings have. It has nothing to
do with embedded 0x00 bytes as the only UTF-8 char encoding with 0x00 in
it is 'NUL'.
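
To make the two points above concrete, here is a small sketch in Python (my choice for illustration, nothing to do with the Firebird API itself). The sample string is an arbitrary assumption: 'A' followed by the euro sign. It shows that UTF-8 produces one byte sequence regardless of byte order and contains no 0x00 bytes, while UTF-16 gives two different byte sequences depending on endianness.

```python
# Sample text is an assumption for illustration: 'A' plus U+20AC (euro sign).
text = "A\u20ac"

utf8 = text.encode("utf-8")
utf16_le = text.encode("utf-16-le")  # little-endian, no BOM
utf16_be = text.encode("utf-16-be")  # big-endian, no BOM

print(utf8.hex(" "))      # 41 e2 82 ac
print(utf16_le.hex(" "))  # 41 00 ac 20
print(utf16_be.hex(" "))  # 00 41 20 ac

# Only U+0000 itself encodes to a 0x00 byte in UTF-8.
print("\x00".encode("utf-8"))  # b'\x00'
print(0 in utf8)               # False
```

Note that `bytes.hex(" ")` with a separator needs Python 3.8 or later.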
Apart from the lack of endian issues, the other reason to use UTF-8 (FSS)
is that the API (fbclient.dll) functions only pass/receive C char type
strings, not wide chars (wchar_t or similar). So if you were to pass a
wide char string to the API, a) you'd have to somehow tell the API you
were passing a wide char string so that it knew to handle it correctly,
and b) you'd have to cast the string pointer from a wchar_t* (or
similar) to a char*, which is generally very bad practice.
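
As a rough illustration of why the cast is a problem, the sketch below uses Python's `ctypes` char buffer as a stand-in for a `char*` parameter. It does not call fbclient.dll, and the sample string is an assumption. Code that treats the buffer as a NUL-terminated C string sees the whole UTF-8 text, but stops at the first 0x00 byte of the UTF-16 data.

```python
import ctypes

# Sample text is an assumption for illustration: 'A' plus U+20AC (euro sign).
text = "A\u20ac"

# A char buffer holding UTF-8 bytes, as a char-based API would expect.
buf8 = ctypes.create_string_buffer(text.encode("utf-8"))

# The same text as raw UTF-16 bytes pushed into a char buffer, which is
# roughly what casting a wchar_t* to a char* amounts to.
buf16 = ctypes.create_string_buffer(text.encode("utf-16-le"))

# .value reads the buffer as a NUL-terminated C string.
print(buf8.value)   # b'A\xe2\x82\xac'
print(buf16.value)  # b'A'  (cut off at the first 0x00 byte)
```

So the safer route is to convert wide strings to UTF-8 on the client side and pass the result as an ordinary char string.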
More likely is that in some future version of FB (3, 4? maybe even
later) they'll add a specialised wide char version of the FB API, but I
wouldn't hold my breath on that one.
Scott