firebird-support - Re: [firebird-support] Writing UTF16 to the database

Subject	Re: [firebird-support] Writing UTF16 to the database
Author	Scott Morgan
Post date	2005-02-16T03:46:03Z

Dave Smith wrote:

>I'm writing in C and have started using wchar_t strings internally, and
>wish to write the contents of these to a firebird db. However, having
>set the character set for the VARCHAR field to Unicode_fss, only the
>first character of the field is written to the field (as I guess it
>finds that the second byte in the string is zero). Is there a function I
>can use to write a wchar_t string (UTF-16) to this field?
>
>Alternatively, is there a way of encoding a wchar_t to a single byte
>encoding (must be able to store international data - maybe utf-8?) so
>that IT could go into firebird instead?
>

I've been meaning to ask the dev list for confirmation, as there doesn't
seem to be an authoritative answer on this list; at least there wasn't
last week when a similar question popped up. But from what I can make
out, Unicode support seems to work like this:

UNICODE_FSS is basically UTF-8 (FSS-UTF was an old name for what became
UTF-8 [0]; FSS stands for File System Safe, since multi-byte sequences
never contain a zero byte). Being byte-oriented, it also doesn't have
the endian issue that UTF-16/32 have.
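
On the wchar_t question: one option is to convert to UTF-8 yourself
before handing the string to FB. The following is only a minimal sketch.
The function name utf16_to_utf8 is made up for illustration and is not
part of the Firebird API, and it assumes the input is 16-bit UTF-16 code
units (as wchar_t is on Windows; on platforms with a 32-bit wchar_t you
would need to adapt it).

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper, not part of the Firebird API.
 * Converts a zero-terminated UTF-16 string to zero-terminated UTF-8.
 * Returns the number of bytes written (excluding the terminator), or
 * (size_t)-1 on an unpaired surrogate or if the buffer is too small. */
size_t utf16_to_utf8(const uint16_t *in, char *out, size_t out_size)
{
    size_t n = 0;

    if (out_size == 0)
        return (size_t)-1;

    while (*in) {
        uint32_t cp = *in++;

        if (cp >= 0xD800 && cp <= 0xDBFF) {
            /* High surrogate: must be followed by a low surrogate. */
            if (*in < 0xDC00 || *in > 0xDFFF)
                return (size_t)-1;
            cp = 0x10000 + ((cp - 0xD800) << 10) + (uint32_t)(*in++ - 0xDC00);
        } else if (cp >= 0xDC00 && cp <= 0xDFFF) {
            /* Low surrogate without a preceding high surrogate. */
            return (size_t)-1;
        }

        size_t len = cp < 0x80 ? 1 : cp < 0x800 ? 2 : cp < 0x10000 ? 3 : 4;
        if (n + len + 1 > out_size)   /* keep room for the terminator */
            return (size_t)-1;

        switch (len) {
        case 1:
            out[n++] = (char)cp;
            break;
        case 2:
            out[n++] = (char)(0xC0 | (cp >> 6));
            out[n++] = (char)(0x80 | (cp & 0x3F));
            break;
        case 3:
            out[n++] = (char)(0xE0 | (cp >> 12));
            out[n++] = (char)(0x80 | ((cp >> 6) & 0x3F));
            out[n++] = (char)(0x80 | (cp & 0x3F));
            break;
        default:
            out[n++] = (char)(0xF0 | (cp >> 18));
            out[n++] = (char)(0x80 | ((cp >> 12) & 0x3F));
            out[n++] = (char)(0x80 | ((cp >> 6) & 0x3F));
            out[n++] = (char)(0x80 | (cp & 0x3F));
            break;
        }
    }

    out[n] = '\0';
    return n;
}
```

The result contains no embedded zero bytes, so it can be treated as an
ordinary char string, which avoids the "only the first character is
written" symptom.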

When you declare a [VAR]CHAR field in the UNICODE_FSS character set you
actually get 3 bytes allocated for each char, so CHAR(10) is actually 30
bytes of space. Now, and this is one area that I want to check with the
devs, I assume the data is stored as UTF-8 in this byte array, so you
could store 30 single-byte chars in that field (assuming FB checks the
byte count rather than the char count), or only 7 4-byte chars (the
largest UTF-8 encoding currently needed for Unicode). In general,
however, most commonly used Unicode characters fit in UTF-8 encodings of
3 bytes or fewer, so things should work out if your char count is at or
below the declared field size.
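
To make the arithmetic concrete, here is a rough sketch that measures a
UTF-8 string and compares it against the 3-bytes-per-char allocation
described above. Both function names are invented for illustration, the
input is assumed to be well-formed UTF-8, and the byte-only check
reflects my unconfirmed assumption about FB; it may well enforce the
char count too.

```c
#include <stddef.h>

/* Hypothetical helper: counts the bytes and code points in a
 * zero-terminated string that is assumed to be well-formed UTF-8. */
void utf8_measure(const char *s, size_t *bytes, size_t *chars)
{
    *bytes = 0;
    *chars = 0;
    for (; *s; ++s) {
        ++*bytes;
        /* Continuation bytes look like 10xxxxxx; every other byte
         * starts a new code point. */
        if (((unsigned char)*s & 0xC0) != 0x80)
            ++*chars;
    }
}

/* Hypothetical check: does the string fit in the bytes allocated for a
 * UNICODE_FSS [VAR]CHAR(declared_chars), assuming 3 bytes per declared
 * char and assuming only the byte count is limited? */
int fits_unicode_fss(const char *s, size_t declared_chars)
{
    size_t bytes, chars;
    utf8_measure(s, &bytes, &chars);
    return bytes <= 3 * declared_chars;
}
```

Under those assumptions, 7 4-byte chars (28 bytes) fit in a CHAR(10)
field while 8 of them (32 bytes) do not.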

The complex bit comes with getting data into, and out of, FB. Again, this
is an area I'd like confirmed. It seems that I can send/receive UTF-8
encoded text through the C API okay, but that's with a DB created with
the 'DEFAULT CHARACTER SET UNICODE_FSS' option. I'm not sure whether
anything is affected by this: whether a DB created with some other
default character set will mangle the UTF-8 down to that particular char
set, or whether a DB connection using the UNICODE_FSS char set will
transform normal char set fields to UTF-8.

Of course, this gets even more confusing when you add in an extra layer
like ODBC, JDBC etc., but that's an issue for the driver maintainers.

Scott

[0] http://www.unicode.org/glossary/index.html#F