Subject | Character Set Support
---|---
Author | Jim Beesley
Post date | 2004-02-26T16:53:16Z
I've been having a challenging time getting support for multiple character sets to work with Firebird 1.5rc7, so Helen kindly suggested that I ask here for details about how this was meant to work.
The basic problem is that if you have a table with 2 columns, each of which has a different arbitrary character set (let's say UNICODE_FSS and ISO8859_1 just as an example), there currently seems to be no way to connect to Firebird (using the isc_dsql_* interface anyway) that guarantees you'll be able to insert and fetch the data.
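For concreteness, such a table can be created through the same isc_dsql interface the rest of this post talks about. This is just an illustrative sketch - the table and column names are made up, and the database and transaction handles are assumed to be attached/started elsewhere:

```cpp
#include <ibase.h>

// Sketch only: handle setup (isc_attach_database / isc_start_transaction)
// is assumed to have happened elsewhere; MIXED_CS is a made-up name.
void create_mixed_charset_table(isc_db_handle& db, isc_tr_handle& trans)
{
    ISC_STATUS_ARRAY status;
    const char* ddl =
        "CREATE TABLE MIXED_CS ("
        "  COL_FSS VARCHAR(20) CHARACTER SET UNICODE_FSS,"
        "  COL_ISO VARCHAR(20) CHARACTER SET ISO8859_1)";

    // Length 0 => null-terminated statement; dialect 3; no XSQLDA for DDL.
    if (isc_dsql_execute_immediate(status, &db, &trans, 0, ddl, 3, NULL))
        isc_print_status(status);
}
```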
The details are actually a bit more involved (see my posts in Firebird-devel and Firebird-Support), but they basically boil down to a few facts:
* If you pick one of the character sets for the connection, there may be characters in the other character set that can't be represented in the connection character set - the only three possibilities I could see to avoid this are connecting with UNICODE_FSS, NONE, or Peter Jacobi's UTF16BE implementation.
* If you connect as UNICODE_FSS, then the server will be unable to return any data whose UNICODE_FSS representation is larger in bytes than the source column. An example would be a CHAR(3) CHARACTER SET ISO8859_1 that contained "ììì" - the size in ISO8859_1 is 3 bytes, but the size in UNICODE_FSS is 6 bytes (the byte arithmetic is worked out in the first sketch after this list). Internally the intermediate UCS2 buffer size is fine, but Firebird then tries to transcode the UCS2 string into a 3-byte UNICODE_FSS buffer, regardless of the buffer size specified in the SQLVAR. I tried various changes to the server code to work around this, but you eventually start bumping up against the ~32K max column and ~64K max row limits.
* As for connecting with UTF16BE - some of the isc_dsql functions (e.g. isc_dsql_prepare) don't appear to be written to accept strings with embedded NUL bytes (although maybe I'm missing something obvious here). Regardless, most of the size issues listed above for UNICODE_FSS still apply.
* If you connect using character set NONE, things are initially a bit more promising. Character data is always returned (as near as I can tell) in the SQLVAR buffers "as-is", without transcoding, and you can look at the sqlsubtype field of the SQLVAR to see what the actual character set is. Since my app has pretty good transcoding support, I'm perfectly happy with that. The problems start cropping up when I try to insert character data. I need to be able to specify Unicode literals in SQL, and I can do that even when connected using NONE by sending in UTF8 strings and using an "introducer" of _UNICODE_FSS before each literal (see the second sketch after this list) - not ideal, but workable. Parameters are pretty ugly, though - the current code appears to try to transcode all incoming data from NONE into the target character set, rather than treating it "as-is" like the fetch code does. This causes a transcoding error whenever a character above 127 appears in parameter data - a pretty extreme limitation.
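To make the size problem in the second bullet concrete, here's a tiny standalone sketch of the byte counts. Nothing here is Firebird-specific - it's just the UTF-8 encoding rule that UNICODE_FSS follows for these code points:

```cpp
#include <cstddef>
#include <cstdio>

// Every ISO8859_1 code point below 0x80 stays 1 byte in UTF-8/UNICODE_FSS;
// everything from 0x80 to 0xFF grows to 2 bytes.
static std::size_t utf8_len_of_latin1(const unsigned char* s, std::size_t n)
{
    std::size_t bytes = 0;
    for (std::size_t i = 0; i < n; ++i)
        bytes += (s[i] < 0x80) ? 1 : 2;
    return bytes;
}

int main()
{
    const unsigned char iso[] = { 0xEC, 0xEC, 0xEC };  // "ììì" in ISO8859_1
    std::printf("ISO8859_1: %zu bytes, UNICODE_FSS: %zu bytes\n",
                sizeof(iso), utf8_len_of_latin1(iso, sizeof(iso)));
    // Prints: ISO8859_1: 3 bytes, UNICODE_FSS: 6 bytes - too big for the
    // 3-byte buffer Firebird transcodes into for a CHAR(3) column.
    return 0;
}
```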
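And here's what the introducer workaround from the last bullet looks like in practice - a sketch over a connection made with character set NONE, reusing the made-up table from above:

```cpp
#include <ibase.h>

// Sketch only: db/trans handles assumed to exist; the connection was made
// with lc_ctype = NONE.
void insert_unicode_literal(isc_db_handle& db, isc_tr_handle& trans)
{
    ISC_STATUS_ARRAY status;
    // "\xC3\xAC" is 'ì' encoded as UTF-8. The _UNICODE_FSS introducer tells
    // the server how to interpret the literal's bytes, independently of the
    // NONE connection character set.
    const char* sql =
        "INSERT INTO MIXED_CS (COL_FSS) "
        "VALUES (_UNICODE_FSS '\xC3\xAC\xC3\xAC\xC3\xAC')";

    if (isc_dsql_execute_immediate(status, &db, &trans, 0, sql, 3, NULL))
        isc_print_status(status);
}
```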
It's a pretty simple patch to have the Firebird INTL code skip transcoding whenever the source or target character set is CS_dynamic (the connection character set) resolving to CS_NONE. The problem is that I don't know whether this is how things are supposed to work, or whether it potentially breaks something much more important. If you're interested, the trivial change consists of:
```diff
diff -u -r1.32.2.1 intl.cpp
--- firebird2/src/jrd/intl.cpp 23 Nov 2003 17:13:44 -0000 1.32.2.1
+++ firebird2/src/jrd/intl.cpp 19 Dec 2003 16:56:31 -0000
@@ -666,6 +666,12 @@
 from_cs = INTL_charset(tdbb, INTL_TTYPE(from), err);
 to_cs = INTL_charset(tdbb, INTL_TTYPE(to), err);
+ // No conversion if going to/from ConnectionCS==NONE
+ if ((INTL_TTYPE(from) == CS_dynamic) && (from_cs == CS_NONE))
+     from_cs = to_cs;
+ else if ((INTL_TTYPE(to) == CS_dynamic) && (to_cs == CS_NONE))
+     to_cs = from_cs;
+
 p = to->dsc_address;
 /* Must convert dtype(cstring,text,vary) and ttype(ascii,binary,..intl..) */
```
There are probably better ways to accomplish the goal, though.
The ideal solution would be to have the sqlsubtype field of the SQLVAR always be used, both for input and output, but from looking at the code that isn't likely to happen anytime soon.
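For what it's worth, here's roughly what "use sqlsubtype" already looks like on the fetch side. As far as I can tell, after prepare/describe a text column's XSQLVAR carries the character set id in the low byte of sqlsubtype (with the collation in the high byte):

```cpp
#include <ibase.h>
#include <cstdio>

// Sketch: sqlda is assumed to have been allocated and filled in by
// isc_dsql_describe for an already-prepared statement.
void report_column_charsets(const XSQLDA* sqlda)
{
    for (short i = 0; i < sqlda->sqld; ++i)
    {
        const XSQLVAR& var = sqlda->sqlvar[i];
        const short dtype = var.sqltype & ~1;  // strip the nullable flag bit
        if (dtype == SQL_TEXT || dtype == SQL_VARYING)
        {
            const short charset_id = var.sqlsubtype & 0xFF;  // e.g. 3 == UNICODE_FSS
            // On a NONE connection the buffer arrives untranscoded, so this
            // id is exactly what a client-side transcoder needs to know.
            std::printf("column %d charset id: %d\n", i, charset_id);
        }
    }
}
```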
-So-
How is support for multiple character sets supposed to be handled?
More specifically, if you connect using character set NONE, what is supposed to happen and how bad would it be to change the current behavior?
Thanks in advance for any enlightenment,
-Jim Beesley