Subject Character Set Support
Author jimbeesleyx
I've been having a challenging time getting support for multiple
character sets to work with Firebird 1.5rc7, so Helen kindly
suggested that I ask here for details about how this was meant to
work.

The basic problem is that if you have a table with 2 columns, each of
which has a different arbitrary character set (lets say UNICODE_FSS
and ISO8859_1 just as an example), there currently seems to be no way
to connect to Firebird (using the isc_dsql_* interface anyway) that
guarantees you'll be able to insert and fetch the data.

The details are actually a bit more involved (see my posts in
Firebird-devel and Firebird-Support), but they basically boil down to
a couple of facts:

* If you pick one of the character sets to connect, there may be
arbitrary characters in the other character set that can't be
represented in the connection character set - the only three
possibilities I could see to avoid this are connecting with
UNICODE_FSS, NONE, or Peter Jacobi's UTF16BE implementation.

* If you connect as UNICODE_FSS, then the server will be unable to
return any data if the size in bytes of the UNICODE_FSS
representation exceeds the size in bytes of the source column. An
example would be a CHAR(3) CHARACTER SET ISO8859_1 that
contained "ììì" - The size in ISO8859_1 is 3 bytes, but the size in
UNICODE_FSS is 6 bytes. Internally the intermediate UCS2 buffer size
is fine, but Firebird then tries to transcode the UCS2 string into a
3byte UNICODE_FSS buffer, regardless of the buffer size specified in
the SQLVAR. I tried various changes to the server code to try to
work around this, but you eventually start bumping up against the
~32K max column and ~64K max row limits.

* As for connecting with UTF16BE - some of the isc_dsql functions
(ex: isc_dsql_prepare) doesn't appear to be written to accept
characters with embedded NULLs (although maybe I'm missing something
obvious here). Regardless, most of the size issues listed above for
UNICODE_FSS still apply.

* If you connect using character set NONE, things are initially a bit
more promising. Character data is always returned (as near as I can
tell) in the SQLVAR buffers "as-is" without transcoding, and you can
look at the sqlsubtype field of the SQLVAR to see what the actual
character set is. Since my app has pretty good transcoding support,
I'm perfectly happy with that. The problems start cropping up when I
try to insert character data. I have a need to be able to specify
Unicode literals in SQL - I can do that even when I'm connected using
NONE by sending in UTF8 strings, and using an "introducer" of
_UNICODE_FSS before each literal - not ideal, but workable.
Parameters are pretty ugly though - the current code appears to try
to transcode all incoming data from NONE into the target character
set, rather than treating them "as-is" like the fetch code. This
basically causes a transcoding error if a character above 127 is
specified as parameter data - a pretty extreme limitation.

It's a pretty simple patch to have the Firebird INTL code ignore the
transcoding when to CS or from CS is CS_dynamic==CS_NONE. The
problem is that I don't know whether this is how things should work,
or whether this potentially breaks something much more important. If
interested, the trivial change consists of:

diff -u -r1.32.2.1 intl.cpp
--- firebird2/src/jrd/intl.cpp 23 Nov 2003 17:13:44 -0000
1.32.2.1
+++ firebird2/src/jrd/intl.cpp 19 Dec 2003 16:56:31 -0000
@@ -666,6 +666,12 @@
from_cs = INTL_charset(tdbb, INTL_TTYPE(from), err);
to_cs = INTL_charset(tdbb, INTL_TTYPE(to), err);

+ // No conversion if going to/from ConnectionCS==NONE
+ if ((INTL_TTYPE(from) == CS_dynamic) && (from_cs == CS_NONE))
+ from_cs = to_cs;
+ else if ((INTL_TTYPE(to) == CS_dynamic) && (to_cs == CS_NONE))
+ to_cs = from_cs;
+
p = to->dsc_address;

/* Must convert dtype(cstring,text,vary) and ttype
(ascii,binary,..intl..) */

Although I suspect there are probably better ways to accomplish the
goal.

The ideal solution would be to have the sqlsubtype field of the
SQLVAR always be used, both for input and output, but from looking at
the code that isn't likely to happen anytime soon.

-So-

How is support for multiple character sets supposed to be handled?
More specifically, if you connect using character set NONE, what is
supposed to happen and how bad would it be to change the current
(inconsistent) behavior?



Thanks in advance for any enlightenment,
-Jim Beesley