firebird-python - Character set issues

Subject	Character set issues
Author	peter_jacobi.rm
Post date	2008-10-06T23:53:01Z

Dear All,

After a long interruption I'm back at Firebird and kinterbasdb, and
with that also back at character set troubles, my favorite pastime.

My test setup is WinXP, Firebird 1.5.5, Python 2.5.2 and kinterbasdb
3.3pre (20070503).

I'm initializing kinterbasdb with type_conv = 200, which is the only
sensible choice in this setup, if I'm not mistaken.

Now, after some tests, I'd like to share some observations and
suggestions. I'll start with the sane ones and save the not-so-sane
ones for another posting.

== CHARACTER SET ASCII isn't handled correctly ==

kinterbasdb will not accept Unicode input for ASCII database fields,
even though that conversion is no less well defined than the
conversion to any other character set.

It's correct to fail on Unicode to NONE and Unicode to OCTETS
conversions, as they are ill-defined. But this argument doesn't apply
to ASCII. And with Unicode becoming the default string type in future
Python versions, IMHO a change is needed.
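
To illustrate that Unicode to ASCII is a well-defined conversion, here
is a minimal sketch using only Python's built-in codecs. It does not
touch kinterbasdb or Firebird; the helper name to_ascii_bytes is made
up for this example, and the syntax assumes a current Python 3 rather
than the 2.5.2 of my test setup.

```python
# Sketch only: plain Python codecs, no kinterbasdb or Firebird involved.
# to_ascii_bytes is a hypothetical helper, not part of any library.

def to_ascii_bytes(text):
    """Encode text for an ASCII column, failing loudly on non-ASCII input."""
    # str.encode raises UnicodeEncodeError for any code point above U+007F.
    return text.encode("ascii")

# Pure ASCII text converts without ambiguity.
ok = to_ascii_bytes("Firebird")

# Non-ASCII text is rejected with a clear, well-defined error.
try:
    to_ascii_bytes("\u00f6")  # small o-umlaut
    rejected = False
except UnicodeEncodeError:
    rejected = True
```

Either outcome is deterministic, which is the difference to NONE and
OCTETS, where no mapping from Unicode exists at all.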

== The connection character set is ignored ==

kinterbasdb is willing to set a connection character set, but it will
ignore it for its own transcoding!

This implies that only a connection character set of NONE, or one
equal to (or a superset of) all column character sets, will work.

In all other cases, character set transcoding by Firebird will occur
without being taken into account by kinterbasdb.

Perhaps an example makes this more explicit:

* Column character set is ISO8859_1
* Connection character set is set to DOS850

The sequence of events:
(1) Python wants to store a string consisting of a single character
U+00F6 (small o-umlaut).
(2) kinterbasdb matches ISO8859_1 (the column character set) to
Python's character set of the same name (and fortunately same
semantics) and calls unicodeString.encode(pyEncodingName). This will
result in an 8-bit string holding a single byte of 0xF6.
(3) Firebird, trusting the specified connection character set DOS850,
interprets that byte as DOS850 and transcodes it to an intermediate
Unicode representation, yielding U+00F7 (division sign), and then to
the column character set, resulting in 0xF7 being stored.
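
The three steps can be reproduced with Python's own codecs. This is
only a simulation of the sequence described above, not kinterbasdb or
Firebird code: the function simulate_store and the constant names are
made up for the example, and I'm assuming Python's cp850 codec matches
Firebird's DOS850 (Python 3 syntax).

```python
# Hypothetical simulation with Python's codecs; no database involved.
COLUMN_CODEC = "iso8859_1"   # Python codec standing in for ISO8859_1
CONNECTION_CODEC = "cp850"   # Python codec assumed to match DOS850

def simulate_store(text):
    """Replay steps (2) and (3) and return each intermediate value."""
    # Step (2): kinterbasdb encodes with the COLUMN character set.
    sent = text.encode(COLUMN_CODEC)
    # Step (3a): Firebird interprets the bytes as the CONNECTION character set.
    interpreted = sent.decode(CONNECTION_CODEC)
    # Step (3b): Firebird transcodes to the column character set for storage.
    stored = interpreted.encode(COLUMN_CODEC)
    return sent, interpreted, stored

sent, interpreted, stored = simulate_store("\u00f6")  # small o-umlaut
print(sent, hex(ord(interpreted)), stored)
```

With the o-umlaut as input, sent is the single byte 0xF6, interpreted
is U+00F7 and stored is the single byte 0xF7, matching the sequence
above.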

Perversely, when reading from the database with the same setup, no
problem will be seen (in this case), as the faulty sequence of
transcodings occurs in reverse and the damage is undone (but IMHO
this only makes it worse, because the wrong data in the database goes
unnoticed).
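
The reverse path can be simulated the same way. Again this is only a
sketch with Python's codecs under the same assumptions (cp850 standing
in for DOS850, made-up names, Python 3 syntax); it also shows what a
reader that decodes the stored byte by the column character set alone
would get.

```python
# Hypothetical simulation of the read path; no database involved.
COLUMN_CODEC = "iso8859_1"   # Python codec standing in for ISO8859_1
CONNECTION_CODEC = "cp850"   # Python codec assumed to match DOS850

def simulate_fetch(stored):
    """Replay the faulty transcoding sequence in reverse."""
    # Firebird decodes the stored bytes using the column character set ...
    as_unicode = stored.decode(COLUMN_CODEC)
    # ... and transcodes them to the connection character set for the wire.
    on_wire = as_unicode.encode(CONNECTION_CODEC)
    # kinterbasdb ignores the connection character set and decodes
    # the wire bytes with the column character set instead.
    return on_wire.decode(COLUMN_CODEC)

stored = b"\xf7"                      # what ended up in the column
fetched = simulate_fetch(stored)      # what the same setup reads back
actual = stored.decode(COLUMN_CODEC)  # what the column really contains
```

Here fetched is U+00F6 again, so the round trip looks fine, while
actual is U+00F7, the division sign.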

Regards,
Peter