firebird-support - Re: UTF8, malformed string error

Subject	Re: UTF8, malformed string error
Author	Roman Rokytskyy
Post date	2007-02-23T14:09Z

> But in the ideal case, it would be the IBO layer that does that? Or
> should I detect "UTF8" and do the conversion? How does the JayBird
> driver do this? Basically, I'm feeding it plain strings, not encoded
> ones.

In Java every string is Unicode (UCS-2 in older Java versions, UTF-16
since Java 5). In the setter method Jaybird checks the lc_ctype DPB
parameter and then tells Java to convert the Unicode string into a
byte array in the specified encoding. For performance reasons we have
created our own encoder for one-byte encodings (WIN1250, WIN1251,
ISO-8859-1 and so on).
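
To illustrate the idea, here is a minimal sketch in plain Java. It is
not Jaybird's actual code: the class EncodeForConnection and the
helpers javaCharsetFor and encode are hypothetical names, and the
mapping covers only two Firebird character sets. It only shows how a
Unicode string becomes a different byte array depending on the
connection character set.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class EncodeForConnection {

    // Hypothetical helper: map a Firebird character set name (the value
    // passed as lc_ctype) to a Java charset. Only two names are handled.
    static Charset javaCharsetFor(String lcCtype) {
        switch (lcCtype) {
            case "UTF8":
                return StandardCharsets.UTF_8;
            case "ISO8859_1":
                return StandardCharsets.ISO_8859_1;
            default:
                throw new IllegalArgumentException(
                        "Character set not covered by this sketch: " + lcCtype);
        }
    }

    // Convert the Unicode string into bytes in the connection encoding.
    static byte[] encode(String value, String lcCtype) {
        return value.getBytes(javaCharsetFor(lcCtype));
    }

    public static void main(String[] args) {
        String s = "\u00e9"; // the letter e with acute accent
        System.out.println(encode(s, "ISO8859_1").length); // prints 1
        System.out.println(encode(s, "UTF8").length);      // prints 2
    }
}
```

The same string yields one byte (0xE9) for ISO8859_1 and two bytes
(0xC3 0xA9) for UTF8, which is why the client has to encode according
to the character set it announced at connect time.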

So, yes, you're correct - it is IBO's job to handle the encoding
correctly when data are read via the [not-yet-existing-I-think]
property AsWideString. IBO has all the information it needs at the API
level (no need to query any RDB$XXX tables).

> Right, so in the case above, I'm getting a normal string in my
> TField definitions because Firebird translated it for me.

Correct. You have told Firebird to convert the data into the
ISO-8859-1 encoding, which happens to be the default encoding on your
Windows computer (Control Panel->Regional Settings->Extended
tab->Language for non-Unicode programs). Therefore Windows displayed
the characters correctly. But if you did the same with Russian
regional settings in Windows, you would see different characters.

> My applications support supplying the characterset, but they don't
> call any conversion routines, so what you're saying is that by using
> the charset when connecting, I'm telling Firebird that I'm sending
> UTF8 strings, while I'm not. Correct?

Right. If you set lc_ctype=UTF8, Firebird expects your CHAR/VARCHAR
content to be encoded in UTF-8. It then interprets the byte sequences
as characters, and in your case it found something that did not fit
the conversion table. For ASCII characters this works, since they
stay the same: the letter "a" has the code 0x0061 in UCS2, which is
converted to 0x61 in your default Windows encoding, which happens to
be the letter "a" in UTF-8 as well.
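
The effect can be reproduced outside Firebird. The sketch below uses
the standard Java CharsetDecoder to check whether a byte array is
well-formed UTF-8. It is not how Firebird validates input internally;
the class Utf8Check and the method isValidUtf8 are hypothetical names
chosen for the example.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

class Utf8Check {

    // Returns true only if the bytes form a well-formed UTF-8 sequence.
    static boolean isValidUtf8(byte[] bytes) {
        CharsetDecoder decoder = StandardCharsets.UTF_8.newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT);
        try {
            decoder.decode(ByteBuffer.wrap(bytes));
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        // Pure ASCII: the bytes are identical in ISO-8859-1 and UTF-8.
        byte[] ascii = "abc".getBytes(StandardCharsets.ISO_8859_1);
        // Non-ASCII text encoded as ISO-8859-1: ends with the single byte 0xE9.
        byte[] latin1 = "caf\u00e9".getBytes(StandardCharsets.ISO_8859_1);
        // The same text encoded as UTF-8: ends with 0xC3 0xA9.
        byte[] utf8 = "caf\u00e9".getBytes(StandardCharsets.UTF_8);

        System.out.println(isValidUtf8(ascii));  // prints true
        System.out.println(isValidUtf8(latin1)); // prints false
        System.out.println(isValidUtf8(utf8));   // prints true
    }
}
```

The byte 0xE9 on its own starts a three-byte UTF-8 sequence that never
completes, so a strict decoder rejects it. That is the same kind of
input that gets reported as a malformed string when ISO-8859-1 bytes
are sent over a connection declared as UTF8.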

Roman