firebird-java - Re: [Firebird-Java] Urgend character set problem

Subject	Re: [Firebird-Java] Urgend character set problem
Author	Roman Rokytskyy
Post date	2006-04-20T21:04:05Z

Hi,

Let's continue in the list, since your topic seems to be a good example of
charset handling. Also I assume that you use charSet property, since our
properties are case sensitive (btw, should we make them case insensitive?).

> When I keep the datasource parameters at charset="ISO-8859-1" and
> encoding="ISO8859_1" (which match the database) and have everything in
> java
> / tomcat / html changed to UTF-8, then german umlauts (such as "ЭД" - in
> html ü ä) work as expected.

Yup, that is ok.

> However typographic quotations marks (such as "└⌠" - in html „
> “) don't. Loading them from the database results in (int)
> string.charAt() == 132 / 147. However output to the html document
> (utf-8) trashes them.

Which is also ok. And here is why. If you check the following URL
(http://www.unicode.org/charts/PDF/U0080.pdf) you will see that these two
characters belong to the control area. Those are not quotation marks.

> Getting
> them from input from the browser, I get (int) string.charAt() == 8220 /
> 8222. However storing them to the database trashes them.

Is somehow "correct", since Java cannot convert Unicode character #201C into
ISO-Latin-1 charset. Question is why you get #0084 out of that instead of
#003F (question mark)... but you can be sure that they will be trashed.

> Setting charset="utf-8" and encoding="iso8859_1" trashes all special
> characters.

This is a bit strange, umlauts should be untouched - they are the same in
UTF-8 and ISO-Latin-1... Strictly speaking, we should check this.

> Whenever I set encoding="unicode_fss" (with unchanged database) and have
> charset="utf-8" or charset="iso-8859-1", I get
> "org.firebirdsql.jdbc.FBSQLException: GDS Exception. 335544321. arithmetic
> exception, numeric overflow, or string truncation" when loading data from
> the database."

Which is correct, since Firebird tries to map #0084 into some Unicode
character and fails because it belongs to the control area. Same if you set
charSet="ISO-8859-1", #0084 is not there.

> First thank you very much for that offer.
> However, as time went on, I needed to restore a backup (loosing about 10
> days' work, but restoring the whole productive system to iso-8859-1 and
> therefor working more or less), so now it is not that urgend anymore. It
> however still stalls any progress.

Ok.

> Second, you seem to have mistaken me for someone. I certainly do not have
> your handy number.

I though some time ago I sent it to you... anyway, I will do this in
separate email.

> I must agree that I am not sure, how it should work at all.
> I always thought java would keep strings in memory always in the same
> encoding (utf-8) and would only translate that to other charsets on input
> /
> output. Given that, it would be the driver to tell java upon string
> creation
> to do the translation. However this seems not to work.
>
> My biggest problem is to not know where to look.
>
> I would appeciate any help very much.

Your problem is that you try to store in the ISO-Latin-1 database characters
that do not belong there. Any combination of normal encoding and charSet
properties will fail.

The only workaround, again, workaround, not a solution, is to use the
"trash-in, trash-out" technique. In this case you set encoding="NONE", which
tells Firebird not to interpret the contents of the string and simply
transfer it to the field; and charSet="UTF-8", saying Jaybird to convert
Java strings into byte[] using the UTF-8 encoding scheme.

Now why this is a workaround. This will work ok, if you use Firebird as a
storage only. But as soon as you try to do charset conversion (e.g. connect
using some other client with lc_ctype other than NONE or OCTETS) you will
get your "Failed to transliterate characters between character sets".

The correct solution is to convert database into UNICODE_FSS or UTF8 (in FB
2.0) database. But before you do this, you have to do a cleanup - go through
all records in the database and remove from them characters that do not
belong to ISO-Latin-1 charset. Or you simply ensure that non-ISO-Latin-1
characters are not stored in the database.

Hope this helps.

Roman