Subject | Re: [Firebird-Java] Urgend character set problem |
---|---|
Author | Roman Rokytskyy |
Post date | 2006-04-20T21:04:05Z |
Hi,
Let's continue in the list, since your topic seems to be a good example of
charset handling. Also I assume that you use the charSet property, since our
properties are case sensitive (btw, should we make them case insensitive?).
> When I keep the datasource parameters at charset="ISO-8859-1" and
> encoding="ISO8859_1" (which match the database) and have everything in
> java / tomcat / html changed to UTF-8, then German umlauts (such as "üä" -
> in html &uuml; &auml;) work as expected.

Yup, that is ok.

> However typographic quotation marks (such as "„“" - in html &bdquo;
> &ldquo;) don't. Loading them from the database results in (int)
> string.charAt() == 132 / 147. However output to the html document
> (utf-8) trashes them.

Which is also ok. And here is why. If you check the following URL
(http://www.unicode.org/charts/PDF/U0080.pdf) you will see that these two
characters belong to the control area. Those are not quotation marks.
> Getting them from input from the browser, I get (int) string.charAt() ==
> 8220 / 8222. However storing them to the database trashes them.

That is somehow "correct", since Java cannot convert Unicode character #201C
into the ISO-Latin-1 charset. The question is why you get #0084 out of that
instead of #003F (question mark)... but you can be sure that they will be
trashed.
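To make the point about #201C, #003F and #0084 concrete, here is a small sketch in plain Java. The class and method names are invented for this illustration and nothing in it comes from Jaybird; it only uses the standard `java.nio.charset` classes. It shows that ISO-Latin-1 has no slot for the typographic quote, that `String.getBytes` substitutes a question mark for it, and that byte 132 decoded as ISO-Latin-1 lands in the control area.

```java
import java.nio.charset.CharsetEncoder;
import java.nio.charset.StandardCharsets;

// Illustrative helper, not part of Jaybird.
final class Latin1RoundTrip {

    /** True if ISO-8859-1 has a byte for this character. */
    static boolean fitsLatin1(char c) {
        CharsetEncoder encoder = StandardCharsets.ISO_8859_1.newEncoder();
        return encoder.canEncode(c);
    }

    /** Encodes with String.getBytes and returns the first byte as an unsigned value. */
    static int firstByteAsLatin1(String s) {
        return s.getBytes(StandardCharsets.ISO_8859_1)[0] & 0xFF;
    }

    /** Decodes one byte as ISO-8859-1 and returns the resulting character code. */
    static int decodeLatin1(int b) {
        return new String(new byte[] {(byte) b}, StandardCharsets.ISO_8859_1).charAt(0);
    }

    public static void main(String[] args) {
        System.out.println(fitsLatin1('\u00FC'));         // u-umlaut: true
        System.out.println(fitsLatin1('\u201C'));         // left double quote: false
        System.out.println(firstByteAsLatin1("\u201C"));  // 63, i.e. #003F
        int decoded = decodeLatin1(0x84);
        System.out.println(decoded);                      // 132, i.e. #0084
        System.out.println(Character.isISOControl(decoded)); // true
    }
}
```

This only demonstrates the Java side of the conversion; it does not explain why the driver produced #0084 rather than #003F in the reported case.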
> Setting charset="utf-8" and encoding="iso8859_1" trashes all special
> characters.

This is a bit strange, umlauts should be untouched - they are the same in
UTF-8 and ISO-Latin-1... Strictly speaking, we should check this.
> Whenever I set encoding="unicode_fss" (with unchanged database) and have
> charset="utf-8" or charset="iso-8859-1", I get
> "org.firebirdsql.jdbc.FBSQLException: GDS Exception. 335544321. arithmetic
> exception, numeric overflow, or string truncation" when loading data from
> the database.

Which is correct, since Firebird tries to map #0084 into some Unicode
character and fails because it belongs to the control area. Same if you set
charSet="ISO-8859-1", #0084 is not there.
> First thank you very much for that offer.
> However, as time went on, I needed to restore a backup (losing about 10
> days' work, but restoring the whole productive system to iso-8859-1 and
> therefore working more or less), so now it is not that urgent anymore. It
> however still stalls any progress.

Ok.
> Second, you seem to have mistaken me for someone. I certainly do not have
> your handy number.

I thought some time ago I sent it to you... anyway, I will do this in a
separate email.
> I must agree that I am not sure, how it should work at all.
> I always thought java would keep strings in memory always in the same
> encoding (utf-8) and would only translate that to other charsets on input
> / output. Given that, it would be the driver to tell java upon string
> creation to do the translation. However this seems not to work.
>
> My biggest problem is to not know where to look.
>
> I would appreciate any help very much.

Your problem is that you try to store in the ISO-Latin-1 database characters
that do not belong there. Any combination of normal encoding and charSet
properties will fail.

The only workaround (and it is a workaround, not a solution) is to use the
"trash-in, trash-out" technique. In this case you set encoding="NONE", which
tells Firebird not to interpret the contents of the string and simply
transfer it to the field; and charSet="UTF-8", which tells Jaybird to convert
Java strings into byte[] using the UTF-8 encoding scheme.
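
The "trash-in, trash-out" settings could be passed roughly like this. It is a sketch under assumptions: the class name, the credentials and the database URL are placeholders, and the actual connect call is left as a comment because it needs Jaybird on the classpath and a running server. Only the `encoding` and `charSet` property names come from the discussion above.

```java
import java.util.Properties;

// Illustrative helper; not part of Jaybird.
final class TrashInTrashOut {

    /** Builds connection properties for the encoding=NONE / charSet=UTF-8 workaround. */
    static Properties workaroundProperties(String user, String password) {
        Properties props = new Properties();
        props.setProperty("user", user);
        props.setProperty("password", password);
        // Firebird side: do not interpret the bytes, just store them.
        props.setProperty("encoding", "NONE");
        // Jaybird side: convert Java strings to byte[] using UTF-8.
        // The property name is case sensitive.
        props.setProperty("charSet", "UTF-8");
        return props;
    }

    public static void main(String[] args) {
        Properties props = workaroundProperties("SYSDBA", "masterkey");
        // Placeholder URL; adjust host, port and path to your setup.
        String url = "jdbc:firebirdsql:localhost/3050:/path/to/database.fdb";
        // With the driver available one would then call:
        // java.sql.Connection c = java.sql.DriverManager.getConnection(url, props);
        System.out.println(url + " encoding=" + props.getProperty("encoding")
                + " charSet=" + props.getProperty("charSet"));
    }
}
```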

Now, why is this a workaround? It will work ok if you use Firebird as
storage only. But as soon as you try to do charset conversion (e.g. connect
using some other client with lc_ctype other than NONE or OCTETS) you will
get your "Failed to transliterate characters between character sets".

The correct solution is to convert the database into a UNICODE_FSS or UTF8 (in
FB 2.0) database. But before you do this, you have to do a cleanup: go through
all records in the database and remove from them characters that do not
belong to the ISO-Latin-1 charset. Or you simply ensure that non-ISO-Latin-1
characters are not stored in the database.
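
The per-string part of such a cleanup might look like the sketch below. The names are invented for the example, and the rule it applies is an assumption: it drops everything above #00FF and everything in the control areas (keeping tab, CR and LF), which is stricter than "not in ISO-Latin-1" but matches the #0084 / #0093 case discussed above. Reading and updating the records themselves is left out.

```java
// Illustrative helper; the filtering rule is an assumption, see the text above.
final class Latin1Cleanup {

    /** True for characters kept by the cleanup: printable ISO-Latin-1 plus tab, CR, LF. */
    static boolean isCleanLatin1(char c) {
        if (c == '\t' || c == '\n' || c == '\r') {
            return true;
        }
        if (c > 0xFF) {
            return false; // outside ISO-Latin-1 entirely, e.g. #201C
        }
        return !Character.isISOControl(c); // rejects the control areas, e.g. #0084
    }

    /** Returns a copy of the input with all rejected characters removed. */
    static String stripNonLatin1(String s) {
        StringBuilder sb = new StringBuilder(s.length());
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            if (isCleanLatin1(c)) {
                sb.append(c);
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // The two stray control characters are removed, the umlaut stays.
        System.out.println(stripNonLatin1("\u0084Gr\u00FC\u00DFe\u0093"));
    }
}
```

Removing the characters loses information; if the stray bytes are known to be Windows-1252 quotation marks, mapping them to plain quotes instead may be preferable, but that is a decision for whoever owns the data.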
Hope this helps.
Roman