Subject: Re: [Firebird-Java] Re: lc_ctype support
Author: Marczisovszky Daniel
Post date: 2002-04-08T23:03:23Z
Yes, they work, but only because there is a bug in the driver, so it
actually performs no translation:
In FBField.java at line 396
if (iscEncoding != null && !iscEncoding.equalsIgnoreCase("NONE"))
    FBConnectionHelper.getJavaEncoding(iscEncoding);
should be replaced with this:
if (iscEncoding != null && !iscEncoding.equalsIgnoreCase("NONE"))
    javaEncoding = FBConnectionHelper.getJavaEncoding(iscEncoding);
otherwise javaEncoding will always be null, so no encoding will be
used.
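Just for illustration, here is my guess at the kind of lookup
FBConnectionHelper.getJavaEncoding performs; the mapping table below is only
an assumption of mine covering the three code pages mentioned in this mail,
not the driver's real code. The point is that the result of the call must
actually be assigned to javaEncoding:

import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-in for FBConnectionHelper.getJavaEncoding, only to
// illustrate the Firebird-name -> Java-name lookup assumed in this mail.
public class EncodingMapSketch {
    private static final Map<String, String> ISC_TO_JAVA = new HashMap<>();
    static {
        ISC_TO_JAVA.put("WIN1250", "Cp1250"); // Central European
        ISC_TO_JAVA.put("WIN1251", "Cp1251"); // Cyrillic
        ISC_TO_JAVA.put("WIN1252", "Cp1252"); // Western European
    }

    public static String getJavaEncoding(String iscEncoding) {
        return ISC_TO_JAVA.get(iscEncoding.toUpperCase());
    }

    public static void main(String[] args) {
        String iscEncoding = "WIN1250";
        String javaEncoding = null;
        if (iscEncoding != null && !iscEncoding.equalsIgnoreCase("NONE"))
            // without this assignment the result is thrown away and
            // javaEncoding stays null, which is exactly the bug above
            javaEncoding = getJavaEncoding(iscEncoding);
        System.out.println("javaEncoding = " + javaEncoding); // prints Cp1250
    }
}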
Before you correct this, please create a table with the WIN1250
character set and try this:
PreparedStatement pst = conn.prepareStatement("insert into honap (hosszunev) values (?)");
pst.setString(1, "õrült");
pst.executeUpdate();
Note that in the second line the first character is 0xF5, so it may be
replaced with \u00F5 in the source code.
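For completeness, a stand-alone version of this test could look like the
following sketch; the driver class, JDBC URL, database path, user, password
and the lc_ctype connection property are only my assumptions about a typical
JayBird setup, while the table and column names are the ones from the
statement above:

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.util.Properties;

public class Win1250InsertTest {
    public static void main(String[] args) throws Exception {
        Class.forName("org.firebirdsql.jdbc.FBDriver"); // assumed driver class

        Properties props = new Properties();
        props.setProperty("user", "SYSDBA");        // assumed credentials
        props.setProperty("password", "masterkey"); // assumed credentials
        props.setProperty("lc_ctype", "WIN1250");   // connection character set

        // assumed JayBird-style URL; adjust host, port and database path
        Connection conn = DriverManager.getConnection(
                "jdbc:firebirdsql:localhost/3050:/data/test.gdb", props);
        try {
            PreparedStatement pst = conn.prepareStatement(
                    "insert into honap (hosszunev) values (?)");
            pst.setString(1, "\u00F5r\u00FClt"); // "õrült", first character is 0xF5
            pst.executeUpdate();
            pst.close();
        } finally {
            conn.close();
        }
    }
}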
Please also run this after you have corrected the bug. You will see that
the first character is replaced by a question mark.
After you have corrected this, you will see very interesting results.
Your German test will work fine. Why? Because there is effectively still
no translation: every character in your German test has the same code in
Unicode as in the single-byte encoding.
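You can check this with the plain JDK, no driver involved: for characters in
the Latin-1 range, such as the German umlauts, the Unicode code point and the
Cp1252 byte value are the same number, so the encode/decode round trip changes
nothing. A small sketch:

import java.io.UnsupportedEncodingException;

public class Latin1RoundTrip {
    public static void main(String[] args) throws UnsupportedEncodingException {
        String german = "\u00E4\u00F6\u00FC\u00DF"; // "äöüß"
        byte[] bytes = german.getBytes("Cp1252");
        for (int i = 0; i < german.length(); i++) {
            // the code point (e.g. 0xE4) equals the unsigned byte value (0xE4)
            System.out.println(Integer.toHexString(german.charAt(i)) + " -> "
                    + Integer.toHexString(bytes[i] & 0xFF));
        }
        // decoding with Cp1252 gives back the identical string
        System.out.println(german.equals(new String(bytes, "Cp1252"))); // true
    }
}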
But your Ukrainian test will not work at all. First you will get an
arithmetic exception. If you write only to the Unicode field, you will
see that many of your characters are lost.
Why? Because you have fallen into the same trap as many people around the
world: you are trying to communicate with raw single-byte ("ASCII")
character values instead of Unicode. I hope this makes the problem
absolutely clear:
public static String UKRAINIAN_TEST_STRING_WIN1251 =
        "\u00f2\u00e5\u00f1\u00f2\u00ee\u00e2\u00e0 " +
        "\u00f1\u00f2\u00f0\u00b3\u00f7\u00ea\u00e0";
Almost none of the characters in this string is valid in Cp1252, because
they are raw byte values rather than the real Unicode characters. You will
be able to write this string to the database correctly only if you specify
the Unicode code points for these characters; that way getBytes() will
translate them back to the "good" single-byte codes instead of question
marks. As they stand, the characters in this string have no meaning when
you use Cp1252, and they will be converted to question marks.
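To make this concrete, here is a small stand-alone sketch (plain JDK, no
driver involved), assuming the character values really are WIN1251 byte
values, as the constant's name suggests: decoding those bytes with Cp1251
yields the real Unicode text, which survives a Cp1251 round trip, while the
byte-valued string itself loses characters as soon as it is encoded:

import java.io.UnsupportedEncodingException;

public class UkrainianEncodingDemo {
    // the string from the test: raw WIN1251 byte values stored as characters
    static final String BYTE_VALUED =
            "\u00f2\u00e5\u00f1\u00f2\u00ee\u00e2\u00e0 " +
            "\u00f1\u00f2\u00f0\u00b3\u00f7\u00ea\u00e0";

    public static void main(String[] args) throws UnsupportedEncodingException {
        // turn the char values back into the bytes they really represent...
        byte[] win1251Bytes = new byte[BYTE_VALUED.length()];
        for (int i = 0; i < BYTE_VALUED.length(); i++)
            win1251Bytes[i] = (byte) BYTE_VALUED.charAt(i);

        // ...and decode them as Cp1251: this is the real Unicode text
        String unicode = new String(win1251Bytes, "Cp1251");
        System.out.println(unicode);

        // the Unicode form survives a Cp1251 round trip...
        System.out.println(unicode.equals(
                new String(unicode.getBytes("Cp1251"), "Cp1251"))); // true

        // ...while the byte-valued form loses characters, because e.g.
        // \u00f2 ('ò') does not exist in Cp1251 and becomes '?'
        System.out.println(new String(BYTE_VALUED.getBytes("Cp1251"), "Cp1251"));
    }
}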
There is also one more bug that I realized when I saw your encoding
test. What happens when there are fields with different encodings in
the database? The driver uses the encoding derived from the lc_ctype of
the *connection*, not the encoding of the field. This means that if you
use WIN1250 for a connection, you will only be able to write correctly
to fields that have the WIN1250 character set. If you try to write to a
field that has, let's say, WIN1252, then a few of your national
characters will be replaced by ?, because the string will be encoded to
a byte array using Cp1250 and not Cp1252.
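A small stand-alone illustration of this field-versus-connection mismatch,
again with plain getBytes() and no driver involved (the two characters are
just examples I picked): 'õ' (\u00F5) exists in Cp1252 but not in Cp1250, and
'ő' (\u0151) exists in Cp1250 but not in Cp1252, so encoding either one with
the other code page produces a question mark:

import java.io.UnsupportedEncodingException;

public class WrongCodePageDemo {
    public static void main(String[] args) throws UnsupportedEncodingException {
        String westernOnly = "\u00F5"; // 'õ': in Cp1252 only
        String centralOnly = "\u0151"; // 'ő': in Cp1250 only

        // encoding with the matching code page keeps the character (byte 0xF5)...
        System.out.println(Integer.toHexString(
                westernOnly.getBytes("Cp1252")[0] & 0xFF)); // f5
        System.out.println(Integer.toHexString(
                centralOnly.getBytes("Cp1250")[0] & 0xFF)); // f5

        // ...but encoding with the other code page yields '?' (0x3F), which is
        // what happens when the connection's lc_ctype does not match the
        // character set of the column being written
        System.out.println(Integer.toHexString(
                westernOnly.getBytes("Cp1250")[0] & 0xFF)); // 3f
        System.out.println(Integer.toHexString(
                centralOnly.getBytes("Cp1252")[0] & 0xFF)); // 3f
    }
}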
I know this is a very complex and complicated issue. But I think the
examples I have shown in the last two hours make it clear that automatic
character encoding should only be an optional extra feature of the
driver: one may want to switch it off completely, otherwise you won't be
able to handle strings that are already in their single-byte form, nor
tables that contain fields with different character sets.
Best wishes,
Daniel