Subject | Re: [Firebird-Java] Re: lc_ctype support
---|---
Author | Marczisovszky Daniel
Post date | 2002-04-08T21:48:55Z
r> hmmm... seems to me that I don't completely understand your problem...
r> Why cannot you:
r> a) use UNICODE_FSS everywhere?
r> b) use NONE everywhere?
The reason is quite simple, although in theory that would be a solution:
the Hungarian collation is supported ONLY if I use WIN1250. This is
true for many other languages as well. There is no way to perform an
ORDER BY correctly in Hungarian without using WIN1250, and the same
applies to other languages.
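For illustration, here is a minimal sketch of what I mean. The connection
URL, the lc_ctype property name and the PXW_HUNDC collation name are
assumptions and may need adjusting for your driver and server version:

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;
import java.util.Properties;

public class HungarianOrder {
    public static void main(String[] args) throws Exception {
        Class.forName("org.firebirdsql.jdbc.FBDriver");

        Properties props = new Properties();
        props.setProperty("user", "sysdba");
        props.setProperty("password", "masterkey");
        // property name for the connection character set is an assumption
        props.setProperty("lc_ctype", "WIN1250");

        Connection con = DriverManager.getConnection(
                "jdbc:firebirdsql:localhost/3050:/data/test.gdb", props);
        Statement stmt = con.createStatement();

        // Hungarian collation is only available for WIN1250 columns;
        // PXW_HUNDC is the Hungarian dictionary collation (assumed name).
        stmt.execute("CREATE TABLE people (name VARCHAR(50)"
                + " CHARACTER SET WIN1250 COLLATE PXW_HUNDC)");

        // This ORDER BY follows the Hungarian alphabet, which neither
        // NONE nor UNICODE_FSS can provide.
        ResultSet rs = stmt.executeQuery("SELECT name FROM people ORDER BY name");
        while (rs.next()) {
            System.out.println(rs.getString(1));
        }
        con.close();
    }
}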
r> (here "everywhere" means in servlet container, jdbc connection,
r> database default character set, char and varchar column definitions)
Although servlet containers support Unicode internally, the problem
is that a lot of software (the container itself, many JDBC drivers,
and therefore many data sources and data destinations) communicates
with the world using 8-bit bytes. Because many browsers do not declare
an encoding at all, there is no way to decide which encoding the form
data was written in. The same holds for many SQL servers: as they do
not support Unicode internally, their JDBC drivers expect 8-bit
encoded characters, so the highest character code can be 0xFF. 0x151
will be replaced by a ? when you try to write it to such a stream.
r> Well... I have not too much experience with unicode in Java, but the
r> problem you're describing seems to be either in JDK (that is not able
r> to translate chars between charsets correctly) or in
r> InterBase/Firebird that has no support for \u0151 or \u00f5
r> characters.
The JDK is able to convert. When you read byte 0xF5 from a stream with
ISO-8859-2 encoding, you will get character 0x151 in the Java string -
but this character is only one example!
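Just to show that the JDK itself does the right thing, here is a tiny
sketch (plain JDK, nothing driver-specific):

import java.io.UnsupportedEncodingException;

public class DecodeDemo {
    public static void main(String[] args) throws UnsupportedEncodingException {
        byte[] raw = { (byte) 0xF5 };

        // The same byte becomes two different Unicode characters,
        // depending on which encoding you assume for the input.
        String latin2 = new String(raw, "ISO-8859-2"); // "\u0151", o with double acute
        String latin1 = new String(raw, "ISO-8859-1"); // "\u00F5", o with tilde

        System.out.println(Integer.toHexString(latin2.charAt(0))); // prints 151
        System.out.println(Integer.toHexString(latin1.charAt(0))); // prints f5
    }
}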
The matter of Unicode with Java, in my eyes, is the following: Java uses
Unicode internally, but when it has to communicate with the outside
world (read from a file, write to a stream) it converts the Unicode
characters to bytes in the range 0x00 - 0xFF according to the given
encoding, unless you use UTF-8 or UTF-16 - but sadly, not much software
is able to understand those.
The same applies to Firebird, if I understand the communication protocol
correctly. The JDBC driver sends and receives strings as 8-bit bytes,
not as UTF-8 or 16-bit Unicode characters. The problem is that the
JDBC driver cannot be sure how a String was encoded. It's quite
rare that a string contains letters from many languages, but one cannot
determine whether a string was decoded from ISO-8859-1 or ISO-8859-2 data.
If I have character 0x151 in a string that I want to write to the database,
then one (or the JDBC driver) should not assume that the string is
encoded with the encoding that the connection uses. Maybe the string's
encoding history in the Java program has absolutely nothing to do with
the lc_ctype in Firebird.
To make it clearer, here is an example:
lc_ctype is set to WIN1250, which means FBStringField will use
ISO-8859-2.
When I want to write the string 0xF5 (a little, one-character-long
string) to the database, setString will call
getBytes(value, getIscEncoding());
where getIscEncoding returns ISO-8859-2. As there is no 0xF5 character
in ISO-8859-2, getBytes will produce byte 63, which is a question mark
(every character unknown to the target encoding is replaced by a
question mark...). This means one will never be able to recover the
original value of that string. setString will work correctly if and only
if the original string was decoded as ISO-8859-2, i.e. it contains not
0xF5 but 0x151.
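You can reproduce the effect without the driver, using only
String.getBytes (a sketch; I hard-code ISO-8859-2 where the driver would
use getIscEncoding()):

import java.io.UnsupportedEncodingException;

public class EncodeDemo {
    public static void main(String[] args) throws UnsupportedEncodingException {
        // \u00F5 (o with tilde) does not exist in ISO-8859-2, so it is
        // silently replaced by '?' (byte 63) and cannot be recovered.
        byte[] lost = "\u00F5".getBytes("ISO-8859-2");
        System.out.println(lost[0]);          // prints 63

        // \u0151 (o with double acute) does exist in ISO-8859-2,
        // so it survives as byte 0xF5.
        byte[] kept = "\u0151".getBytes("ISO-8859-2");
        System.out.println(kept[0] & 0xFF);   // prints 245 (0xF5)
    }
}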
IMHO, the driver should not assume this (OK, it may assume it by default,
because strings are probably encoded with the same encoding as the
database), but you should not force it, as data will be lost. My
original 0xF5 will be replaced with a damned question mark.
What does that mean? One is not able to write a string to the database
if it was not encoded with ISO-8859-2, i.e. it may contain only those
Unicode characters that are valid in ISO-8859-2. Any characters in the
string that don't belong to this set will be converted to ?.
But let's say I want to write a string that originally came from the DOS
852 codepage. I won't be able to do that, because getBytes will kill
many Unicode characters: it will use ISO-8859-2 for the conversion, even
though the String contains characters that are valid only in DOS 852.
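A concrete case (the charset name for the DOS codepage may be "Cp852" or
"IBM852" depending on the JDK):

import java.io.UnsupportedEncodingException;

public class Dos852Demo {
    public static void main(String[] args) throws UnsupportedEncodingException {
        // \u2591 (light shade block) exists in DOS codepage 852
        // but has no counterpart in ISO-8859-2.
        String s = "\u2591";

        byte[] dos = s.getBytes("Cp852");         // a real byte value
        byte[] latin2 = s.getBytes("ISO-8859-2"); // replaced by '?'

        System.out.println(Integer.toHexString(dos[0] & 0xFF));    // a valid code
        System.out.println(Integer.toHexString(latin2[0] & 0xFF)); // prints 3f
    }
}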
Let's look at the other side: if the String has already been converted to
8-bit form (which means every language-specific Unicode character has
been mapped back below 0xFF), then I will be able to write it to the
database, provided getBytes does not use getIscEncoding(). This works
because InterBase is also software that communicates with the world
using 8-bit bytes and not Unicode.
When you set any encoding for getBytes, only those characters that are
valid in that encoding will be converted to a valid byte. Every other
character will be replaced by ?. I think this is the same situation as
when Alejandro added automatic encoding detection: the driver (or any
other software) may guess the encoding (e.g. from your default system
locale or from lc_ctype), but it *should* not force the developer to use
it, because only the developer really knows which encoding is used in
his strings.
r> My personal opinion is: we should not perform any conversion in the
r> driver that goes behind the Java and InterBase/Firebird conversions.
r> If there are bugs in JDK or DBMS they should be fixed there.
This is not a bug in the JDK, but we may call it a bug in the DBMS. But
is it really a bug? How do you specify a collation order for Unicode?
The order of code points in Unicode is not a real collation order.
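That is exactly what java.text.Collator is for; a small sketch with
Hungarian words:

import java.text.Collator;
import java.util.Locale;

public class CollationDemo {
    public static void main(String[] args) {
        String ar = "\u00E1r";   // "ár"
        String zar = "z\u00E1r"; // "zár"

        // Raw code-point order: '\u00E1' (0xE1) > 'z' (0x7A),
        // so "ár" would sort after "zár" - clearly wrong in Hungarian.
        System.out.println(ar.compareTo(zar) > 0);   // prints true

        // A locale-aware collator puts "ár" before "zár", as a dictionary would.
        Collator hu = Collator.getInstance(new Locale("hu", "HU"));
        System.out.println(hu.compare(ar, zar) < 0); // prints true
    }
}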
r> I'll try to read your letter again tomorrow (it's 23:00 right now and
r> I'm not in good condition).
Well, same for me, it's almost midnight...
r> Best regards,
r> Roman Rokytskyy
Best wishes,
Daniel