Subject | Re: [Firebird-Java] lc_ctype support |
---|---|
Author | Marczisovszky Daniel |
Post date | 2002-04-08T20:40:31Z |
Hi,
I'll check the new sources soon; you should know I have not seen them
yet, but I want to share my thoughts on this question. If I write
something that is not clear, please ask, as this is the most
complicated area in computing (I mean encodings).
First of all (as usual) I have a big problem with the new code. We can
argue about it; in fact, I wish to argue about it. It will be a bit
hard to explain, as there is no way to write Unicode characters in an
email, but I'll try.
In the Hungarian language, for example, we have a character that
looks like this (its Unicode code point is 0x151):
X X
X X
X X
XXXXX
X X
X X
X X
XXXXX
But for a long time (until Unicode fonts and Unicode-aware operating
systems and languages turned up) we had to use Unicode 0xF5, which is
really the ISO-8859-1 (Latin-1) character 0xF5 (ASCII proper stops at
0x7F), and looks like this:
XX X
X XX
XXXXX
X X
X X
X X
XXXXX
Even today we have to use 0xF5 in many areas that still do not support
Unicode, like web development (yes, even the latest version of Tomcat
has serious bugs in this area). As further examples I could mention
many other JDBC drivers, or the fact that even with the latest JDK it
is not possible to read from a random access file with an encoding.
What does that mean? It means that when I read data from such sources,
I cannot specify the ISO-8859-2 encoding, and when I get the data, I
get Unicode 0xF5 instead of 0x151 (which is the real Hungarian ő; this
email was written with encoding 8859-2, so if your client supports it,
you should see that first (upper) character).
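
To make the difference concrete, here is a minimal sketch (not driver
code) that decodes the single byte 0xF5 with both charsets. The class
and method names are made up for illustration, and it assumes the JVM
ships the ISO-8859-2 charset, which common JDK builds do.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class ByteF5Demo {
    // Assumption: ISO-8859-2 is available in this JVM.
    static final Charset LATIN2 = Charset.forName("ISO-8859-2");

    /** Decodes the single byte 0xF5 with the given charset and returns the code point. */
    static int decodeF5(Charset cs) {
        byte[] raw = { (byte) 0xF5 };
        return new String(raw, cs).charAt(0);
    }

    public static void main(String[] args) {
        // Same byte, two different characters depending on the declared encoding.
        System.out.printf("ISO-8859-1: U+%04X%n", decodeF5(StandardCharsets.ISO_8859_1)); // U+00F5
        System.out.printf("ISO-8859-2: U+%04X%n", decodeF5(LATIN2));                      // U+0151
    }
}
```

A source that cannot be told about ISO-8859-2 behaves like the first
call: the byte is the right one, but the Java char is 0xF5.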
Quick example: when I read parameters from a web form, I get the
values as ISO-8859-1, so I have 0xF5, not 0x151. The new servlet API
supports encodings in a few cases, but not in every case. Moreover, I
have to write back 0xF5 if I want the browser to understand it. The
same happens when I read from and write to many JDBC sources. This is
not a problem, as internally I know that 0xF5 should be "interpreted"
as Hungarian ő (0x151), and most of the time I don't even have to care
about that, as there are no character-level operations.
After such a long introduction, here comes my problem: when I set
lc_ctype with the new driver, it forces me to pass strings in
ISO-8859-2 format. So if I want to write a string with setString, for
example, it must not contain 0xF5, although I have it everywhere. If
it does, I get the usual arithmetic conversion... exception.
Moreover, when I read the string back, it contains 0x151, which I
cannot write back to the browser (or to a random access file, or to
another SQL server with a different JDBC driver).
I believe that this is a problem not only of the Hungarian language,
but of many Eastern languages. As long as we don't have Unicode
*everywhere*, I think the driver should not force such a translation,
or the developer should be able to switch it off; otherwise a
translation will be required every time a string-database operation
occurs. Unfortunately, *many* software components still expect strings
in ISO-8859-1 encoding.
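
The translation that would otherwise be needed around every database
call can be sketched roughly as follows. This is only an illustration
under assumptions: the class and method names are invented, it assumes
ISO-8859-2 is available in the JVM, and it only round-trips strings
whose characters all exist in ISO-8859-2.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class Latin2Bridge {
    private static final Charset LATIN1 = StandardCharsets.ISO_8859_1;
    // Assumption: ISO-8859-2 is available in this JVM.
    private static final Charset LATIN2 = Charset.forName("ISO-8859-2");

    /** Turns a "Latin-1 view" string (ő stored as U+00F5) into real Unicode (U+0151). */
    static String toRealUnicode(String latin1View) {
        // Chars 0x00-0xFF become the same byte values, then are re-read as ISO-8859-2.
        return new String(latin1View.getBytes(LATIN1), LATIN2);
    }

    /** Reverse direction: real Unicode back to the Latin-1 view other components expect. */
    static String toLatin1View(String realUnicode) {
        // Characters that do not exist in ISO-8859-2 would be replaced with '?'.
        return new String(realUnicode.getBytes(LATIN2), LATIN1);
    }

    public static void main(String[] args) {
        String fromWebForm = "t\u00F5";                 // what a Latin-1 source hands over
        String forDriver = toRealUnicode(fromWebForm);  // what setString would need
        System.out.println(forDriver.equals("t\u0151"));
        System.out.println(toLatin1View(forDriver).equals(fromWebForm));
    }
}
```

Calling such a pair before every setString and after every getString
is exactly the overhead a switch in the driver would avoid.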
We should note that the Firebird lc_ctype encoding is not always the
same as the Java encoding. Of course I will happily implement this
feature myself, as I really need it :)
You may also ask why I would set lc_ctype at all if I want to switch
off the character translation, since in that case the driver would not
use an encoding at all. The question is fair. The answer: if we do not
specify lc_ctype, Firebird does not even accept characters like 0xF5
(same arithmetic exception).
What I would really need is support for lc_ctype (now we have this),
but with the ability to disable the translation (or character
encoding, or whatever its name is) when I need that.
I don't know if such a feature would fit in the official driver; I
hope the answer is yes. I would really like to hear the opinions of
people who develop software with a character set other than the
Western European one.
How would I imagine that? Maybe I pass a parameter through the
Properties object for the Connection that would override the result
string in getIscEncoding() in FBManagedConnection.
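
A rough sketch of that idea, outside the driver: the property key
java_encoding_override and the helper effectiveEncoding are purely
hypothetical names (nothing like them exists in the driver as far as
this message knows), and the lc_ctype value ISO8859_2 and the
credentials are assumed example values. The real getIscEncoding() is
not reproduced here.

```java
import java.util.Properties;

public class EncodingOverrideSketch {
    // Hypothetical property name, invented for this illustration only.
    static final String OVERRIDE_KEY = "java_encoding_override";

    /** Builds connection properties: lc_ctype for the server, plus the hypothetical override. */
    static Properties connectionProps() {
        Properties p = new Properties();
        p.setProperty("user", "sysdba");          // example credentials
        p.setProperty("password", "masterkey");
        p.setProperty("lc_ctype", "ISO8859_2");   // assumed Firebird character set name
        // ISO-8859-1 maps chars 0x00-0xFF to the same byte values, so 0xF5
        // would pass through the driver untouched.
        p.setProperty(OVERRIDE_KEY, "ISO-8859-1");
        return p;
    }

    /**
     * Stand-in for the driver's decision: use the override when present,
     * otherwise the encoding the driver derived from lc_ctype.
     */
    static String effectiveEncoding(Properties props, String derivedFromLcCtype) {
        return props.getProperty(OVERRIDE_KEY, derivedFromLcCtype);
    }

    public static void main(String[] args) {
        System.out.println(effectiveEncoding(connectionProps(), "ISO-8859-2"));
        System.out.println(effectiveEncoding(new Properties(), "ISO-8859-2"));
    }
}
```

With the override absent, behaviour stays as it is in the new code, so
the switch would be opt-in.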
I hope I was clear about my problem, although this is a rather complex
question...
Best wishes,
Daniel