Subject | background about encodings |
---|---|
Author | Marczisovszky Daniel |
Post date | 2002-04-08T23:40:29Z |
Hi,
This may be a bit off-topic, but I see there is confusion about Java
encodings. I'm sorry for those who already know all this, but I think
it is an important question.
A long, long time ago... only people in the United States had
computers. They have only 26 characters in their alphabet, so the
ASCII standard (which allows only 7 bits, thus 128 characters) was
perfect for them. But computers started to conquer the world :)
Nations whose languages use more than 26 characters had to find
solutions. This led to extending the ASCII table to 8 bits, and
everybody added their own characters to the upper 128 positions.
Because in every country the upper 128 characters meant something
completely different, character sets were born. So if you have text in
a Russian character set, you *display* different characters from a
font for the same byte value than you would with a Western European
character set. Simply put, the character 0xF5 (this will be my
favourite one :)) has many faces. Depending on the character set, or
the encoding, the byte 0xF5 appears sometimes as a Russian, a Greek or
a Western character. It depends completely on the encoding, and from a
given string *nobody* is able to determine the encoding if it is not
given. There are attempts to do that, but they require knowledge about
the languages involved, so I believe it is impossible with our current
technologies.
Java wanted to work around this problem, so it introduced encodings.
What the hell is an encoding? Internally, in your Java program, every
character is stored as a Unicode character. So if you want to display
a Russian character that in the dark ages you stored as 0xF5, now you
have to store its Unicode value, which will be somewhere around
0x400-0x460...
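For example, a minimal sketch (the particular letters are just an
illustration):

```java
public class UnicodeDemo {
    public static void main(String[] args) {
        // Java char values are Unicode code units, not raw bytes.
        char russian = '\u0444';   // CYRILLIC SMALL LETTER EF
        char hungarian = '\u0151'; // LATIN SMALL LETTER O WITH DOUBLE ACUTE
        System.out.printf("0x%04X 0x%04X%n", (int) russian, (int) hungarian);
    }
}
```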
But we have a big inheritance from those old ages. We have browsers,
standards, software, operating systems and database systems that still
use this "character set" thing instead of the clean Unicode system.
Java's Unicode support is only useful as long as you stay inside the
Java system, but a Java system has to read and write data from outside
as well.
How to do that? How can we know which Unicode character should be
stored in the string when we read 0xF5? The answer is: we don't know.
The developer has to specify that.
That's why Sun gave encodings to us. ;) When you specify an encoding
for reading, you actually specify a table with 256 elements. With the
value of the byte you read, you look up the appropriate Unicode
character for it. For example, with the Hungarian character set the
byte 0xF5 will be converted to 0x151.
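A minimal sketch of this lookup in Java (ISO-8859-2 is the Hungarian
table mentioned above):

```java
public class DecodeDemo {
    public static void main(String[] args) throws Exception {
        // Decoding: look up the Unicode character for the byte 0xF5
        // in the table named by the encoding.
        byte[] raw = { (byte) 0xF5 };
        String s = new String(raw, "ISO-8859-2");
        System.out.printf("U+%04X%n", (int) s.charAt(0)); // prints U+0151
    }
}
```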
When you write the string back and specify the same encoding, Unicode
0x151 in Java becomes 0xF5 again; this is exactly the reverse process.
If I specify ISO-8859-2, then 0xF5 will be converted to 0x151, and it
will be converted back to 0xF5 when I write it back (to a text file,
to a JDBC database, to a browser from a servlet, etc.).
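The reverse direction, sketched the same way:

```java
public class EncodeDemo {
    public static void main(String[] args) throws Exception {
        // Encoding: the reverse lookup, Unicode character -> byte.
        String s = "\u0151";
        byte[] raw = s.getBytes("ISO-8859-2");
        System.out.printf("0x%02X%n", raw[0] & 0xFF); // prints 0xF5
    }
}
```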
If I don't specify the encoding, the byte 0xF5 is read as 0xF5: it is
not converted, and if we don't touch it in the Java program, it is
written back as 0xF5.
But what happens with our JDBC driver? Firebird carries this
inheritance too, and it usually expects 8-bit characters (bytes) from
the client. The JDBC driver converts the Unicode string, which may
contain any valid Unicode character, according to an encoding derived
from lc_ctype. But that is not good. As I mentioned, no one is able to
determine from a string alone which encoding should be used to convert
it to a series of bytes. Currently the driver *assumes* this encoding
is the same as lc_ctype. But if the string was created with a
different encoding, let's say from a text file, then the driver will
"kill" a few of our characters, because the encoding that was used to
create the string contains different Unicode characters than the
encoding the driver assumes. We should not forget that an encoding
covers only 256 valid Unicode characters (this is not exactly true, as
there are hacks as usual, like UTF-8, etc., but I want to show the
point).
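A sketch of how such a mismatch "kills" a character (this particular
pair of encodings is just an illustration):

```java
public class MismatchDemo {
    public static void main(String[] args) throws Exception {
        // Decoded from a file with one encoding...
        String s = new String(new byte[] { (byte) 0xF5 }, "ISO-8859-2"); // U+0151
        // ...then re-encoded with a table that has no U+0151 in it.
        byte[] out = s.getBytes("ISO-8859-1");
        System.out.printf("0x%02X%n", out[0] & 0xFF); // prints 0x3F, i.e. '?'
    }
}
```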
So it's good that the driver can automatically convert a Unicode
string (one that may contain characters with codes greater than 0xFF)
without the developer specifying which character set to use: it simply
guesses from lc_ctype. But because this process is nothing other than
converting a series of Unicode characters to bytes according to a
given table, it should not be forced. If the string has already been
converted to its byte representation, one should be able to write it
to the database without any further conversion. One may say: use
setBytes and getBytes. That is true, but I think it is really
inconvenient.
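For reference, that workaround would look something like this (a
minimal sketch; the connection URL, table and column names are made
up for the example):

```java
import java.sql.*;

public class RawBytesDemo {
    public static void main(String[] args) throws Exception {
        // Encode the string ourselves, then hand the driver raw bytes,
        // so it performs no further conversion of its own.
        Class.forName("org.firebirdsql.jdbc.FBDriver");
        Connection con = DriverManager.getConnection(
                "jdbc:firebirdsql:localhost/3050:/tmp/test.gdb",
                "sysdba", "masterkey");
        PreparedStatement ps =
                con.prepareStatement("INSERT INTO people (name) VALUES (?)");
        ps.setBytes(1, "k\u0151".getBytes("ISO-8859-2"));
        ps.executeUpdate();
        ps.close();
        con.close();
    }
}
```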
I should say that this is not a bug in the Java SDK, nor in Firebird
or the driver. This is a bad inheritance that will stay with us for
4-5 years, maybe longer. It is the result of different character
storage systems. The future is the way Java shows, but it will take a
long time to switch to Unicode in every area. I believe we currently
have to live with this problem, and we should provide alternatives to
the developer.
Best wishes,
Daniel
I promise this was my last long mail ;)