Subject | Re: [Firebird-Java] Re: lc_ctype support |
---|---|
Author | Marczisovszky Daniel |
Post date | 2002-04-09T19:48:39Z |
r> Why do you need to pass 0xF5 to driver? You have to pass 0x151 that
r> is obtained by
r> unicodeStr = new String(win1250Str.getBytes(), "Cp1250");
Because this requires character conversions all over the source code.
Moreover, I cannot pass 0xF5 to a column with the WIN1250 character set
when the lc_ctype for the connection is, let's say, WIN1252. Having
different character sets in Firebird is quite useful for developing
multilanguage applications.
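To make the 0xF5 / 0x151 point concrete, here is a minimal Java sketch. It assumes the JVM ships the windows-1250 and windows-1252 charsets (most full JDKs do), and the class and method names are only illustrative, not part of any driver. It spells out the quoted conversion with an explicit ISO-8859-1 step instead of the platform default, then shows what happens when the resulting character is encoded for a WIN1252 connection.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

final class Win1250Conversion {
    // Reinterpret a "raw" string (one char per byte, values 0x00-0xFF)
    // as text in the named charset.
    static String rawToUnicode(String raw, String charsetName) {
        byte[] bytes = raw.getBytes(StandardCharsets.ISO_8859_1);
        return new String(bytes, Charset.forName(charsetName));
    }

    // Encode text in the named charset and return the first byte, unsigned.
    static int firstByte(String text, String charsetName) {
        return text.getBytes(Charset.forName(charsetName))[0] & 0xFF;
    }

    public static void main(String[] args) {
        String unicode = rawToUnicode("\u00F5", "windows-1250");
        // 0xF5 read as WIN1250 is U+0151
        System.out.printf("U+%04X%n", (int) unicode.charAt(0));
        // Encoding back to WIN1250 gives 0xF5 again
        System.out.printf("0x%02X%n", firstByte(unicode, "windows-1250"));
        // WIN1252 has no U+0151, so Java substitutes '?' (0x3F)
        System.out.printf("0x%02X%n", firstByte(unicode, "windows-1252"));
    }
}
```

The last line is the case I am worried about: once the connection charset cannot represent the character, the original byte is lost before it ever reaches the column.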
>> Java encoding converts Unicode characters to 8 bit characters, but
>> lc_ctype (thus encoding specified for Firebird) *does not make* any
>> conversion. That is an interpretation, not a conversion. When I set
>> lc_ctype to WIN1250, I specify ONLY what characters are allowed in an
>> 8-bit byte array, and how to order the strings with those
>> characters.
>>
>> That is the difference.
r> Sorry, I do not understand you. Firebird does perform the conversion
r> from the encoding of your connection to encoding of the column. In my
r> test case for Ukrainian, I do pass string to DBMS in WIN1251 encoding
r> and it is correctly converted by the DBMS to UNICODE_FSS.
Yes, because when you write a Unicode string to a stream it is UTF-8
encoded, but that is done by Java, not the DBMS. My experience is that
Firebird stores exactly the bytes you write to it.
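The Java-side step can be sketched as follows. This assumes the writer is explicitly configured for UTF-8, which is my assumption for the illustration; the helper names are hypothetical. It only shows that the UTF-8 bytes are produced by Java before anything is sent anywhere.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.OutputStreamWriter;
import java.io.UncheckedIOException;
import java.io.Writer;
import java.nio.charset.StandardCharsets;

final class Utf8StreamDemo {
    // Write a string through a UTF-8 writer and return the bytes that come out.
    static byte[] writeAsUtf8(String text) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        try (Writer writer = new OutputStreamWriter(out, StandardCharsets.UTF_8)) {
            writer.write(text);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return out.toByteArray();
    }

    // Format bytes as space-separated upper-case hex.
    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // U+0151 becomes two bytes in UTF-8
        System.out.println(hex(writeAsUtf8("\u0151"))); // C5 91
    }
}
```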
>> Ok, but how can I pass a correct unicode string, if the encoding
>> is different in a column from the connection lc_ctype.
r> Encoding of what? Unicode string has no encoding, or I am wrong? In
r> correct unicode string, all national characters have the codes that
r> correspond to them in the Unicode table. If you have 0xF5 in a unicode
r> string, that is not 0x151 (even if you mean it) and you can hardly
r> expect the Unicode <-> national charset conversion to work correctly.
That is the reason why I want to switch it off in some cases.
An example will make this clear:
I have one column with WIN1250 and one with WIN1251. One of them
stores Hungarian and the other stores Russian strings. This is a real
example from one of my projects. How can I pass the Russian string to
the second field if lc_ctype is set to WIN1250? The answer is that there
is no way to do that currently, although it would be possible if I
converted the original Unicode string to bytes (or to a string encoded
with ISO-8859-1, that is, a string that holds only 8-bit characters) AND
the JDBC driver made no further translation.
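The workaround I have in mind can be sketched like this. The class and method names are hypothetical, and it assumes the JVM provides the windows-1251 charset. The Russian text is encoded to the column's bytes first and then wrapped in an ISO-8859-1 string, so every char carries exactly one byte value.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

final class RawBytePassThrough {
    // Encode text in the column charset, then wrap the bytes one-to-one
    // in a string whose chars are all in the range 0x00-0xFF.
    static String toRawString(String text, String columnCharset) {
        byte[] bytes = text.getBytes(Charset.forName(columnCharset));
        return new String(bytes, StandardCharsets.ISO_8859_1);
    }

    // Reverse step: unwrap the bytes and decode them in the column charset.
    static String fromRawString(String raw, String columnCharset) {
        byte[] bytes = raw.getBytes(StandardCharsets.ISO_8859_1);
        return new String(bytes, Charset.forName(columnCharset));
    }

    public static void main(String[] args) {
        String russian = "\u0414\u0430"; // Cyrillic "Da"
        String raw = toRawString(russian, "windows-1251");
        for (char c : raw.toCharArray()) {
            System.out.printf("0x%02X ", (int) c); // 0xC4 0xE0
        }
        System.out.println();
        System.out.println(fromRawString(raw, "windows-1251").equals(russian)); // true
    }
}
```

This only helps if nothing between the application and the column re-encodes the string, which is exactly the option I am asking for.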
r> Connection encoding is the encoding in which you will be passing the
r> data. DBMS takes all the responsibility to store data in the encoding
r> specified for column. Driver just needs to provide data in the
r> encoding specified for connection.
>> What I want to make clear is that lc_ctype is _not_ an encoding.
>> That is only a method to make UPPER and ORDER BY work, but that is
>> not an encoding.
r> Where did you read this?
r> API Guide says (page 47):
r> "isc_dpb_lc_ctype String specifying the character set to be
r> utilized".
A character set is not an encoding. A character set is a small subset of
the valid characters: a set of bytes (0x00 - 0xFF) that are valid in
that character set, plus an interpretation of which character each byte
means.
An encoding is an actual conversion. It replaces 16-bit characters with
8-bit bytes.
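The "interpretation" side can be illustrated with a short sketch. The names are illustrative, and it assumes the windows-1250 and windows-1251 charsets are installed in the JVM. The same stored byte maps to three different characters depending only on which character set is used to read it.

```java
import java.nio.charset.Charset;

final class SameByteDifferentCharsets {
    // Interpret one byte value under the named charset.
    static char interpret(int byteValue, String charsetName) {
        byte[] single = {(byte) byteValue};
        return new String(single, Charset.forName(charsetName)).charAt(0);
    }

    public static void main(String[] args) {
        String[] charsets = {"windows-1250", "windows-1251", "ISO-8859-1"};
        for (String name : charsets) {
            // Prints U+0151, U+0445 and U+00F5 respectively
            System.out.printf("0xF5 in %s is U+%04X%n", name, (int) interpret(0xF5, name));
        }
    }
}
```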
r> Language Reference (page 277):
r> "A character set defines the symbols that can be entered as text in a
r> column, and it also defines the maximum number of bytes of storage
r> necessary to represent each symbol.... Each character set also has an
r> implicit collation order that specifies how its symbols are sorted
r> and ordered."
Exactly what I said. This is not an encoding.
r> So, lc_ctype _is_ the character set of the client.
Agreed, it is, but it is still not an encoding. There is no conversion
in most of the cases. Maybe it has some for UNICODE_FSS, but not for the
others.
r> Language Reference (page 285):
r> "SET NAMES specifies the character set the server should use when
r> translating data from the database to the client application.
r> Similarly, when the client sends data to the database, the server
r> translates the data from the client's character set to the database's
r> default character set (or the character set for an individual column
r> if it differs from the database's default character set)."
r> SET NAMES is setting the isc_dpb_lc_ctype for the connection. From
r> the citation you can find that server does perform translation from
r> charset of the column to the client's charset.
This is true, but I'm speaking about those cases where I don't want
such a server-side translation. I pass the strings exactly in the
format that is required for the given column and, with the exception of
UNICODE_FSS, that is a set of 8-bit characters, not Unicode characters.
>> Encoding converts ASCII characters to Unicode characters and vice
>> versa. If you send 0xF5 to Firebird, it will store 0xF5 in the
>> database.
r> Not always (you might try defining the column with ASCII charset
r> 0..127 and try to write 0xF5). It will accept it only if 0xF5 is
r> allowed in the character set of the connection (NONE, UNICODE_FSS,
r> WIN1251, etc.). But it will try to convert it according to the
r> charset of the database or column and throw an exception if it fails.
r> Therefore you cannot store data through the connection with NONE
r> charset into WIN1250 column, simply because Firebird has no hint how
r> the data you supplied must be converted into WIN1250 charset.
>> Maybe, if you don't set lc_ctype it will say that character is not
>> valid, but it won't convert it.
r> Not specifying lc_ctype in DPB is equal to NONE. If you set NONE as
r> the charset, you will be able to read data in WIN1250 columns, but
r> not write.
This is quite a serious issue for me. That is my problem.
>> I think now it is clear that the main problem is I can't pass 0xF5
>> to the JDBC driver, what you should see is that in many cases there
>> is no way to create a "correct" unicode string, because original
>> encoding is not known.
r> If you do not know the encoding, how can you use them in database?
r> Java does not have "unknown" encoding as well, it has "default" one.
r> What prevents you to use Cp1250 as the default encoding for your JVM?
r> Then you are sure that this is true:
r> new String(new byte[]{(byte)0xf5}).charAt(0) == 0x151
The encoding is not unknown, but in many cases it is ISO-8859-1.
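A short sketch of why the quoted one-liner is fragile: the no-argument String constructor decodes with whatever the JVM default charset happens to be, so the result equals 0x151 only on a Cp1250 JVM. The class name is illustrative and the windows-1250 charset is assumed to be available.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

final class DefaultCharsetDemo {
    // Decode the single byte 0xF5 with an explicit charset.
    static int decodeF5(Charset charset) {
        return new String(new byte[] {(byte) 0xF5}, charset).charAt(0);
    }

    public static void main(String[] args) {
        System.out.println("default charset: " + Charset.defaultCharset());
        // Same result as new String(new byte[]{(byte) 0xF5}).charAt(0)
        System.out.printf("default: U+%04X%n", decodeF5(Charset.defaultCharset()));
        // Only this explicit choice is guaranteed to give U+0151
        System.out.printf("Cp1250:  U+%04X%n", decodeF5(Charset.forName("windows-1250")));
        // Under UTF-8 the lone byte is malformed and becomes U+FFFD
        System.out.printf("UTF-8:   U+%04X%n", decodeF5(StandardCharsets.UTF_8));
    }
}
```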
>> Moreover there is absolutely no hope to write data to column with
>> different character set with the current system.
r> Should there be any?
Yes. Please remember your first encoding test with the Ukrainian text.
That was a Ukrainian text encoded with ISO-8859-1. Many sources in
Java provide strings in this format. There is no way to store that now,
although it worked before character conversion was added.
r> In this way you might corrupt data in the
r> database. Connect with correct charset (UNICODE_FSS for example),
r> provide correct data (UTF8 for example) and Firebird will do the rest.
No, it won't. I cannot use proper UPPER and ORDER BY with UNICODE_FSS.
Of course, if it does work, the blame is on me and this argument is
moot, but otherwise...
>> +1 vote to add an option to disable character set conversion. :)
r> -1 vote not to add such option.
r> But sure, you can add this feature in your driver, like a JVM option
r> (please, do not add a constant to GDS.java because this is an API).
Surely, I don't want to hurt the consistency of that.
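Roughly, a JVM option of that kind could look like the sketch below. Everything here is hypothetical: the property name `example.rawStringBytes`, the class and the method are made up for illustration and are not existing driver code or part of GDS.java. When the flag is set, the string's chars are sent as bytes one-to-one; otherwise the connection charset is used as it is today.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

final class ConversionSwitchSketch {
    // Hypothetical system property; not an existing driver option.
    static final String RAW_MODE_PROPERTY = "example.rawStringBytes";

    // Turn a string into the bytes that would go on the wire.
    static byte[] toWireBytes(String value, Charset connectionCharset) {
        if (Boolean.getBoolean(RAW_MODE_PROPERTY)) {
            // Raw mode: each char 0x00-0xFF is passed through as one byte.
            return value.getBytes(StandardCharsets.ISO_8859_1);
        }
        // Normal mode: encode using the connection charset.
        return value.getBytes(connectionCharset);
    }

    public static void main(String[] args) {
        Charset win1251 = Charset.forName("windows-1251");
        System.setProperty(RAW_MODE_PROPERTY, "true");
        System.out.printf("raw:    0x%02X%n", toWireBytes("\u00F5", win1251)[0] & 0xFF); // 0xF5
        System.setProperty(RAW_MODE_PROPERTY, "false");
        System.out.printf("normal: 0x%02X%n", toWireBytes("\u00F5", win1251)[0] & 0xFF); // 0x3F
    }
}
```

It would be enabled with something like `-Dexample.rawStringBytes=true` on the command line, so no constant in the API has to change.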
r> Best regards,
r> Roman Rokytskyy