Subject: Re: [Firebird-Java] Re: lc_ctype support
Author: Marczisovszky Daniel
r> Why do you need to pass 0xF5 to driver? You have to pass 0x151 that
r> is obtained by

r> unicodeStr = new String(win1250Str.getBytes(), "Cp1250");

Because this requires character conversions all over the source code.
Moreover, I cannot pass 0xF5 to a column with WIN1250 encoding when
the lc_ctype for the connection is, let's say, WIN1252. Different
character sets in Firebird are quite useful for developing
multilanguage applications.
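For completeness, the round trip Roman suggests does work when the bytes really are WIN1250 data; my point is only that I should not be forced into it everywhere. A minimal sketch ("Cp1250" is Java's historical name for the windows-1250 codepage):

```java
import java.nio.charset.Charset;

public class Cp1250RoundTrip {
    public static void main(String[] args) {
        Charset cp1250 = Charset.forName("Cp1250"); // windows-1250
        byte[] raw = { (byte) 0xF5 };               // 'ő' in WIN1250
        String s = new String(raw, cp1250);         // decode: 0xF5 -> U+0151
        System.out.println((int) s.charAt(0));      // 337, i.e. 0x151
        byte[] back = s.getBytes(cp1250);           // encode: U+0151 -> 0xF5
        System.out.println(back[0] == (byte) 0xF5); // true
    }
}
```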

>> Java encoding converts Unicode characters to 8-bit characters, but
>> lc_ctype (thus the encoding specified for Firebird) *does not make* any
>> conversion. That is an interpretation, not a conversion. When I set
>> lc_ctype to WIN1250, I specify ONLY what characters are allowed in an
>> 8-bit byte array, and how to order strings with those
>> characters.
>>
>> That is the difference.

r> Sorry, I do not understand you. Firebird does perform the conversion
r> from the encoding of your connection to the encoding of the column. In my
r> test case for Ukrainian, I pass the string to the DBMS in WIN1251 encoding
r> and it is correctly converted by the DBMS to UNICODE_FSS.

Yes, because when you write a Unicode string to a stream it is UTF-8
encoded, but that is done by Java, not the DBMS. My experience is that
Firebird stores exactly the bytes you write to it.
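To illustrate what I mean: the UTF-8 bytes are produced by Java's writer before anything reaches the server. A minimal sketch, with a ByteArrayOutputStream standing in for the network stream:

```java
import java.io.ByteArrayOutputStream;
import java.io.OutputStreamWriter;

public class Utf8ByJava {
    public static void main(String[] args) throws Exception {
        ByteArrayOutputStream buf = new ByteArrayOutputStream(); // fake wire
        OutputStreamWriter out = new OutputStreamWriter(buf, "UTF-8");
        out.write('\u0151'); // 'ő': Java performs the UTF-8 encoding here
        out.close();
        byte[] sent = buf.toByteArray();
        // U+0151 leaves Java as the two UTF-8 bytes 0xC5 0x91
        System.out.printf("%02X %02X%n", sent[0] & 0xFF, sent[1] & 0xFF);
    }
}
```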

>> Ok, but how can I pass a correct Unicode string if the encoding of a
>> column is different from the connection's lc_ctype?

r> Encoding of what? A Unicode string has no encoding, or am I wrong? In a
r> correct Unicode string, all national characters have the codes that
r> correspond to them in the Unicode table. If you have 0xF5 in a Unicode
r> string, that is not 0x151 (even if you mean it) and you can hardly expect
r> the Unicode <-> national charset conversion to work correctly.

That is the reason why I want to switch it off in some cases.


An example will make this clear:

I have two columns, one with WIN1250 and one with WIN1251. One of them
stores Hungarian and the other stores Russian strings. This is a real
example from one of my projects. How can I pass the Russian string to the
second column if lc_ctype is set to WIN1250? The answer is that there is
currently no way to do that, although it would be possible if I converted
the original Unicode string to bytes (or to a string encoded with
ISO-8859-1, which effectively preserves the raw byte values) AND the JDBC
driver made no further translation.
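A sketch of the bypass I have in mind. The setString call at the end is hypothetical, since it only works if the driver really passes the bytes through without any further translation:

```java
import java.nio.charset.Charset;

public class Win1251Bypass {
    public static void main(String[] args) {
        String russian = "\u0414\u0430"; // "Да" as a proper Unicode string
        // Encode it myself into the bytes the WIN1251 column expects
        byte[] win1251 = russian.getBytes(Charset.forName("Cp1251")); // 0xC4 0xE0
        // Wrap those bytes in a String via ISO-8859-1, which maps bytes
        // 0x00-0xFF one-to-one onto characters U+0000-U+00FF
        String transport = new String(win1251, Charset.forName("ISO-8859-1"));
        System.out.println((int) transport.charAt(0)); // 196, i.e. 0xC4
        // stmt.setString(1, transport); // hypothetical: needs a pass-through driver
    }
}
```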


r> The connection encoding is the encoding in which you will be passing the
r> data. The DBMS takes all the responsibility for storing data in the
r> encoding specified for the column. The driver just needs to provide data
r> in the encoding specified for the connection.

>> What I want to make clear is that lc_ctype is _not_ an encoding.
>> That is only a method to make UPPER and ORDER BY work, but that is
>> not an encoding.

r> Where did you read this?

r> API Guide says (page 47):

r> "isc_dpb_lc_ctype String specifying the character set to be
r> utilized".

A character set is not an encoding. A character set is a subset of
valid characters: a set of bytes (0x00-0xFF) that are valid in that
character set, together with an interpretation of which character each
of those bytes means.

An encoding is an actual conversion: it replaces 16-bit characters
with 8-bit bytes.

r> Language Reference (page 277):

r> "A character set defines the symbols that can be entered as text in a
r> column, and it also defines the maximum number of bytes of storage
r> necessary to represent each symbol.... Each character set also has an
r> implicit collation order that specifies how its symbols are sorted
r> and ordered."

Exactly what I said. This is not an encoding.

r> So, lc_ctype _is_ the character set of the client.

Agreed, it is, but it is still not an encoding. There is no conversion in
most of the cases. Maybe there is one for UNICODE_FSS, but not for the
others.

r> Language Reference (page 285):

r> "SET NAMES specifies the character set the server should use when
r> translating data from the database to the client application.
r> Similarly, when the client sends data to the database, the server
r> translates the data from the client's character set to the database's
r> default character set (or the character set for an individual column
r> if it differs from the database's default character set)."

r> SET NAMES is setting the isc_dpb_lc_ctype for the connection. From
r> the citation you can find that server does perform translation from
r> charset of the column to the client's charset.

This is true, but I'm speaking about those cases where I don't want
such a server-side translation. I pass the strings exactly in the
format required for the given column, and with the exception of
UNICODE_FSS that is a sequence of 8-bit characters, not Unicode
characters.

>> Encoding converts ASCII characters to Unicode characters and vice
>> versa. If you send 0xF5 to Firebird, it will store 0xF5 in the
>> database.

r> Not always (you might try defining the column with ASCII charset
r> 0..127 and try to write 0xF5). It will accept it only if 0xF5 is
r> allowed in the character set of the connection (NONE, UNICODE_FSS,
r> WIN1251, etc.). But it will try to convert it according to the
r> charset of the database or column and throw an exception if it fails.
r> Therefore you cannot store data through the connection with NONE
r> charset into WIN1250 column, simply because Firebird has no hint how
r> the data you supplied must be converted into WIN1250 charset.

>> Maybe, if you don't set lc_ctype it will say that character is not
>> valid, but it won't convert it.

r> Not specifying lc_ctype in DPB is equal to NONE. If you set NONE as
r> the charset, you will be able to read data in WIN1250 columns, but
r> not write.

This is quite a serious issue for me. That is my problem.

>> I think now it is clear that the main problem is that I can't pass 0xF5
>> to the JDBC driver. What you should see is that in many cases there
>> is no way to create a "correct" Unicode string, because the original
>> encoding is not known.

r> If you do not know the encoding, how can you use the data in a database?
r> Java does not have an "unknown" encoding either; it has a "default" one.
r> What prevents you from using Cp1250 as the default encoding for your JVM?
r> Then you are sure that this is true:

r> new String(new byte[]{(byte)0xf5}).charAt(0) == 0x151

The encoding is not unknown, but in many cases it is ISO-8859-1.

>> Moreover, there is absolutely no hope of writing data to a column with
>> a different character set with the current system.

r> Should there be any?

Yes. Please remember your first encoding test with the Ukrainian text.
That was a Ukrainian text encoded with ISO-8859-1. Many sources in
Java provide strings in this format. There is no way to store that,
although it worked before character conversion was added.

r> In this way you might corrupt data in the
r> database. Connect with correct charset (UNICODE_FSS for example),
r> provide correct data (UTF8 for example) and Firebird will do the rest.

No, it won't. I cannot use proper UPPER and ORDER BY with UNICODE_FSS.
Of course, if that works, blame me and this argument has no meaning, but
otherwise...

>> +1 vote to add an option to disable character set conversion. :)

r> -1 vote: do not add such an option.

r> But sure, you can add this feature in your driver, like a JVM option
r> (please do not add a constant to GDS.java, because this is an API).

Of course; I don't want to hurt the consistency of that.

r> Best regards,
r> Roman Rokytskyy

