Subject | Re: [Firebird-Java] RE: Problems reading diacritic characters |
---|---|
Author | Rick Fincher |
Post date | 2004-03-30T22:43:20Z |
Hi Prabhakar,
The following is the FAQ entry on character sets:
46- Can you explain how character sets work?
Character handling with JayBird can be confusing because the Java VM,
Firebird database, browser, and JayBird Connection all have a charset
associated with them.
Also, the Firebird server attempts to transliterate the internal charset
of a database to the charset specified in the connection. JayBird likewise
attempts to translate between the JVM charset and the Firebird server
charset specified in the connection.
With Firebird the character encoding of the text data stored in the
database is set when the database is created. That applies to the char
and varchar columns, and type 1 blobs (text blobs). You can override the
default charset for columns with the appropriate SQL commands when the
columns are created. Be careful if you do this or you may end up with
two columns in the same table that you can't read with the same connection.
This problem can only happen when a table mixes columns that have the
"NONE" character set with columns that have some other character set
("UNICODE_FSS", "WIN1252", etc.). The server tries to convert characters
from the encoding specified for the column into the encoding specified
for the connection, but the "NONE" character set allows only one-way
conversion: from <any> to "NONE". In that case the server simply returns
the bytes as they were written to the database. So if you have a table
such as:
CREATE TABLE charset_table(
col1 VARCHAR(10) CHARACTER SET WIN1252,
col2 VARCHAR(10) CHARACTER SET NONE
)
you will not be able to modify both columns in the same SQL statement,
and it does not matter whether you use "NONE", "WIN1252" or "UNICODE_FSS"
for the connection.
The only possible way to solve this problem is to use the character set
"OCTETS". This is an artificial character set, similar to "NONE" (data
are written and read as byte arrays), but bi-directional translation
rules exist between "OCTETS" and every other character set (including
"NONE"). You can specify "OCTETS" for the connection and then decode the
byte arrays you receive from the server yourself; the driver would do the
byte-array-to-string conversion incorrectly, since it gets no hint about
the character set from the server.
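As an illustration only (not from the FAQ), this is roughly what that
decoding could look like for the table above, assuming a connection opened
with lc_ctype=OCTETS; Cp1252 is the Java name for WIN1252, and exception
handling is omitted:

Statement stmt = connection.createStatement();
ResultSet rs = stmt.executeQuery("SELECT col1, col2 FROM charset_table");
while (rs.next()) {
    // With an OCTETS connection the driver hands back raw bytes.
    byte[] b1 = rs.getBytes(1);   // stored as WIN1252
    byte[] b2 = rs.getBytes(2);   // stored as NONE (raw bytes)
    String s1 = (b1 == null) ? null : new String(b1, "Cp1252");
    // For the NONE column you must know what encoding the writer used;
    // Cp1252 here is only an assumption.
    String s2 = (b2 == null) ? null : new String(b2, "Cp1252");
}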
Let's say your program or web app prompts the user to type a string. If
the user types "abc" into a Java text box in your app, or into a text box
in a browser for a web app, Java creates a Unicode string for those three
characters (Java strings are always Unicode internally). INSERTing this
into a text column over a connection with no character set results in the
encoded bytes being inserted without translation.
Note: you have to pass correct Unicode strings to the driver. What is a
"correct Unicode" string? It is easier to explain what is not one.
Let's assume you have a normal text file in WIN1251 encoding. In this
case Cyrillic characters from the Unicode table (values between 0 and
65535) are mapped onto characters with values 0-255. However, your
regional settings say that you're in Germany, which means that
file.encoding will be set to Cp1252 when the JVM starts. If you now open
the file and construct a reader without specifying that the character
encoding is Cp1251, Java will read the file and construct your strings,
but every Cyrillic character will be replaced by the Cp1252 character
that has the same numeric value.
These strings are valid Unicode strings, but their content is not the
content of the file. Interestingly enough, if you write such strings back
into a file and open it in a text editor, telling the editor that this is
WIN1251 text, you will see the correct text.
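A minimal sketch of the difference (not from the FAQ; the file name is a
placeholder, classes are from java.io, and exception handling is omitted):

// Wrong: FileReader always uses the JVM default encoding (file.encoding),
// so on a Cp1252 system the Cyrillic bytes become Cp1252 characters.
BufferedReader wrong = new BufferedReader(new FileReader("data.txt"));

// Right: name the file's encoding explicitly when constructing the reader.
BufferedReader right = new BufferedReader(
        new InputStreamReader(new FileInputStream("data.txt"), "Cp1251"));
String line = right.readLine();   // now a correct Unicode string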
If you open your connection to the database without specifying a
character set (lc_ctype) it defaults to NONE and no translation is done.
So when you SELECT the previously inserted data from the database and
display it in your program you get the same string you entered, right?
Well, not necessarily.
You will get a string containing the same bytes that were stored, but if
the user reading that string from the database has a different default
charset in their Java VM, those bytes will display differently.
The JVM usually picks up its locale dependent character encoding from
the underlying operating system, but it can also be set when you invoke
the JVM by using -Dfile.encoding=Cp1252, for example. If you attempt to
display characters that aren't in your default JVM encoding, they appear
as '?'.
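You can check which encoding your JVM actually picked up, for example:

// Prints the default encoding the JVM is using (either detected from the
// OS or given with -Dfile.encoding=... on the command line).
System.out.println(System.getProperty("file.encoding"));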
The only way to ensure you always get back what you put in is to create
the database with a charset and set lc_ctype to the same charset when
you open the connection to that database.
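With JayBird that charset goes into the connection properties. A rough
sketch (the driver class name is standard JayBird, but the URL, path, user,
and password are placeholders; exception handling is omitted):

// Load the JayBird driver and connect with lc_ctype set on the connection.
Class.forName("org.firebirdsql.jdbc.FBDriver");
Properties props = new Properties();
props.setProperty("user", "SYSDBA");
props.setProperty("password", "masterkey");
props.setProperty("lc_ctype", "WIN1251");   // match the database charset
Connection connection = DriverManager.getConnection(
        "jdbc:firebirdsql:localhost/3050:/data/mydb.fdb", props);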
If you want to use charsets other than NONE and you have lots of data in
databases with a charset of NONE, you may have to set up a new database
with a different charset and use a data pump to transfer data over, or
write a small program to do the transfer.
Using UNICODE_FSS works well nearly everywhere, but it may increase the
size of your databases if you have a lot of text, because UNICODE_FSS
characters can be up to three bytes long.
There are some limitations to the UNICODE_FSS character set: there is
only one collation, in which strings are sorted in their natural order,
with no collation rules for different languages; there are also some
issues when converting strings to upper case, etc. More information on
these anomalies can be found in the Firebird-Support group.
Again, the default charset for Firebird is NONE. The Firebird server
does no translation with this charset.
If your database has a charset of NONE and you set a charset type on the
connection (lc_ctype) that is not NONE, you can write to the database
but you can't read from it without getting the "Cannot transliterate
between character sets" exception.
The example above followed a string as it was inserted into the database
and later selected from it, with the charset left as NONE.
See the freely available InterBase 6 PDF manual "Data Definition Guide"
for a list of available charsets.
The WIN125x or ISO8859_1 charsets may be a good choice if you need
non-English characters but want the compactness of 1-byte characters.
With these charsets you can specify many different national language
collation orders in the ORDER BY clause of a SELECT statement.
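For instance (an illustration, not from the FAQ; the table and columns are
placeholders), against an ISO8859_1 database you could ask for the German
collation DE_DE, one of the ISO8859_1 collations listed in the Data
Definition Guide:

// Sort last names using the German collation of ISO8859_1.
Statement stmt = connection.createStatement();
ResultSet rs = stmt.executeQuery(
        "SELECT LNAME, FNAME FROM nameTable ORDER BY LNAME COLLATE DE_DE");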
Let's look at the same example above, but this time we will insert into
a database that has been created with a charset of WIN1251.
When you open the connection to the database you set the
lc_ctype=WIN1251. Then insert the string 'abc' into the appropriate
column. JayBird takes the Unicode-encoded Java String "abc", converts it
to WIN1251 format, and sends it to the database server for
insertion. Since the database is already in WIN1251 format, the server
does not have to translate. When the string is read back from the
database it is converted back to the Java VM format by JayBird.
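In code the round trip needs no manual conversion on your side. A small
sketch (not from the FAQ; the table and column names are placeholders and
exception handling is omitted):

// With lc_ctype=WIN1251 on the connection, JayBird converts between the
// Java Unicode string and WIN1251 in both directions.
PreparedStatement ins = connection.prepareStatement(
        "INSERT INTO some_table(text_col) VALUES(?)");
ins.setString(1, "abc");
ins.executeUpdate();

Statement sel = connection.createStatement();
ResultSet rs = sel.executeQuery("SELECT text_col FROM some_table");
while (rs.next()) {
    String back = rs.getString(1);   // already a normal Java Unicode string
}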
It is also possible to set an lc_ctype on a connection that is different
from the charset of the database. This lets the database server do the
translating from one charset to another. It is a feature of the Firebird
server that lets programming languages or programs that require specific
character formats connect to the database without requiring the data to
be stored in that format.
You can also avoid problems by using java.sql.PreparedStatement instead
of java.sql.Statement and not building SQL strings out of concatenated
Java strings. For example:
String sqlString, firstName = "John", lastName = "O'Neal";
sqlString = "INSERT INTO nameTable (LNAME, FNAME) VALUES('"
        + lastName + "', '" + firstName + "')";
Statement stmt = connection.createStatement();
int insertedRows = stmt.executeUpdate(sqlString);
The problem here is that if the user types in data for these strings you
might end up with illegal characters, or the translation might not be
correct.
In the example above, the following illegal SQL string would be
generated and cause an exception to be thrown because of the apostrophe
in O'Neal:
INSERT INTO nameTable (LNAME, FNAME) VALUES('O'Neal', 'John')
To avoid this, use a prepared statement like in the example below.
PreparedStatement stmt = connection.prepareStatement(
        "INSERT INTO nameTable(LNAME, FNAME) VALUES(?, ?)");
stmt.setString(1, lastName);
stmt.setString(2, firstName);
int insertedRows = stmt.executeUpdate();
if (insertedRows != 1)
throw new MyInsertFailedException("Could not insert data");
------
Prabhakar Krishnaswami wrote:
>We had some code something like this
>
> sqlBuffer.append("_UNICODE_FSS '");
> if (temp != null)
> {
> sqlBuffer.append(escapeSQLQuotes(temp));
> }
> sqlBuffer.append("'");
>
>Here temp contains the string to be appended. We insert this string into
>the database.
>
>Prabhakar Krishnaswami
>
>Software Engineer
>
>
>
>Dynix
>
>pkrishnaswami@...
>
>p 801-223-5817
>
>f 801-223-5202
>
>www.dynix.com
>
>
>-----Original Message-----
>From: Roman Rokytskyy [mailto:rrokytskyy@...]
>Sent: Tuesday, March 30, 2004 1:31 PM
>To: Firebird-Java@yahoogroups.com
>Subject: [Firebird-Java] Re: Unnecessary padding to Strings read
>
>
>
>>We have had some problems storing strings with diacritic characters
>>in interbase database using firebird driver v1.5 on JBoss 320
>>environment on Redhat Linux Adverserver server 3.0. We saved the
>>data prepending _UNICODE_FSS to strings.
>>
>>
>
>What do you mean by prepending? Something like this:
>
>stmt.setString(1, "_UNICODE_FSS" + myStr);
>
>Where did you get such idea?
>
>
>
>
>>When we later displayed
>>them in the browser, the strings displayed garbage.
>>
>>
>
>Please read our FAQ about using national characters.
>
>
>
>>When then
>>removed _UNICODE_FSS and used the parameter lc_ctype=UNICODE_FSS
>>while connecting to the database.
>>That didn't change anything.
>>
>>Can somebody help me with this issue?
>>
>>
>
>Can you create example code that will show exactly what you did
>(including database creation script)?
>
>Roman
>
>
>