Subject Re: [Firebird-Java] Character Sets
Author Rick Fincher
Hi Lucio,

We recently added a FAQ item on this (# 46) but it may not be in your
version of the FAQ, so I included it below.

Hope this helps,

Rick

----- Original Message -----

> Hello everybody.
> I have a question:
> How can I do an UPDATE or INSERT query with string values when some
> of them have "Latin" characters like ÑáÁéÉí, etc.?
> I must insert the string 'NUÑEZ' into a varchar column, but when I
> try it, I get an SQLException:
>
> org.firebirdsql.jdbc.FBSQLException: GDS Exception. 335544321.
> arithmetic exception, numeric overflow, or string truncation
> Cannot transliterate character between character sets
>
> My table column character set is ISO_8859_1, but I had tried, with
> other ones like: UNICODE_FSS,ASCII,WIN_1252, etc.
>
> Thank you for your help.


46- Can you explain how character sets work?


Character handling with JayBird can be confusing because the Java VM,
Firebird database, browser, and JayBird Connection all have a charset
associated with them.

Also, the Firebird server attempts to transliterate the internal charset of
a database to the charset specified in the connection. JayBird, in turn,
attempts to translate the JVM charset to the Firebird server charset
specified in the connection.

With Firebird the character encoding of the text data stored in the database
is set when the database is created. That applies to the char and varchar
columns, and type 1 blobs (text blobs). You can override the default charset
for columns with the appropriate SQL commands when the columns are created.
Be careful if you do this or you may end up with two columns in the same
table that you can't read with the same connection.

This problem can only happen when you have a table with columns that have
the "NONE" character set alongside some other character set ("UNICODE_FSS",
"WIN1252", etc.). The server tries to convert characters from the encoding
specified for the column into the encoding specified for the connection,
but the "NONE" character set allows only one-way conversion: from <any> to
"NONE". In this case the server simply returns the bytes as they were
written to the database. So if you have the table:

CREATE TABLE charset_table(
col1 VARCHAR(10) CHARACTER SET WIN1252,
col2 VARCHAR(10) CHARACTER SET NONE
)

you will not be able to use both columns in the same SQL statement, and it
does not matter whether you use "NONE", "WIN1252" or "UNICODE_FSS" for the
connection.

The only way around this is the character set "OCTETS". This is a kind of
artificial character set, similar to "NONE" (data are written and read as
byte arrays), but bi-directional translation rules exist between "OCTETS"
and every other character set (including "NONE"). You can specify "OCTETS"
for the connection and then decode the byte arrays you receive from the
server yourself; do not rely on the driver's byte-array-to-string
conversion, because it gets no hint about the character set from the server
and will convert incorrectly.
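
For example, reading the charset_table above through an "OCTETS" connection
might look like the following sketch (not tested; the host, path, and login
are placeholders, exception handling is omitted, and we assume the NONE
column also happens to hold WIN1252 bytes):

import java.sql.*;
import java.util.Properties;

Class.forName("org.firebirdsql.jdbc.FBDriver");

Properties props = new Properties();
props.setProperty("user", "sysdba");
props.setProperty("password", "masterkey");
props.setProperty("lc_ctype", "OCTETS"); // get raw bytes, decode ourselves

Connection con = DriverManager.getConnection(
    "jdbc:firebirdsql:localhost/3050:/db/test.fdb", props);
Statement stmt = con.createStatement();
ResultSet rs = stmt.executeQuery("SELECT col1, col2 FROM charset_table");
while (rs.next()) {
    // Cp1252 is the Java name for the WIN1252 encoding
    String col1 = new String(rs.getBytes(1), "Cp1252");
    String col2 = new String(rs.getBytes(2), "Cp1252");
    System.out.println(col1 + " " + col2);
}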

Let's say your program or web app prompts the user to type a string. If the
user types "abc" into a Java text box in your app, or into a text box in a
browser for a web app, Java creates a String of those three characters in
Unicode (Java stores strings internally as Unicode, two bytes per
character). What reaches the database, though, is not the String itself but
its encoding into bytes, and that encoding depends on the connection
charset.

INSERTing this into a text column over a connection with charset NONE
results in those bytes being inserted without translation.
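
A few lines of Java show how the same String turns into different byte
arrays under different encodings (a sketch; exception handling omitted, and
the encoding names are the standard JDK ones):

// 'NUÑEZ' from the question above: five Unicode characters
String s = "NU\u00D1EZ";

byte[] utf = s.getBytes("UTF8");   // 6 bytes: the Ñ needs 2 bytes in UTF-8
byte[] win = s.getBytes("Cp1252"); // 5 bytes: one byte per character
byte[] def = s.getBytes();         // whatever the JVM default produces

System.out.println(utf.length + " " + win.length + " " + def.length);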

Note: you have to pass correct Unicode strings to the driver. What is a
"correct Unicode string"? It is easier to explain what is not one.

Let's assume you have a normal text file in WIN1251 encoding. Cyrillic
characters from the Unicode table (values between 0 and 65535) are mapped
in that file onto byte values 0-255. However, your regional settings say
that you're in Germany, so file.encoding will be set to Cp1252 when the JVM
starts. If you now open the file and construct a reader without specifying
that the character encoding is Cp1251, Java will read the file and
construct your strings, but every Cyrillic character will be replaced by
the Cp1252 character that has the same numeric value.

These strings are valid Unicode strings, but their content is not the
content you read from the file. Interestingly enough, if you write such
strings back into a file and open it in a text editor telling it that the
text is WIN1251, you will see the correct text.
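
In code, the cure is to name the encoding explicitly when you construct the
reader instead of relying on file.encoding (a sketch; the file name is made
up):

import java.io.*;

// Wrong on the German machine: FileReader silently uses Cp1252
BufferedReader wrong = new BufferedReader(new FileReader("russian.txt"));

// Right: decode the bytes as Cp1251, no matter what file.encoding says
BufferedReader right = new BufferedReader(
    new InputStreamReader(new FileInputStream("russian.txt"), "Cp1251"));

String line = right.readLine(); // now a genuinely correct Unicode string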

If you open your connection to the database without specifying a character
set (lc_ctype) it defaults to NONE and no translation is done. So when you
SELECT the previously inserted data from the database and display it in your
program you get the same string you entered, right? Well, not necessarily.

You will get a string built from the same bytes that were stored, but if
the user fetching that string from the database has a different default
charset in his Java VM, those bytes will decode, and therefore display,
differently.

The JVM usually picks up its locale-dependent character encoding from the
underlying operating system, but it can also be set when you invoke the
JVM, for example with -Dfile.encoding=Cp1252. If you attempt to display
characters that aren't in your default JVM encoding, they appear as '?'.
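
A one-liner shows you which default your JVM picked up:

System.out.println(System.getProperty("file.encoding"));

Run it again with -Dfile.encoding=Cp1252 on the command line and you will
see the override take effect.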

The only way to ensure you always get back what you put in is to create the
database with a charset and to set lc_ctype to that same charset when you
open the connection to the database.
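
For example, if the database was created with DEFAULT CHARACTER SET
WIN1252, open the connection like this (a sketch; host, path, and login are
placeholders):

Properties props = new Properties();
props.setProperty("user", "sysdba");
props.setProperty("password", "masterkey");
props.setProperty("lc_ctype", "WIN1252"); // same charset the db was created with

Connection con = DriverManager.getConnection(
    "jdbc:firebirdsql:localhost/3050:/db/test.fdb", props);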

If you want to use charsets other than NONE and you have lots of data in
databases with a charset of NONE, you may have to set up a new database
with a different charset and use a data pump to transfer the data over, or
write a small program to do the transfer.
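
Such a transfer program can be as simple as the following sketch (the table
and URLs are hypothetical; it assumes the bytes in the old NONE database
are really WIN1252, and that the JVM runs with -Dfile.encoding=Cp1252 so
the NONE connection decodes them correctly):

import java.sql.*;
import java.util.Properties;

Class.forName("org.firebirdsql.jdbc.FBDriver");

Properties oldProps = new Properties();
oldProps.setProperty("user", "sysdba");
oldProps.setProperty("password", "masterkey");
oldProps.setProperty("lc_ctype", "NONE");    // old database: raw bytes

Properties newProps = new Properties();
newProps.setProperty("user", "sysdba");
newProps.setProperty("password", "masterkey");
newProps.setProperty("lc_ctype", "WIN1252"); // new database charset

Connection src = DriverManager.getConnection(
    "jdbc:firebirdsql:localhost/3050:/db/old.fdb", oldProps);
Connection dst = DriverManager.getConnection(
    "jdbc:firebirdsql:localhost/3050:/db/new.fdb", newProps);

Statement sel = src.createStatement();
ResultSet rs = sel.executeQuery("SELECT lname, fname FROM nameTable");
PreparedStatement ins = dst.prepareStatement(
    "INSERT INTO nameTable(lname, fname) VALUES(?, ?)");
while (rs.next()) {
    ins.setString(1, rs.getString(1));
    ins.setString(2, rs.getString(2));
    ins.executeUpdate();
}
rs.close();
sel.close();
ins.close();
src.close();
dst.close();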

Using UNICODE_FSS works well nearly everywhere, but it may increase the
size of your databases if you store lots of text, because UNICODE_FSS uses
up to 3 bytes per character.

There are some limitations to the UNICODE_FSS character set: there is only
one collation, which sorts strings in their natural (binary) order, with no
collation rules for different languages; there are also some issues with
converting strings to upper case, etc. More information on these anomalies
can be found in the Firebird-Support group.

Again, the default charset for Firebird is NONE. The Firebird server does no
translation with this charset.

If your database has a charset of NONE and you set a charset type on the
connection (lc_ctype) that is not NONE, you can write to the database but
you can't read from it without getting the "Cannot transliterate between
character sets" exception.

So far we have followed a string as it was inserted into the database and
later selected from it, with the database charset left as NONE.

See the freely available InterBase 6 PDF manual "Data Definition Guide" for
a list of the available charsets.

The WIN125X or ISO8859_1 charsets may be a good choice if you need
non-English characters but want the compactness of 1-byte characters. With
these charsets you can specify many different national-language collation
orders in the ORDER BY clause of a SELECT statement.
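
For example, with an ISO8859_1 column you could ask for a German collation
(a sketch; it assumes a Statement stmt on an lc_ctype=ISO8859_1 connection,
the table and column are made up, and DE_DE is one of the ISO8859_1
collations listed in the Data Definition Guide):

ResultSet rs = stmt.executeQuery(
    "SELECT lname FROM nameTable ORDER BY lname COLLATE DE_DE");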

Let's look at the same example above, but this time we will insert into a
database that has been created with a charset of WIN1251.

When you open the connection to the database you set lc_ctype=WIN1251, then
insert the string 'abc' into the appropriate column. JayBird has to take
the Unicode-encoded Java String "abc", convert it to WIN1251 format, and
send it to the database server for insertion. Since the database is already
in WIN1251 format, the server does not have to translate. When the string
is read back from the database, JayBird converts it back into the Java VM's
format.
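
Put together, the round trip might look like this (a sketch; it assumes the
WIN1251 database from above, a connection con opened with lc_ctype=WIN1251,
and a table t with one varchar column):

PreparedStatement ins = con.prepareStatement("INSERT INTO t(col) VALUES(?)");
ins.setString(1, "abc"); // JayBird converts the Unicode String to WIN1251
ins.executeUpdate();

Statement sel = con.createStatement();
ResultSet rs = sel.executeQuery("SELECT col FROM t");
rs.next();
String back = rs.getString(1); // JayBird converts the bytes back to Unicode
System.out.println("abc".equals(back)); // prints true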

It is also possible to set an lc_ctype on a connection that is different
from the charset of the database. This lets the database server do the
translating from one charset to another. It is a feature of the Firebird
server that lets programming languages or programs that require a specific
character format connect to the database without requiring the data to be
stored in that format.

You can also avoid problems by using java.sql.PreparedStatement instead of
java.sql.Statement and not building SQL strings out of concatenated Java
strings. For example:

String firstName = "John", lastName = "O'Neal";

String sqlString = "INSERT INTO nameTable (LNAME, FNAME) " +
    "VALUES('" + lastName + "', '" + firstName + "')";

Statement stmt = connection.createStatement();
int insertedRows = stmt.executeUpdate(sqlString);

The problem here is that if the user types in data for these strings you
might end up with illegal characters, or the translation might not be
correct.

In the example above, the following illegal SQL string would be generated,
and the stray apostrophe in O'Neal would cause an exception to be thrown:

INSERT INTO nameTable (LNAME, FNAME) VALUES('O'Neal', 'John')

To avoid this, use a prepared statement like in the example below.

PreparedStatement stmt = connection.prepareStatement(
    "INSERT INTO nameTable(LNAME, FNAME) VALUES(?, ?)");

stmt.setString(1, lastName);  // the driver escapes the apostrophe for you
stmt.setString(2, firstName);

int insertedRows = stmt.executeUpdate();

if (insertedRows != 1)
    throw new MyInsertFailedException("Could not insert data");