Subject | Re: [Firebird-Java] BLOB sub_type 1 and character encoding
---|---
Author | Rick Fincher |
Post date | 2003-12-06T03:25:41Z |
Hi Niki,
This has been confusing to a lot of users, including me. Let me attempt to
explain character handling and maybe that will help.
Those of you more knowledgeable than I am about this please read this and
correct me where I'm wrong. I'll add this to the FAQ.
Character handling with JayBird can be confusing because the Java VM,
Firebird database, browser, and JayBird Connection all have a charset
associated with them.
Also, the Firebird server attempts to transliterate the internal charset of
a database to the charset specified in the connection. JayBird likewise
attempts to translate between the JVM charset and the Firebird charset
specified in the connection.
With Firebird, the character encoding of the text data stored in the database
is set when the database is created. This applies to CHAR and VARCHAR
columns and to sub_type 1 blobs (text blobs). You can override the default
charset for individual columns with the appropriate SQL when the columns are
created. Be careful if you do this, or you may end up with two columns in
the same table that you can't read with the same connection.
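For instance, here is a rough sketch of the kind of DDL involved. The table
and column names are made up, WIN1252 is just an example of an override, and
"connection" is assumed to be an open JDBC connection like the one shown
further down:

Statement ddl = connection.createStatement();
ddl.executeUpdate(
    "CREATE TABLE nameTable (" +
    " LNAME VARCHAR(40)," +                          // database default charset
    " FNAME VARCHAR(40)," +
    " NOTES BLOB SUB_TYPE 1," +                      // text blob, default charset
    " ALT_COL VARCHAR(40) CHARACTER SET WIN1252)");  // per-column override
ddl.close();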
The default charset for Firebird is NONE. The Firebird server does no
translation with this charset.
If your database has a charset of NONE and you set a charset type on the
connection (lc_ctype) that is not NONE, you can write to the database but
you can't read from it without getting the "Cannot transliterate between
character sets" exception.
Let's follow a string as it gets inserted into the database and later
selected from the database. For this example we will set the database
charset to NONE.
Let's say your program or web app prompts the user to type a string. If the
user types "abc" into a Java text box in your app, or into a text box in a
browser for a web app, Java creates a String for those 3 characters in
Unicode. Internally, Java stores strings in UTF-16, so the String actually
occupies 6 bytes: three characters of two bytes each.
INSERTing this into a text column does not store those UTF-16 bytes as-is:
JayBird first converts the String to bytes using the connection's charset
(for NONE, the JVM's default charset), so "abc" typically arrives at the
server as 3 bytes, which are stored without translation.
If you open your connection to the database without specifying a character
set (lc_ctype), it defaults to NONE and no translation is done. So when you
SELECT the previously inserted data from the database and display it in your
program you get the same string you entered, right? Well, not necessarily.
You will get a string containing the same bytes that were stored, but if the
user reading that string from the database has a different default charset in
his Java VM, those bytes will be decoded, and therefore displayed,
differently.
The JVM usually picks up its locale-dependent character encoding from the
underlying operating system, but it can also be set when you invoke the JVM,
for example with -Dfile.encoding=Cp1252. If you attempt to display
characters that aren't in your default JVM encoding, they appear as '?'.
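Here is a tiny self-contained example of that effect; the byte value 0xE9 is
arbitrary, chosen because it maps to different letters in different
single-byte charsets:

public class EncodingDemo {
    public static void main(String[] args) throws Exception {
        // The encoding this JVM picked up from the OS (or from -Dfile.encoding):
        System.out.println(System.getProperty("file.encoding"));
        // The same stored byte decodes to different characters depending
        // on the charset used to read it back:
        byte[] stored = { (byte) 0xE9 };
        System.out.println(new String(stored, "Cp1252"));    // e with acute accent
        System.out.println(new String(stored, "ISO8859_5")); // a Cyrillic letter
    }
}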
The only way to ensure you always get back what you put in is to create the
database with a charset and set lc_ctype to the same charset when you open
the connection to that database.
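In JayBird that looks something like the sketch below. The driver class name
and the lc_ctype property are real; the database path, user, and password are
placeholders you would replace with your own:

import java.sql.Connection;
import java.sql.DriverManager;
import java.util.Properties;

public class ConnectExample {
    public static void main(String[] args) throws Exception {
        Class.forName("org.firebirdsql.jdbc.FBDriver");
        Properties props = new Properties();
        props.setProperty("user", "SYSDBA");       // placeholder credentials
        props.setProperty("password", "masterkey");
        props.setProperty("lc_ctype", "WIN1251");  // same charset the database was created with
        Connection connection = DriverManager.getConnection(
            "jdbc:firebirdsql:localhost/3050:C:/data/test.fdb", props);
        // ... work with the connection ...
        connection.close();
    }
}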
If you want to use charsets other than NONE and you have lots of data in
databases with a charset of NONE, you may have to set up a new database with
a different charset and use a data pump to transfer the data over, or write a
small program to do the transfer.
Using UNICODE_FSS works well everywhere but may increase the size of your
databases if you have lots of text, because UNICODE_FSS stores each character
as up to 3 bytes.
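Since UNICODE_FSS is a UTF-8 style encoding, you can estimate the growth with
a quick check like this (the Cyrillic sample string is arbitrary):

public class SizeDemo {
    public static void main(String[] args) throws Exception {
        // "Privet" -- six Cyrillic characters:
        String s = "\u041F\u0440\u0438\u0432\u0435\u0442";
        System.out.println(s.getBytes("UTF8").length);   // 12 bytes, 2 per character
        System.out.println(s.getBytes("Cp1251").length); // 6 bytes, 1 per character
    }
}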
See the freely available InterBase 6 PDF manual "Data Definition Guide" for a
list of the available charsets.
The WIN125X or ISO8859_1 charsets may be a good choice for you if you need
non-English characters but want the compactness of 1-byte characters. You
can specify many different national language collation orders in the ORDER BY
clause of a SELECT statement.
Let's look at the same example above, but this time we will insert into a
database that has been created with a charset of WIN1251.
When you open the connection to the database you set the lc_ctype=WIN1251.
Then insert the string 'abc' into the appropriate column. JayBird takes the
Unicode-encoded Java String "abc", converts it to WIN1251 format, and sends
it to the database server for insertion. Since the database is already in
WIN1251 format, the server does not have to translate. When the string is
read back from the database, JayBird converts it back to the Java VM's
internal format.
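Reading it back might look like this sketch, reusing the hypothetical
nameTable and the WIN1251 connection from the earlier examples:

Statement q = connection.createStatement();
ResultSet rs = q.executeQuery("SELECT LNAME FROM nameTable");
while (rs.next()) {
    // JayBird decodes the WIN1251 bytes from the server back into a Java String:
    System.out.println(rs.getString(1));
}
rs.close();
q.close();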
It is also possible to set an lc_ctype on a connection that is different
from the charset of the database. This makes the database server translate
from one charset to the other. It is a feature of the Firebird server that
lets programs or programming languages that require a specific character
format connect to the database without requiring the data to be stored in
that format.
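For example, an ISO8859_1 client could read a WIN1251 database by changing
only the connection charset. This is a sketch with the same placeholder URL
and credentials as above; note that characters with no equivalent in the
connection charset will raise the transliteration error:

Properties props = new Properties();
props.setProperty("user", "SYSDBA");
props.setProperty("password", "masterkey");
props.setProperty("lc_ctype", "ISO8859_1");  // differs from the database's WIN1251
Connection isoConnection = DriverManager.getConnection(
    "jdbc:firebirdsql:localhost/3050:C:/data/test.fdb", props);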
You can also avoid problems by using java.sql.PreparedStatement instead of
java.sql.Statement and not building SQL strings out of concatenated Java
strings. For example:
String firstName = "John", lastName = "O'Neal";
String sqlString = "INSERT INTO nameTable (LNAME, FNAME) "
    + "VALUES('" + lastName + "','" + firstName + "')";
Statement stmt = connection.createStatement();
int insertedRows = stmt.executeUpdate(sqlString);
The problem here is that if the user types in data for these strings you
might end up with illegal characters, or the translation might not be
correct.
In the example above, the following illegal SQL string would be generated,
and the unescaped apostrophe in O'Neal would cause an exception to be thrown:
INSERT INTO nameTable (LNAME, FNAME) VALUES('O'Neal', 'John')
To avoid this, use a prepared statement as in the example below.
PreparedStatement stmt = connection.prepareStatement(
    "INSERT INTO nameTable (LNAME, FNAME) VALUES (?, ?)");
stmt.setString(1, lastName);
stmt.setString(2, firstName);
int insertedRows = stmt.executeUpdate();
if (insertedRows != 1)
    throw new MyInsertFailedException("Could not insert data");
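A side benefit: since the driver receives the values as Java Strings instead
of text spliced into the SQL, JayBird applies the connection's charset
conversion to each parameter, and quoting problems like the one above cannot
arise.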
Hope this helps,
Rick
----- Original Message -----
Hello all.
I have many columns with large portions of text.
How can I supply a character encoding for those columns? Right now each
character > 127 is displayed as a square.
Regards
Niki