Subject Re: BLOB sub_type 1 and character encoding
Author Roman Rokytskyy
Rick,

> Those of you more knowledgeable than I am about this please read
> this and correct me where I'm wrong. I'll add this to the FAQ.

Excellent explanation! I have just a few comments to make the picture
more complete.

> With Firebird the character encoding of the text data stored in the
> database is set when the database is created. That applies to the
> char and varchar columns, and type 1 blobs (text blobs). You can
> override the default charset for columns with the appropriate SQL
> commands when the columns are created.

> Be careful if you do this or
> you may end up with two columns in the same table that you can't
> read with the same connection.

The only situation I can think of where this can happen is when you
have a table with columns that use the "NONE" character set next to
columns that use some other character set ("UNICODE_FSS", "WIN1252",
etc.). As Rick described, the server tries to convert characters from
the encoding specified for the column into the encoding specified for
the connection. The "NONE" character set allows only one-way
conversion: from <any> to "NONE". In that case the server simply
returns the bytes exactly as they were written to the database. So if
you have the table

CREATE TABLE charset_table(
col1 VARCHAR(10) CHARACTER SET WIN1252,
col2 VARCHAR(10) CHARACTER SET NONE
)

you will not be able to modify both columns in the same SQL statement,
and it does not matter whether you use "NONE", "WIN1252" or
"UNICODE_FSS" for the connection.

The only way to solve this problem is to use the "OCTETS" character
set. It is a kind of artificial character set, similar to "NONE" (data
are written and read as byte arrays), but bi-directional translation
rules exist between "OCTETS" and every other character set (including
"NONE"). You can specify "OCTETS" for the connection and then decode
the byte arrays you receive from the server yourself; the driver
cannot do the byte-array-to-string conversion correctly for you, since
it gets no hint about the character set from the server.
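
A rough sketch of that approach; the connection details are again
placeholders, and the per-column encodings in the comments are only
assumptions about what was originally stored in each column:

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;
import java.util.Properties;

public class OctetsDecodeDemo {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.setProperty("user", "SYSDBA");
        props.setProperty("password", "masterkey");
        // ask the server for raw bytes
        props.setProperty("encoding", "OCTETS");

        try (Connection con = DriverManager.getConnection(
                 "jdbc:firebirdsql://localhost:3050/employee.fdb", props);
             Statement stmt = con.createStatement();
             ResultSet rs = stmt.executeQuery(
                 "SELECT col1, col2 FROM charset_table")) {
            while (rs.next()) {
                // decode each column yourself, using the encoding you
                // know (or assume) its data was written in
                String col1 = new String(rs.getBytes(1), "windows-1252");
                String col2 = new String(rs.getBytes(2), "windows-1251");
                System.out.println(col1 + " | " + col2);
            }
        }
    }
}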

> Let's say your program or web app prompts the user to type a string.
> If the user types "abc" into a Java text box in your app, or into a
> text box in a browser for a web app, Java creates a string for that
> 3 character string in Unicode. So the string would actually have 9
> bytes in it- three Unicode characters of three bytes each.

Just a note, as we have had this topic here before: you have to pass a
correct Unicode string to the driver. What is a "correct Unicode"
string? It is easier to explain what is not a correct Unicode string.

Let's assume you have a normal text file in WIN1251 encoding. In this
case the cyrillic characters from the Unicode table (values between 0
and 65535) are mapped to characters with values 0-255. However, your
regional settings say that you are in Germany, so file.encoding will
be set to Cp1252 on JVM start. If you now open the file and construct
a reader without specifying that the character encoding is Cp1251,
Java will read the file and construct strings for you, but all
cyrillic characters will be replaced by the Cp1252 characters that
have the same numeric values as the cyrillic ones. These strings are
valid Unicode strings, but their content is not the content of the
file. Interestingly enough, if you write such strings back into a file
and open it in a text editor, telling the editor that this is WIN1251
text, you will see the correct text.
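
A small sketch of this effect; "russian.txt" is just a placeholder for
a text file saved in WIN1251 (Cp1251):

import java.io.BufferedReader;
import java.io.FileInputStream;
import java.io.InputStreamReader;

public class FileEncodingDemo {
    public static void main(String[] args) throws Exception {
        // wrong on a Cp1252 (German) JVM: the platform default
        // file.encoding is used, so cyrillic bytes come back as the
        // Cp1252 characters with the same numeric values
        try (BufferedReader wrong = new BufferedReader(
                 new InputStreamReader(new FileInputStream("russian.txt")))) {
            System.out.println("default encoding: " + wrong.readLine());
        }

        // correct: tell the reader the file is Cp1251, so the bytes
        // are mapped to the intended cyrillic characters
        try (BufferedReader right = new BufferedReader(
                 new InputStreamReader(new FileInputStream("russian.txt"),
                                       "Cp1251"))) {
            System.out.println("Cp1251: " + right.readLine());
        }
    }
}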

> Using UNICODE_FSS works well everywhere but may increase the size of
> your databases if you have lots of text because Unicode uses
> characters up to 3 bytes long.

There are some limitations regarding the UNICODE_FSS character set:
there is only one collation, in which strings are sorted in their
natural order, with no collation rules for different languages, and
there are some issues when converting strings to upper case, etc. More
information on these anomalies can be found in the Firebird-Support
group.

Best regards,
Roman Rokytskyy