Subject | Re: [firebird-support] Re: UTF8 problem with FB 2.1.3 |
---|---|
Author | Paul Vinkenoog |
Post date | 2011-01-06T15:27:40Z |
Markus wrote:
Example:
SQL> connect u:\firebird\databases\pbase.fdb;
Database: u:\firebird\databases\pbase.fdb
SQL> create table utable (s varchar(80) character set utf8);
SQL> commit;
SQL> insert into utable values ('Reißbrett');
Statement failed, SQLSTATE = 22000
Malformed string
-- because the connection charset is NONE!
SQL> set names iso8859_1;
SQL> insert into utable values ('Reißbrett');
Statement failed, SQLSTATE = 22000
Malformed string
-- still no luck: should have SET NAMES before connecting
SQL> insert into utable values (_iso8859_1 'Reißbrett');
-- no error here!
SQL> select s from utable;
S
===============================================================================
Rei+íbrett
To understand what's going on here, you need to know the following:
'ß' has character code 223 (= DF, = 11011111) in both ISO8859_1 and WIN1252.
Its Unicode code point is also 00DF.
UTF8 is an encoding which maps every Unicode code point to a sequence of 1 to 4 bytes.
Pure ASCII characters (00-0F, or 00000000-01111111) are all mapped to the exact same single-byte value in UTF8.
Anything above that maps to a sequence of bytes with the following characteristics:
- The first byte starts with bit pattern 110 if the total length is 2 bytes; 1110 if 3 bytes; 11110 if 4 bytes.
- The continuation bytes all start with 10.
- The remainder of the bits is used to store the Unicode value in.
For 'ß', this means that the UTF8 encoding is C3 9F (notice that at first sight, this looks nothing like the Unicode value 00 DF).
Now, if your connection character set is NONE and the column character set is UTF8, Firebird must convert single byte characters to valid UTF8 encodings. As long as you type unaccented ASCII characters, this is simply a matter of copying them through. But as soon as you enter anything else, Firebird has to know how to determine the character code from the byte value. With character set NONE, this isn't possible. But with ISO8859_1, the server knows: "Ah, I receive a DF byte here, and in ISO8859_1 this means 'ß'. OK, I'll convert it to its two-byte UTF8 encoding."
Telling Firebird that the bytes you are sending it are to be interpreted as ISO8859_1 can be done by:
- SET NAMES ISO8859_1 (before connecting; else reconnect)
- Using introduder syntax (_iso8859_1 'Reißbrett')
Now, if you look at the output from the SELECT on the last line, you see that the ß is shown as two bytes. This is because the connection character set is still NONE (I didn't reconnect after the SET NAMES), so all Firebird can do is send you those two bytes. If you reconnect with ISO8859_1 and repeat the SELECT, you'll get the correct output. Without reconnecting, you can get it with:
select cast(s as varchar(80) character set iso8859_1) from utable;
...which is a pain, of course.
Newer Delphi versions are Unicode-enabled, so they should be capable of displaying everything correctly. With IBExpert, I don't know (don't use it). Im pretty sure that you can set the connection charset, but setting it to ISO8859_1 won't help you with Asian characters. And setting it to a multi-byte charset (like UTF8) won't help if IBExpert doesn't have built-in support for it.
Cheers,
Paul Vinkenoog
> the set name method doesn't work,Sorry, I should have told you that SET NAMES must be issued before you make the connection. And setting it to UTF8 only makes sense if your terminal does indeed produce correct UTF8 character codes. After all, ISQL is a text terminal application; I don't suppose it has any built-in codepage or Unicode magic (I may be wrong though - I'm not a hard-core isql user).
Example:
SQL> connect u:\firebird\databases\pbase.fdb;
Database: u:\firebird\databases\pbase.fdb
SQL> create table utable (s varchar(80) character set utf8);
SQL> commit;
SQL> insert into utable values ('Reißbrett');
Statement failed, SQLSTATE = 22000
Malformed string
-- because the connection charset is NONE!
SQL> set names iso8859_1;
SQL> insert into utable values ('Reißbrett');
Statement failed, SQLSTATE = 22000
Malformed string
-- still no luck: should have SET NAMES before connecting
SQL> insert into utable values (_iso8859_1 'Reißbrett');
-- no error here!
SQL> select s from utable;
S
===============================================================================
Rei+íbrett
To understand what's going on here, you need to know the following:
'ß' has character code 223 (= DF, = 11011111) in both ISO8859_1 and WIN1252.
Its Unicode code point is also 00DF.
UTF8 is an encoding which maps every Unicode code point to a sequence of 1 to 4 bytes.
Pure ASCII characters (00-0F, or 00000000-01111111) are all mapped to the exact same single-byte value in UTF8.
Anything above that maps to a sequence of bytes with the following characteristics:
- The first byte starts with bit pattern 110 if the total length is 2 bytes; 1110 if 3 bytes; 11110 if 4 bytes.
- The continuation bytes all start with 10.
- The remainder of the bits is used to store the Unicode value in.
For 'ß', this means that the UTF8 encoding is C3 9F (notice that at first sight, this looks nothing like the Unicode value 00 DF).
Now, if your connection character set is NONE and the column character set is UTF8, Firebird must convert single byte characters to valid UTF8 encodings. As long as you type unaccented ASCII characters, this is simply a matter of copying them through. But as soon as you enter anything else, Firebird has to know how to determine the character code from the byte value. With character set NONE, this isn't possible. But with ISO8859_1, the server knows: "Ah, I receive a DF byte here, and in ISO8859_1 this means 'ß'. OK, I'll convert it to its two-byte UTF8 encoding."
Telling Firebird that the bytes you are sending it are to be interpreted as ISO8859_1 can be done by:
- SET NAMES ISO8859_1 (before connecting; else reconnect)
- Using introduder syntax (_iso8859_1 'Reißbrett')
Now, if you look at the output from the SELECT on the last line, you see that the ß is shown as two bytes. This is because the connection character set is still NONE (I didn't reconnect after the SET NAMES), so all Firebird can do is send you those two bytes. If you reconnect with ISO8859_1 and repeat the SELECT, you'll get the correct output. Without reconnecting, you can get it with:
select cast(s as varchar(80) character set iso8859_1) from utable;
...which is a pain, of course.
> the type casting one sort of works but when I view the values afterwards using IBExpert personal edition the german umlaut is several chars etc.Like isql with connection charset NONE, IBExpert also shows you the two composing bytes. I suppose that you can set the connection character set to ISO8859_1. Maybe IBExpert even supports UTF8 (I mean *really* supports it, unlike isql, afaik).
> [IBExpert] sort of shows German Umlauts but Äpfel becomes Ä,pfel and chinese/asian characters show up just plain wrong there.If I'm not mistaken, Arial Unicode MS uses UTF16, so this might mess up a UTF8 stream pretty nicely ;-)
> Changing its display font from Tahoma to Arial Unicode MS makes it even worse. So this tool as well as iSQL have to be treated very carefully.
Newer Delphi versions are Unicode-enabled, so they should be capable of displaying everything correctly. With IBExpert, I don't know (don't use it). Im pretty sure that you can set the connection charset, but setting it to ISO8859_1 won't help you with Asian characters. And setting it to a multi-byte charset (like UTF8) won't help if IBExpert doesn't have built-in support for it.
Cheers,
Paul Vinkenoog