Subject Re: [firebird-support] understanding characters sets
Author Adriano dos Santos Fernandes
Helen Borrie escreveu:
>
> A.
> Strings are written to the database using the defined character set, which will be either the default character set defined for the database or, if present, the character set defined for the column they are written to.

The default character set is used *only* in DDL commands to set
character set of columns not explicit specified.

> For string data to be transliterated correctly for both writing and reading, the connection character set must be the same as the destination character set.
>
The connection charset exist just to transliterate correctly data from
client to others places (table columns, procedure parameters, etc) when
client and column charset are not equal. If they're equal, NONE will
also works.

> A string is "well-formed" if every byte represents unambiguously a character or part of a character in the character set defined for the data. The character encoding conventions adopted for Firebird and enforced in v.2.1 will detect anomalies in strings and throw the "Malformed string" exception.
>
> If legacy data was written to a non-ASCII character set with NONE specified as the connection character set (the default if it is not specified) then the data would have been written as raw bytes. Under the enforced well-formedness conditions of Firebird 2.1, querying that data causes Firebird to detect the anomalies and throw the exception.
>
> Q.
> How does transliteration work?
> A.
>
> Data coming through the connection behaves just like a column in the database that has been defined to use a particular character set. Thus, the assignment of any constant value coming through the connection to a database column is virtually the same as the assignment of a column value to a another column within the database.
>
> The NONE character set is just raw bytes. OCTETS -- which is not available as a connection character set -- is also just raw bytes. The difference between NONE and OCTETS is only in the character that is used to represent a "blank": for OCTETS it is ASCII Null (0x00, CHR(0)) while for NONE it is ASCII Space (0x20, CHR(32)).
>
> Assignment of a string, whether from a connection "column" or a database column, occurs as follows:
>
> -- from a column value of character set NONE or OCTETS to a column that is defined with a character set that is not NONE or OCTETS undergoes a well-formedness check but is not transliterated. Firebird 2.1 will reject the value if it is not well-formed for the destination character set.
>
Additionally, well formed check will (when necessary, for example with
UTF8) be done when transliteration is not done, example when client
charset is UTF8 and column uses UTF8.


> -- from a specific character set to another specific character set, if the string is well-formed then it is transliterated to the destination character set.

The transliteration is already a kind of well-formed check. If
transliteration is needed but data is malformed, a transliteration error
will be raised.

> The slight difference here is that, if the source of the string is data already stored in the database, it is assumed to be well-formed; whereas data sourced from a "connection column", i.e., a client, is always validated.
>
> Q.
> I have a problem inserting the "eszet" character 'ß' in a database created from this script written with a text editor:
>
> create database 'c:\fb21_malformed_string.fdb'
> user 'sysdba' password 'masterkey'
> DEFAULT CHARACTER SET UTF8;
>
> create table t1 (
> id integer not null primary key,
> c1 varchar(20)
> );
>
> commit;
>
> insert into t1 (id, c1) values (1, 'ß');
>
> commit;
>
> In an isql session I try to run the script:
>
> C:\Program Files\Firebird\Firebird_210_RC2\bin>isql
> Use CONNECT or CREATE DATABASE to specify a database
> SQL> input c:\fb21_malformed_string.sql;
> Statement failed, SQLCODE = -104
> Malformed string
> After line 8 in file c:\fb21_malformed_string.sql
> SQL> show database;
> Database: c:\fb21_malformed_string.fdb
> Owner: SYSDBA
> PAGE_SIZE 4096
> Number of DB pages allocated = 196
> Sweep interval = 20000
> Forced Writes are ON
> Transaction - oldest = 6
> Transaction - oldest active = 7
> Transaction - oldest snapshot = 7
> Transaction - Next = 8
> ODS = 11.1
> Default Character set: UTF8
> SQL> exit;
>
> If the same database has been created with ISO8859_1 as its default character set, it works.

But may be still incorrect, as ISO8859_1 accepts everything.

Am I doing something wrong with regard to UFT8?
>
> A.
> Any string data in an input script must be in the character set of the target column.

When using NONE as connection charset.

> In this case, your keyboard input for creating the script would be in whatever character set your computer is set up for. As you are on Windows, your script editor would most likely have saved the script in ANSI encoding. When the engine came to process the INSERT statement, it recognised the 'ß' entry as a malformed string because its encoding was not UTF8.
>
> You would have had a similar problem trying pass this character interactively in your isql session, since you didn't start isql with the -ch switch, nor set the session's character with SET NAMES. Since the default character set for a connection is NONE, any input from your keyboard (probably ISO8859_1 or WIN1252), where the "eszet" character is single-byte, could not be well-formed UTF8, where it is two bytes with completely different encoding.
> Constants in DML scripts may not be the only source of malformed strings. Don't overlook the possibility that you might encounter an encoding problem with text in DDL scripts. For example, arguments for the COMMENT statement, embedded comments and text constants in PSQL modules, text constants in CHECK constraints, and so on, will all need to be well-formed now.
>
> Tip: Not all third-party script tools for Firebird provide a capability to save files in UTF8 encoding. On Windows, the text editor notepad.exe does give that option. To be certain, open a new notepad file and save it in UTF8 before creating any text.
>
> Q.
> My ODS 10 database had UTF8 as its default character set. I backed up my database under Fb 2.0 and restored it under Fb 2.1. I ran the metadata script as directed in the release notes and I can connect to the database. However, I am getting the "Malformed string" error when I try to select from tables containing columns that are text BLOBs. That didn't happen under Fb 2.0. What's going on here?
>
This problem is not related to default charset.


Adriano