Subject Re: [Firebird-Architect] Re: UTF-8 (various)
Author Jim Starkey
johnson_dave2003 wrote:

>If you specify a column to be ISO-8859-1 (for example), then the
>metadata returned to the client should indicate that a conversion to
>8859-1 is desired in the absence of other instructions. The client,
>unless overridden by conversion explicitly defined by the application
>or environment, should honor the request when the column data is
>retrieved.
>
>
Here's the question: Should data be returned in the character set
declared in the database or a character set declared by the client? The
prevailing theory and practice for other data operations is that the
client asks for data how it wants it and the database system converts it
to that form from whatever way it was stored, if possible. It makes
sense to me that this model should prevail for character set as it would
for data type or scale. The client should declare how it wants it, and
the database system should either deliver it that way or come up with a
damn good explanation why not. I'm not prepared to say how the client
should specify it, but there should probably be a system define defined
in a configuration file that can be overruled by a runtime time. It
probably makes sense to provide a variant on getString that would allow
single string to be fetched multiple times in with different character
sets.

The current mechanism uses a dpb parameter to control character set and
collation. This should be preserved to communicate and establish a
session default, but I think we may want to define explicit calls to
change it at runtime.

The default collation should probably be managed the same way -- system
level default that can be changed at runtime. Unlike character set, the
default collation is an essential property of a query and can't be
changed as the query is prepared.


>If the application needs a character set other than the one specified
>in the DDL, then the client should be able override the ISO-8859-1
>conversion and use the appropriate conversion instead. In my
>position, I often need to post data to or from an IBM mainframe in
>CP500 EBCDIC character set with EN_US collation. Having the character
>set interpreted as a suggestion for the client to handle the string
>data rather than as a mandatory storage format eliminates a host of
>character conversions and effectively isolates the character set from
>the collation.
>
>
I'm not wild about relying on the database declaration to control or
even influence character set -- if the character set in the database is
changed, the program will run but with very "funny" results.

>Handled this way there is only ever one character set conversion - at
>the client, in accordance with the client application's actual need of
>the moment.
>
>
Yes.

>I believe that the simplification in code that this model implies
>would allow better tuning and maintenance, and will more than make up
>for the little bit of overhead associated with carrying multi-byte
>characters across the wire. If nothing else, it will allow ample time
>for incorporation of gzip or lzh as an option into the communications
>channel.
>
>
Not a fair argument. Compression of the wire protocol would benefit the
current polyglot scheme as well.

>Since the engine needs to be able to respond to requests of the form
>CREATE INDEX ... COLLATION XX_XX, and SELECT ... ORDER BY ...
>COLLATION XX_XX, the engine needs to have the mechanism for
>understanding collations, and collations are generally named in terms
>of the locale that they apply to.
>
>
It seem reasonable that an index should inherit the default collation of
the session in which it was declared. I believe this is a departure
from current practice, but would allow seamless support for a
non-English locale.

>I see collation rules being handled as follows, with the caveat that
>this is very abstract, not all levels may apply to Firebird, and not
>all levels of Firebird may be properly represented:
>
>1. Collation specified in query DML
>2. Collation specified on index DDL
>3. Collation specified on table DDL
>4. Collation specified on schema
>5. Collation specified on database
>6. Collation specified in engine configuration
>7. If no other collation is specified, default to binary
>
>
>
>
I'm happy quite happy with a shorter and simpler list:

1. Collation specified in query DML
2. Collation declared as default in the session.

This along with an intelligent default mechanism makes getting the
obviously correct results very easy.


--

Jim Starkey
Netfrastructure, Inc.
978 526-1376