firebird-architect - Re: UTF-8 (various)

Subject	Re: UTF-8 (various)
Author	johnson_dave2003
Post date	2005-03-04T03:10:26Z

--- In Firebird-Architect@yahoogroups.com, Jim Starkey <jas@n...> wrote:

> Here's the question: Should data be returned in the character set
> declared in the database or a character set declared by the client?

The

> prevailing theory and practice for other data operations is that the
> client asks for data how it wants it and the database system

converts it

> to that form from whatever way it was stored, if possible. It makes
> sense to me that this model should prevail for character set as it

would

> for data type or scale. The client should declare how it wants it, and
> the database system should either deliver it that way or come up with a
> damn good explanation why not. I'm not prepared to say how the client
> should specify it, but there should probably be a system define defined
> in a configuration file that can be overruled by a runtime time. It
> probably makes sense to provide a variant on getString that would allow
> single string to be fetched multiple times in with different character
> sets.

I don't think that multiple character sets belong in the engine - it's
nearly a religious thing. But, so long as the conversion is
transparent to the application and only happens once, in accordance
with the application's need of the moment, I can certainly set the
question of which side of the wire it occurs on aside as a place where
refactoring can be done at a later date, if required, without serious
impacts.

Under most circumstances, any differences would be marginal but
functionality would appear exactly the same to the application.

Let me flip the question around - does SQL2003 support a keyword in
DML for specifying character set? If not, is this sufficiently useful
that it is worthwhile extending from the standard? Or, would we be
better served by setting a property in the client so that it appears
that the client is doing the character conversion, regardless of
whether the character conversion occurs at the server or client?

I am in favor of adhering to the standard as closely as possible, and
I am leery of extending the standard too far without a really good reason.

>
> The current mechanism uses a dpb parameter to control character set and
> collation. This should be preserved to communicate and establish a
> session default, but I think we may want to define explicit calls to
> change it at runtime.
>
> The default collation should probably be managed the same way -- system
> level default that can be changed at runtime. Unlike character set,

the

> default collation is an essential property of a query and can't be
> changed as the query is prepared.
>

I think I understand.

The session's default collation property will be exposed for
application manipulation before the session is established, but is
immutable from the time that the session is opening until the session
is closed.

SQL 2003 supports the optional COLLATION clause in defining indexes
and order by statements. If no COLLATION clause is supplied in the
query, use the session's defaults. If the COLLATION clause is
supplied, use it. But you can't change collations in mid stream.
That is to say, within a single query you can't sort by one column in
EN_US and the neighboring column in FR_FR, or sort different parts of
a complex query in different collations.

>
> >If the application needs a character set other than the one specified
> >in the DDL, then the client should be able override the ISO-8859-1
> >conversion and use the appropriate conversion instead. In my
> >position, I often need to post data to or from an IBM mainframe in
> >CP500 EBCDIC character set with EN_US collation. Having the character
> >set interpreted as a suggestion for the client to handle the string
> >data rather than as a mandatory storage format eliminates a host of
> >character conversions and effectively isolates the character set from
> >the collation.
> >
> >
> I'm not wild about relying on the database declaration to control or
> even influence character set -- if the character set in the database is
> changed, the program will run but with very "funny" results.

I like the way your thoughts are moving ... but won't it break
existing firebird/interbase database apps to take the character set
out of the database? The intent of my suggestion was to mimic the
behavior of the current system from the perspective of the
application, while allowing the engine full latitude to move into a
single internal representation paradigm.

Of course, if you have another mecahnism in mind for not breaking
existing apps that I have missed, or you aren't concerned about
breaking existing apps, then I have no objections. If you choose
UTF-8 as the internal representation, then most of my recent and
upcoming work would require no character set conversion.

>
> >Handled this way there is only ever one character set conversion - at
> >the client, in accordance with the client application's actual need of
> >the moment.
> >
> >
> Yes.
>
> >I believe that the simplification in code that this model implies
> >would allow better tuning and maintenance, and will more than make up
> >for the little bit of overhead associated with carrying multi-byte
> >characters across the wire. If nothing else, it will allow ample time
> >for incorporation of gzip or lzh as an option into the communications
> >channel.
> >
> >
> Not a fair argument. Compression of the wire protocol would benefit

the

> current polyglot scheme as well.

The point was that the code simplification implicit with a single
internal representation and separation of character set from collation
buys the time to actually code the compression of the wire protocol.
I realize that the current wire protocol may benefit even more by
compression since it is so much more chatty than necessary. I
apologize for being unclear.

>
> >Since the engine needs to be able to respond to requests of the form
> >CREATE INDEX ... COLLATION XX_XX, and SELECT ... ORDER BY ...
> >COLLATION XX_XX, the engine needs to have the mechanism for
> >understanding collations, and collations are generally named in terms
> >of the locale that they apply to.
> >
> >
> It seem reasonable that an index should inherit the default

collation of

> the session in which it was declared. I believe this is a departure
> from current practice, but would allow seamless support for a
> non-English locale.

I concur, I think. Expressed another way, collation should default to
the collation sequence of the session, with support for COLLATION
clauses in accordance with SQL specification.

>
> >I see collation rules being handled as follows, with the caveat that
> >this is very abstract, not all levels may apply to Firebird, and not
> >all levels of Firebird may be properly represented:
> >
> >1. Collation specified in query DML
> >2. Collation specified on index DDL
> >3. Collation specified on table DDL
> >4. Collation specified on schema
> >5. Collation specified on database
> >6. Collation specified in engine configuration
> >7. If no other collation is specified, default to binary
> >
> >
> >
> >
> I'm happy quite happy with a shorter and simpler list:
>
> 1. Collation specified in query DML
> 2. Collation declared as default in the session.
>
> This along with an intelligent default mechanism makes getting the
> obviously correct results very easy.

I like the way you think. Your suggested model is simple and reliable.