Subject Re: [Firebird-Architect] Re: UTF-8 (various)
Author Jim Starkey
johnson_dave2003 wrote:

>Let me flip the question around - does SQL2003 support a keyword in
>DML for specifying character set? If not, is this sufficiently useful
>that it is worthwhile extending from the standard? Or, would we be
>better served by setting a property in the client so that it appears
>that the client is doing the character conversion, regardless of
>whether the character conversion occurs at the server or client?
>
I don't think it belongs in the DML but in the session. But in any
case, a default collation is necessary to correctly interpret the
semantics of a DML statement. Whether it is in the session, the DML, or
both is a question of taste. I have an opinion, certainly, but where
and how the default collation is specified isn't semantically significant.

>I am in favor of adhering to the standard as closely as possible, and
>I am leery of extending the standard too far without a really good reason.
>
>
We should figure out a) what the right thing to do is, b) what the standard
says, and c) what makes the most sense for Firebird. So far we've
managed to avoid adopting the SQL API....

>>The current mechanism uses a dpb parameter to control character set and
>>collation. This should be preserved to communicate and establish a
>>session default, but I think we may want to define explicit calls to
>>change it at runtime.
>>
>>The default collation should probably be managed the same way -- system
>>level default that can be changed at runtime. Unlike character set, the
>>default collation is an essential property of a query and can't be
>>changed as the query is prepared.
>>
>>
>>
>The session's default collation property will be exposed for
>application manipulation before the session is established, but is
>immutable from the time that the session is opening until the session
>is closed.
>
I think that's too restrictive. A collation ultimately is a parameter
required to compile a statement. Whether it is an immutable session
property, a changeable session property, or an explicit parameter to
statement compilation is the question on the table.
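
To make the three options concrete, here is a minimal sketch in Python. It is not Firebird code: the Session and PreparedStatement classes and their methods are hypothetical names invented for illustration. It assumes the "changeable session property plus explicit compile parameter" variant, where the collation is captured when the statement is prepared and does not move afterwards.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class PreparedStatement:
    """A compiled statement; its collation is fixed at compile time."""
    sql: str
    collation: str


class Session:
    """Hypothetical session holding a changeable default collation."""

    def __init__(self, default_collation: str = "BINARY") -> None:
        self.default_collation = default_collation

    def set_default_collation(self, name: str) -> None:
        # Only statements prepared after this call see the new default.
        self.default_collation = name

    def prepare(self, sql: str, collation: Optional[str] = None) -> PreparedStatement:
        # An explicit argument wins; otherwise snapshot the session default.
        return PreparedStatement(sql, collation or self.default_collation)


session = Session(default_collation="EN_US")
first = session.prepare("SELECT name FROM people ORDER BY name")

session.set_default_collation("FR_FR")
second = session.prepare("SELECT name FROM people ORDER BY name")
third = session.prepare("SELECT name FROM people ORDER BY name", collation="BINARY")

# The already-prepared statement keeps the collation it was compiled with.
print(first.collation, second.collation, third.collation)
```

Making the session property immutable would amount to dropping set_default_collation; the compile step itself would not change.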

>SQL 2003 supports the optional COLLATION clause in defining indexes
>and order by statements. If no COLLATION clause is supplied in the
>query, use the session's defaults. If the COLLATION clause is
>supplied, use it. But you can't change collations in mid stream.
>That is to say, within a single query you can't sort by one column in
>EN_US and the neighboring column in FR_FR, or sort different parts of
>a complex query in different collations.
>
Collation is also required to evaluate greater than / less than. The
collation that a program is expecting is a great deal more important
than how a particular index was defined. It boils down to the
fundamental question: Which is more important, going fast or getting the
right answer. If we pick the former, all queries can return 13.
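
A small Python sketch of why a plain comparison depends on collation. The two "collations" here are stand-ins chosen so the example runs anywhere (raw UTF-8 byte order and a simple case-insensitive order); they are assumptions for illustration, not Firebird collations.

```python
from typing import Callable, Dict


def binary_key(s: str) -> bytes:
    # Binary collation: compare the encoded bytes as they are.
    return s.encode("utf-8")


def case_insensitive_key(s: str) -> str:
    # Stand-in for a linguistic collation: ignore letter case.
    return s.casefold()


# Hypothetical collation names mapped to their sort-key functions.
COLLATIONS: Dict[str, Callable[[str], object]] = {
    "BINARY": binary_key,
    "CASE_INSENSITIVE": case_insensitive_key,
}


def less_than(a: str, b: str, collation: str) -> bool:
    """Evaluate a < b under the named collation."""
    key = COLLATIONS[collation]
    return key(a) < key(b)


names = ["cherry", "apple", "Banana"]
for collation in COLLATIONS:
    print(collation,
          less_than("apple", "Banana", collation),
          sorted(names, key=COLLATIONS[collation]))
```

The same predicate on the same two strings gives opposite answers under the two orderings, so a result computed under the wrong collation is wrong, however quickly it was produced.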

>
>
>
>
>
>>>If the application needs a character set other than the one specified
>>>in the DDL, then the client should be able to override the ISO-8859-1
>>>conversion and use the appropriate conversion instead. In my
>>>position, I often need to post data to or from an IBM mainframe in
>>>CP500 EBCDIC character set with EN_US collation. Having the character
>>>set interpreted as a suggestion for the client to handle the string
>>>data rather than as a mandatory storage format eliminates a host of
>>>character conversions and effectively isolates the character set from
>>>the collation.
>>>
>>>
>>>
>>>
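
A rough Python illustration of the conversion being described, assuming UTF-8 as the single internal representation (still an open question in this thread). The function names to_internal and from_internal are hypothetical; cp500 is the name of Python's standard codec for CP500 EBCDIC.

```python
def to_internal(raw: bytes, client_charset: str) -> bytes:
    """Convert bytes in the client's character set to the internal UTF-8 form."""
    return raw.decode(client_charset).encode("utf-8")


def from_internal(stored: bytes, client_charset: str) -> bytes:
    """Convert internal UTF-8 bytes back to the client's character set."""
    return stored.decode("utf-8").encode(client_charset)


# Data as it might arrive from the mainframe, in CP500 EBCDIC.
mainframe_bytes = "HELLO".encode("cp500")

# One conversion at the boundary; everything inside works on UTF-8.
stored = to_internal(mainframe_bytes, "cp500")

# A different client can ask for a different character set for the same data.
latin1_bytes = from_internal(stored, "iso-8859-1")
round_trip = from_internal(stored, "cp500")

print(mainframe_bytes.hex(), stored, latin1_bytes, round_trip == mainframe_bytes)
```

Collation never appears here: the character set only governs the bytes exchanged with the client, which is the separation being argued for.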
>>I'm not wild about relying on the database declaration to control or
>>even influence character set -- if the character set in the database is
>>changed, the program will run but with very "funny" results.
>>
>>
>
>I like the way your thoughts are moving ... but won't it break
>existing Firebird/InterBase database apps to take the character set
>out of the database? The intent of my suggestion was to mimic the
>behavior of the current system from the perspective of the
>application, while allowing the engine full latitude to move into a
>single internal representation paradigm.
>
Let's be clear about context. The existing XSQLDA based interface must
preserve current semantics. I'm thinking in terms of a new interface that
must simultaneously support faithful emulation of our legacy interface
and future "civilized" interfaces.

>
>
>Of course, if you have another mechanism in mind for not breaking
>existing apps that I have missed, or you aren't concerned about
>breaking existing apps, then I have no objections. If you choose
>UTF-8 as the internal representation, then most of my recent and
>upcoming work would require no character set conversion.
>
I am an advocate of supporting indefinitely all of the dumb things we've
done in the past as well as the really clever things we'll think of in
the future. I want to freeze the XSQLDA DSQL interface as a legacy
interface and work towards something that we can live with and go forward with.

>>>I believe that the simplification in code that this model implies
>>>would allow better tuning and maintenance, and will more than make up
>>>for the little bit of overhead associated with carrying multi-byte
>>>characters across the wire. If nothing else, it will allow ample time
>>>for incorporation of gzip or lzh as an option into the communications
>>>channel.
>>>
>>>
>>>
>>>
>>Not a fair argument. Compression of the wire protocol would benefit the
>>current polyglot scheme as well.
>>
>>
>
>The point was that the code simplification implicit with a single
>internal representation and separation of character set from collation
>buys the time to actually code the compression of the wire protocol.
>I realize that the current wire protocol may benefit even more by
>compression since it is so much more chatty than necessary. I
>apologize for being unclear.
>
I'm more concerned with the number of round trips than the number of
bytes. That said, there isn't a great deal that can be done to reduce
the number of round trips other than blocking records. But I'm open to
suggestions. Folks, this is the time to get creative. What can we do
in the API to support a wire protocol with fewer round trips?
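
For a sense of scale, here is a toy Python model of record blocking. FakeServer and fetch_all are hypothetical and model no real wire protocol; the sketch only counts how many request/response exchanges it takes to drain a result set at different block sizes.

```python
from typing import List, Tuple


class FakeServer:
    """Stand-in for a remote server; each fetch() call is one round trip."""

    def __init__(self, rows: List[int]) -> None:
        self.rows = rows
        self.round_trips = 0

    def fetch(self, offset: int, count: int) -> Tuple[List[int], bool]:
        self.round_trips += 1
        block = self.rows[offset:offset + count]
        more = offset + count < len(self.rows)  # tell the client whether to ask again
        return block, more


def fetch_all(server: FakeServer, block_size: int) -> List[int]:
    """Drain the result set, asking for block_size records per round trip."""
    out: List[int] = []
    more = True
    while more:
        block, more = server.fetch(len(out), block_size)
        out.extend(block)
    return out


rows = list(range(1000))
for block_size in (1, 100):
    server = FakeServer(rows)
    fetched = fetch_all(server, block_size)
    print(block_size, len(fetched), server.round_trips)
```

Blocking cuts the fetch phase to roughly rows / block_size exchanges, but it does nothing for the attach, prepare, and execute exchanges, which is where API-level ideas would have to come in.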

>>>I see collation rules being handled as follows, with the caveat that
>>>this is very abstract, not all levels may apply to Firebird, and not
>>>all levels of Firebird may be properly represented:
>>>
>>>1. Collation specified in query DML
>>>2. Collation specified on index DDL
>>>3. Collation specified on table DDL
>>>4. Collation specified on schema
>>>5. Collation specified on database
>>>6. Collation specified in engine configuration
>>>7. If no other collation is specified, default to binary
>>>
>>>
>>>
>>>
>>>
>>>
>>I'm quite happy with a shorter and simpler list:
>>
>> 1. Collation specified in query DML
>> 2. Collation declared as default in the session.
>>
>>This along with an intelligent default mechanism makes getting the
>>obviously correct results very easy.
>>
>>
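
Both lists above describe the same lookup: walk the levels in order and take the first collation that is actually set. A minimal Python sketch, assuming a hypothetical resolve_collation helper and borrowing "binary" from the longer list as the last-resort fallback (the short list leaves its default mechanism unspecified):

```python
from typing import Optional, Sequence, Tuple

Level = Tuple[str, Optional[str]]  # (where it was specified, collation or None)


def resolve_collation(levels: Sequence[Level], fallback: str = "BINARY") -> Tuple[str, str]:
    """Return (source, collation) for the first level that specifies one."""
    for source, collation in levels:
        if collation is not None:
            return source, collation
    return "fallback", fallback


# Short list: query DML first, then the session default.
print(resolve_collation([("query", None), ("session", "EN_US")]))
print(resolve_collation([("query", "FR_FR"), ("session", "EN_US")]))

# Long list: same walk, more places to look before giving up.
long_list = [
    ("query", None), ("index", None), ("table", None),
    ("schema", None), ("database", "EN_US"), ("engine", "DE_DE"),
]
print(resolve_collation(long_list))
print(resolve_collation([("query", None), ("session", None)]))
```

The shorter list does not change the mechanism, only the number of places a reader has to check to predict which collation a statement will use.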
>
>I like the way you think. Your suggested model is simple and reliable.
>
Gosh, Mom, I like your new email alias...