firebird-architect - Re: UTF-8 (various)

Subject	Re: UTF-8 (various)
Author	johnson_dave2003
Post date	2005-03-05T02:37:24Z

--- In Firebird-Architect@yahoogroups.com, Daniel Rail <daniel@a...>
wrote:

> Hello johnson_dave2003,
>
> Friday, March 4, 2005, 3:28:05 PM, you wrote:
>
> > Firebird/Interbase SQL supports the CHARSET keyword, but SQL2003 does
> > not have this as a standard feature.
>
> Here's what is directly from SQL-2003 standard:
>
> <data type> ::=
>     <predefined type>
>   | <row type>
>   | <path-resolved user-defined type name>
>   | <reference type>
>   | <collection type>
>
> <predefined type> ::=
>     <character string type> [ CHARACTER SET <character set specification> ]
>         [ <collate clause> ]
>   | <national character string type> [ <collate clause> ]
>   | <binary large object string type>
>   | <numeric type>
>   | <boolean type>
>   | <datetime type>
>   | <interval type>
>

OOPS! Too much reliance on O'Reilly books and not enough attention to
the original source. I plead the 5th and too much coffee time.

> And, <data type> is the data type that you specify for a column.
> Also, Firebird supports feature F461 "Named character sets", which
> means that character sets are supported using the character set DDL
> and DML syntax. One thing that is mentioned in the SQL-2003 in
> regards to the character sets, is that it is more or less
> implementation defined as to how it works internally.
>
> > The entire intent of this approach is to remove the concept
> > of "character sets" as a plural from the server engine entirely. If
> > this concept is ultimately accepted, characters would be stored as
> > UTF-8 (period), and the retrieval process (whether on the client or
> > the server) converts UTF-8 to whatever is needed "on demand".
>
> I have no problem with that. What I was referring to, was making the
> case about using more than one collation simultaneously in a WHERE
> clause and/or ORDER BY clause.

I agree in principle. I was trying to restate my understanding of
Jim's take. Subsequent postings have established that multi-language
collations in one query are desirable.

The model I prefer for implementation, where each Collation is coded
as a class descended from an AbstractCollation class, certainly permits
this.
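
A rough sketch of what I mean, in Python for brevity. Only the name
AbstractCollation comes from the discussion; the two subclasses, the
sort_key/equals methods and the sample rows are hypothetical, and a real
collation would be far more involved than this.

```python
from abc import ABC, abstractmethod
import unicodedata


class AbstractCollation(ABC):
    """Base class: a collation maps a string to a comparable sort key."""

    @abstractmethod
    def sort_key(self, text: str):
        ...

    def equals(self, a: str, b: str) -> bool:
        # Two strings are equal under a collation if their keys match.
        return self.sort_key(a) == self.sort_key(b)


class BinaryCollation(AbstractCollation):
    """Hypothetical collation: order by the raw UTF-8 bytes."""

    def sort_key(self, text: str):
        return text.encode("utf-8")


class AccentInsensitiveCollation(AbstractCollation):
    """Hypothetical collation: ignore case and combining accents."""

    def sort_key(self, text: str):
        decomposed = unicodedata.normalize("NFD", text)
        stripped = "".join(c for c in decomposed if not unicodedata.combining(c))
        return stripped.casefold()


rows = ["Zebra", "école", "Ecole", "apple"]

# Roughly: WHERE name = 'ecole' under the accent-insensitive collation ...
matches = [r for r in rows if AccentInsensitiveCollation().equals(r, "ecole")]

# ... ORDER BY name under the binary collation, in the same "query".
ordered = sorted(matches, key=BinaryCollation().sort_key)
print(ordered)
```

Because each collation is just an object, nothing stops the WHERE
clause and the ORDER BY clause from using different ones.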

>
> >>
> >> One thing that would be nice to have removed is having to specify a
> >> character set on connection, especially if it's possible to have
> >> multiple character sets defined within the database (i.e., fields,
> >> domains, etc...).
>
> > The entire intent of this approach is to remove the concept
> > of "character sets" as a plural from the server engine, and turn them
> > over to one end of the session connection.
>
> You'll still have to specify which character set will be used to
> "represent" the field data, even if the data is saved internally with
> the same dataset.

In my somewhat overbearing and often underinformed opinion, that
should be a function of the client system.

If I try to display ISO-8859-1 data on a CP-500 terminal, even though
both character sets contain the same characters, all I see is garbage
because very few of the characters have the same binary representations.
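
To make the mismatch concrete, here is a small Python sketch. Python
and its codec names (iso-8859-1, cp500) are assumptions of this
illustration only; it says nothing about how Firebird itself would do
the conversion.

```python
text = "Café"

# Bytes as an ISO-8859-1 system would send them.
latin1_bytes = text.encode("iso-8859-1")

# A CP-500 terminal interprets the very same bytes with its own table.
garbled = latin1_bytes.decode("cp500")

# Proper translation: decode with the source charset, encode for the target.
converted = latin1_bytes.decode("iso-8859-1").encode("cp500")

# Count the byte values that mean the same character in both charsets.
shared = sum(
    1
    for b in range(256)
    if bytes([b]).decode("iso-8859-1") == bytes([b]).decode("cp500")
)

print(repr(garbled))
print(converted.decode("cp500"))
print(shared, "of 256 byte values agree")
```

The characters survive only when somebody translates the codes, which
is the whole question of which end of the session does it.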

I see a need to define a clear order of precedence for character set
conversions, and from that order of precedence to determine which end
of the session should be responsible for character code translation.

To preserve currently expected behavior, I suggest the following order
of precedence:

1. CHARACTER SET modifier on DML
2. CHARACTER SET modifier on column
3. Session default character set
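
The order above could be sketched like this, again in Python. The
function names, the use of None for "not specified" and the Python
codec names are all assumptions for illustration; the stored value is
taken to be UTF-8, as proposed earlier in the thread.

```python
from typing import Optional


def effective_charset(
    dml: Optional[str], column: Optional[str], session_default: str
) -> str:
    """Pick the first charset that is set: DML, then column, then session."""
    for candidate in (dml, column):
        if candidate:
            return candidate
    return session_default


def fetch_as(stored_utf8: bytes, charset: str) -> bytes:
    """Transcode a value stored as UTF-8 into the requested charset."""
    return stored_utf8.decode("utf-8").encode(charset)


stored = "Grüße".encode("utf-8")

# No modifier on the DML, so the column's charset wins over the session's.
charset = effective_charset(dml=None, column="iso-8859-1", session_default="utf-8")
wire = fetch_as(stored, charset)
print(charset, wire)
```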

The need to respond to the modifier in the DML suggests that the place
for this conversion is probably on the server side of the session (as
much as I hate to admit it).