firebird-architect - RE: [Firebird-Architect] The case of uppercasing (+charset tailoring) [long]

Subject	RE: [Firebird-Architect] The case of uppercasing (+charset tailoring) [long]
Author	David Schnepper
Post date	2004-04-27T16:05:21Z

> -----Original Message-----
> From: peter_jacobi.rm [mailto:peter_jacobi@...]
> Sent: Tuesday, April 27, 2004 1:00 AM
> To: Firebird-Architect@yahoogroups.com
> Subject: [Firebird-Architect] The case of uppercasing (+charset
> tailoring) [long]
>
>
> Dear All,
>
> An often encountered surprise when using
> Firebird's character sets, is the uppercasing
> behaviour.
>
> The default (codepoint order) collations only
> uppercase a-z and leave all other codepoints
> invariant. By desperate user needs this was
> already changed for some cyrillic charsets.
>
> I'm not an expert in Interbase history, but I
> assume this problem did originate by overly
> pessimistic thinking about culture correct
> uppercasing. "As uppercasing is culture-dependent",
> the reasoning may go, "we can do nothing sensible
> in the default collation".

Speaking as the author of Interbase internationalization,
I respond "that isn't the whole story"!

First off, by "default collation" I assume you mean
the character-set-default collation, which is binary
(codepoint order) -- and not the "database default character
set" - which can be anything.

The idea for binary character sets was to have a character set
with NO default behavior. Binary collation. No remapping
of characters by upper, etc.

Certainly there was something sensible that could be done,
but it deliberately was not implemented.
And a primary motivation (recall this
was all done in 1992!) was for 100% (not 99%!) compatibility
with collation in Borland's dBASE and Paradox (and Paradox engine)
database products. These products had drivers which
did codepoint order collation, and no uppercase mapping.
Interbase needed to have exactly the same support (or,
so went the thinking at the time).

Now, perhaps these binary drivers should not have been the
default driver for each supported character set, but,
really, having a non-binary driver be the default did
seem a little silly.

Particularly when the documented model for Interbase was
you HAD to declare your character columns with a character
set and collation in order to obtain culturally correct
collation order.

>
> But this is totally out of proportion, as
> uppercasing is only to a tiny percentage culture
> dependent, and you are a true Unicode expert if
> you can name another area of dependencies besides
> the relatively well known Turkish dotless I case.

French in France uppercases accented characters to
their un-accented version UNLESS doing so leads
to a duplicate with another word's uppercased version.

French in Canada always uppercases to the accented
version of the character.

The *key* issue is that collation, uppercase, and lowercase
are all properties of a LOCALE (COLLATE),
not of a CHARACTER SET. There needs to be a binary
COLLATE for each CHARACTER SET. And that COLLATE
was made the default for the character set.
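
To make the locale point concrete, here is a minimal Python sketch
(not Firebird or Interbase code). The function name and the locale
tags "tr" and "fr_FR_unaccented" are hypothetical labels chosen for
illustration. The Turkish branch follows the well-known dotted and
dotless I rule; the French branch follows only the description given
above and ignores the "unless it creates a duplicate" exception.

```python
import unicodedata


def upper_for_locale(text: str, locale: str) -> str:
    """Uppercase `text` with a small, illustrative locale tailoring.

    The locale tags are hypothetical; a real engine would attach this
    behaviour to a named collation rather than to a string tag.
    """
    if locale == "tr":
        # Turkish: dotted i -> U+0130 (capital I with dot above),
        # dotless U+0131 -> plain capital I.
        text = text.replace("i", "\u0130").replace("\u0131", "I")
        return text.upper()
    if locale == "fr_FR_unaccented":
        # Assumption taken from the message above: accents are dropped
        # when uppercasing. The duplicate-word exception is not modelled.
        decomposed = unicodedata.normalize("NFD", text.upper())
        return "".join(c for c in decomposed if not unicodedata.combining(c))
    # Default: untailored Unicode uppercasing (accents are kept).
    return text.upper()


print(upper_for_locale("istanbul", "tr"))
print(upper_for_locale("istanbul", "en"))
print(upper_for_locale("été", "fr_CA"))
print(upper_for_locale("été", "fr_FR_unaccented"))
```

The same string in the same character set produces different
uppercase results, which is why the mapping belongs to the COLLATE
and not to the CHARACTER SET.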

>
> So, in my (in this case not so humble) opinion, this
> behaviour must go and uppercasing must become default
> Unicode uppercasing for all character sets (plus minor
> tailoring for Turkish).
>
> The question is about the mode and schedule of deployment.
>
> By changing the uppercasing behaviour, existing databases
> may break (but the change for cyrillic did go through
> without major requests to kill somebody for it).
>
> So, are we doomed to leave default collation uppercasing
> as-is, or should it just be corrected, hoping for the best?
>
> Is there a "dialect 4" in sight? Should it be done alongside
> with a totally new ODS only?

There's a half-step approach which I recommend.

a) Implement an "almost binary" collation for each character set.
This "almost binary" needs a good name, and I can't think of
one. It will implement binary collation, and "sensible"
uppercasing behavior for each character set. (You are right,
a "sensible" mapping works for 99%++ of applications).

b) Ship the above with any release.

c) In some future release, change the "default order" for each
character set to be the new ordering.

d) Keep the existing pure-binary orderings -- they just aren't the default.

It's possible to combine steps b & c with careful documentation.
Existing applications will continue to use the pure-binary
collations, newly defined schemas (after step c) will use
the new ordering. Extracted schemas will explicitly
define their ordering.
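
As a rough illustration of step (a), the following Python sketch
contrasts the existing pure-binary behaviour with one possible
"almost binary" collation. It is an assumption-laden model, not
engine code: the function names are hypothetical, Python's "latin-1"
codec stands in for a single-byte character set such as ISO8859_1,
and "sensible" uppercasing is taken to mean "use the Unicode uppercase
form when it is a single character that exists in the same character
set".

```python
def binary_upper(text: str) -> str:
    """Pure-binary behaviour as described in the thread: only a-z map."""
    return "".join(chr(ord(c) - 32) if "a" <= c <= "z" else c for c in text)


def almost_binary_upper(text: str, charset: str = "latin-1") -> str:
    """Hypothetical 'almost binary' uppercasing for one character set.

    A character is replaced by its Unicode uppercase form only when that
    form is a single character representable in `charset`; otherwise it
    is left unchanged.
    """
    out = []
    for c in text:
        u = c.upper()
        try:
            u.encode(charset)
            usable = len(u) == 1
        except UnicodeEncodeError:
            usable = False
        out.append(u if usable else c)
    return "".join(out)


def binary_key(text: str, charset: str = "latin-1") -> bytes:
    """Sort key shared by both collations: the encoded byte values."""
    return text.encode(charset)


print(binary_upper("été"))            # only the 't' changes
print(almost_binary_upper("été"))     # accented letters change too
print(almost_binary_upper("straße"))  # 'ß' has no one-character uppercase
print(sorted(["zebra", "école", "Zoo"], key=binary_key))
```

Ordering is identical in both cases because both use the byte values
as the sort key; only the uppercase mapping differs, so indexes built
on the pure-binary collation would not need a different sort order.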

>
> Touching a vaguely related point, the problem with the
> default collation, is not only that it's the default, it's
> also the only one that can be easily specified for a
> complete database, as the db's default character set can
> be set, but not the db's default collation.

But do you really WANT to set a db's default collation?
I think yes, if all you're looking for is a change in
uppercasing behavior, but NO if you're really going to
use expensive collation. Using FR_FR (for example) on
a PART_ID field is really inappropriate. You don't need
the performance overhead, and a listing by PART_ID would
not be correct.
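
A small Python sketch of why a cultural collation can give a
surprising listing for identifiers. The `cultural_key` function is a
crude, hypothetical stand-in for a collation like FR_FR (it only
folds case before comparing); a real cultural collation is more
involved and more expensive, which is the performance point above.

```python
def binary_key(part_id: str) -> bytes:
    """Codepoint (byte) order, as a binary collation would use."""
    return part_id.encode("latin-1")


def cultural_key(part_id: str) -> tuple:
    """Hypothetical stand-in for a cultural collation: case is only a
    tie-breaker, so 'a' and 'A' sort next to each other."""
    return (part_id.casefold(), part_id)


part_ids = ["a-10", "B-2", "A-3"]
print(sorted(part_ids, key=binary_key))
print(sorted(part_ids, key=cultural_key))
```

The two keys list the same PART_ID values in different orders, so
which one counts as "correct" depends on whether the column holds
words or identifiers.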

>
> This gap can be closed by the SQL standard's (otherwise
> rather useless) feature to dynamically create new character
> sets, as you can create a new character set having the desired
> collation as default.

Yes, that was the model for how to "set a default character set
and collation". I've done this myself by doing the appropriate
sql statements (insert into rdb$character_sets, etc). I think
I even made a stored procedure somewhere to do it.

>
> Are there opinions on this idea?

Let's do the half-step approach above.

Let's make it easier to set a default collation, separate
from the default character set.

Dave Schnepper
