firebird-architect - Re: The case of uppercasing (+charset tailoring) [long]

Subject	Re: The case of uppercasing (+charset tailoring) [long]
Author	peter_jacobi.rm
Post date	2004-04-27T17:59:11Z

Hi David,

Thanks a lot for the extensive answer and the
historical background.

I'd like to offer some initial remarks on your reply.

--- "David Schnepper" <dschnepper@b...> wrote:

> First off, by "default collation" I assume you mean
> the character-set-default collation, which is binary
> (codepoint order)

Yes, this one.

> The idea for binary character sets was to have a character set
> with NO default behavior. Binary collation. No remapping
> of characters by upper, etc.

This would make more sense to me if these binary collations
did nothing at all on uppercasing. But they do uppercase a-z.
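
To illustrate the difference, here is a minimal Python sketch (not
Firebird code; the function name upper_ascii_only is made up for
this example). It compares an uppercasing that only touches a-z
with a Unicode-aware one:

```python
def upper_ascii_only(text):
    """Uppercase only a-z; every other character is left untouched."""
    return "".join(
        chr(ord(ch) - 32) if "a" <= ch <= "z" else ch
        for ch in text
    )


sample = "caf\u00e9"  # "cafe" with e-acute (U+00E9)

ascii_result = upper_ascii_only(sample)  # e-acute stays lowercase
full_result = sample.upper()             # e-acute becomes U+00C9

print(ascii_result)
print(full_result)
```

The first result is neither "no remapping" nor a sensible
uppercasing of the whole string, which is the inconsistency
I mean.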

> Certainly there was something sensible that could be done,
> but it deliberately was not implemented.
> And a primary motivation (recall this
> was all done in 1992!) was for 100% (not 99%!) compatibility
> with collation in Borland's dBASE and Paradox (and Paradox engine)
> database products.

Compatibility arguments are always good ones - at the time
of the initial decision.

> French in France uppercases accented characters to
> their un-accented version UNLESS doing so leads
> to a duplicate with another word's uppercased version.

My impression is that this feature is dying out
in data processing applications. For example, LCMapString
on Win32 doesn't do it.

> The *key* issue is that collation,uppercase,lowercase,
> are all properties of a LOCALE (COLLATE),
> not of a CHARACTER SET. There needs to be a binary
> COLLATE for each CHARACTER SET. And that COLLATE
> was made the default for the character set.

The default COLLATE of a character set also defines
a LOCALE, and the uppercasing behaviour of this LOCALE
isn't the most liked feature nowadays, even if one (somewhat)
understands its historical origin.

> There's a half-step approach which I recommend.
>
> a) Implement an "almost binary" collation for each character set.
> This "almost binary" needs a good name, and I can't think of
> one. It will implement binary collation, and "sensible"
> uppercasing behavior for each character set. (You are right,
> a "sensible" mapping works for 99%++ of applications).

I agree with your step plan.

This is a pet issue of mine, and I fear not many will agree,
but I'd prefer this "almost binary" collation to sort in
UNICODE codepoint order. The performance impact of this
simple remapping should be small enough to be tolerated.
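
As a rough sketch of what I mean by "simple remapping" (Python
for illustration only; codepoint_key is a hypothetical helper,
and cp1252 is just an example character set), compare the plain
byte order with an order keyed on the Unicode codepoints:

```python
def codepoint_key(raw, charset="cp1252"):
    """Map each stored byte to its Unicode codepoint for comparison."""
    return [ord(ch) for ch in raw.decode(charset)]


# "z", euro sign (U+20AC) and e-acute (U+00E9), stored as cp1252 bytes.
rows = [s.encode("cp1252") for s in ["z", "\u20ac", "\u00e9"]]

# Pure binary collation: compares the stored bytes.
# In cp1252 the euro sign is 0x80 and e-acute is 0xE9.
binary_order = sorted(rows)

# "Almost binary" collation: compares the remapped codepoints.
codepoint_order = sorted(rows, key=codepoint_key)

print([r.decode("cp1252") for r in binary_order])
print([r.decode("cp1252") for r in codepoint_order])
```

In byte order the euro sign sorts before e-acute; in codepoint
order it sorts after it. For a single-byte character set the
remapping amounts to one table lookup per byte.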

> Let's do the half-step approach above.
>
> Let's make it easier to set a default collation, separate
> from the default character set.

Your "partID" argument needs an answer, but I still
think that in the medium to long term:

a) the default "database default character set"
(double "default" intentional: the character set that is used as
the database default character set if no other is specified)
should be configurable and default to the system's defaults
(e.g. cp1252 collate de_DE on a German Win2K box).

b) NCHAR should become UTF16 (with collate matching the choice for a))
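
For a), the system's defaults are easy to query. A small Python
sketch, purely to show the idea (system_defaults is a made-up
helper name, and the values returned depend entirely on the
machine it runs on):

```python
import locale


def system_defaults():
    """Return (charset, locale_name) as reported by the operating system."""
    # Adopt the user's environment settings; fall back to the
    # process default if the environment names an unknown locale.
    try:
        locale.setlocale(locale.LC_ALL, "")
    except locale.Error:
        pass
    charset = locale.getpreferredencoding(False)
    locale_name = locale.getlocale()[0]  # may be None, e.g. in the "C" locale
    return charset, locale_name


default_charset, default_locale = system_defaults()
print(default_charset, default_locale)
```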

Regards,
Peter Jacobi