Subject | Re: The case of uppercasing (+charset tailoring) [long] |
---|---|
Author | peter_jacobi.rm |
Post date | 2004-04-27T17:59:11Z |
Hi David,
Thanks a lot for the extensive answer and the
historical background.
I'd like to make some initial remarks on your reply.
--- "David Schnepper" <dschnepper@b...> wrote:
> First off, by "default collation" I assume you mean
> the character-set-default collation, which is binary
> (codepoint order)
Yes, this one.
> The idea for binary character sets was to have a character set
> with NO default behavior. Binary collation. No remapping
> of characters by upper, etc.
This would make more sense to me, if these binary collations
would have done nil on uppercasing. But they do uppercase a-z.
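As an illustration only (plain Python, not the database engine's code, and assuming cp1252 as the character set): Python's `bytes.upper()` happens to behave the same way, mapping ASCII a-z and leaving every other byte untouched, so it makes the gap to a charset-aware uppercase easy to see.

```python
# Illustration only: bytes.upper() maps ASCII a-z and nothing else,
# which mirrors the "binary collation still uppercases a-z" behaviour.
CHARSET = "cp1252"  # assumption: any single-byte Western charset would do

text = "café"
raw = text.encode(CHARSET)  # the bytes as they would be stored

# ASCII-only uppercasing: the byte for é (0xE9) is left alone.
ascii_only_upper = raw.upper()

# Charset-aware uppercasing: é becomes É (0xC9 in cp1252).
charset_aware_upper = text.upper().encode(CHARSET)

print(ascii_only_upper)
print(charset_aware_upper)
```

The two results differ only in the last byte, which is exactly the character outside a-z.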
> Certainly there was something sensible that could be done,
> but it deliberately was not implemented.
> And a primary motivation (recall this
> was all done in 1992!) was for 100% (not 99%!) compatibility
> with collation in Borland's dBASE and Paradox (and Paradox engine)
> database products.
Compatibility arguments are always good ones - at the time
of initial decision.
> French in France uppercases accented characters to
> their un-accented version UNLESS doing so leads
> to a duplicate with another word's uppercased version.
I'm under the impression that this feature is dying
out in data processing applications. For example, LCMapString
on Win32 doesn't do it.
> The *key* issue is that collation, uppercase, lowercase,
> are all properties of a LOCALE (COLLATE),
> not of a CHARACTER SET. There needs to be a binary
> COLLATE for each CHARACTER SET. And that COLLATE
> was made the default for the character set.
The default COLLATE of a character set also defines
a LOCALE, and the uppercasing behaviour of this LOCALE
isn't the most liked feature nowadays, even if one (somewhat)
understands the historical origin of this behaviour.
> There's a half-step approach which I recommend.
>
> a) Implement an "almost binary" collation for each character set.
> This "almost binary" needs a good name, and I can't think of
> one. It will implement binary collation, and "sensible"
> uppercasing behavior for each character set. (You are right,
> a "sensible" mapping works for 99%++ of applications).
Agree to your step plan.
This is a pet issue of mine, and I fear not many will agree,
but I'd prefer this "almost binary" collation to sort in
UNICODE codepoint order. The performance impact of this
simple remapping should be small enough to be tolerated.
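A minimal sketch of what such a remapping means, again in plain Python and assuming cp1252; the function names `binary_key` and `codepoint_key` are made up for this illustration and do not come from any engine. The sort key is the list of Unicode code points of the decoded text instead of the raw bytes.

```python
# Sketch only: compare byte-value ordering with Unicode code point ordering
# for a single-byte character set. Names here are hypothetical.
CHARSET = "cp1252"  # assumption for the example

def binary_key(raw):
    """Plain binary collation: compare the stored bytes as they are."""
    return raw

def codepoint_key(raw):
    """Remap each byte to its Unicode code point before comparing."""
    return [ord(ch) for ch in raw.decode(CHARSET)]

# In cp1252: euro sign = 0x80 (U+20AC), S-caron = 0x8A (U+0160), e-acute = 0xE9 (U+00E9)
words = [s.encode(CHARSET) for s in ("€", "Š", "é")]

by_bytes = [w.decode(CHARSET) for w in sorted(words, key=binary_key)]
by_codepoint = [w.decode(CHARSET) for w in sorted(words, key=codepoint_key)]

print(by_bytes)
print(by_codepoint)
```

For a single-byte character set the remapping amounts to a 256-entry lookup per byte, which is the reason to expect the cost to stay small.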
> Let's do the half-step approach above.
>
> Let's make it easier to set a default collation, separate
> from the default character set.
Your "partID" argument needs an answer, but I'm still
thinking that in the medium to long term:
a) the default "database default character set"
(double "default" intentional: the character set which is the
default database character set if no other is specified)
should be configurable and default to the system's defaults
(cp1252 collate de_DE on a German Win2K box).
b) NCHAR should become UTF16 (with collate matching the choice for a))
Regards,
Peter Jacobi