Subject RE: [Firebird-Architect] UTF-8 and Compression
Author Claudio Valderrama C.
Olivier Mascia wrote:

> So a simplified view of what a collation table would be is a simple
> map of some unicode values to some other binary value. To sort
> according to the collation, the sort of the characters is done based
> on their unicode value passed through that map. Any character not
> considered by the collation is not in the map and sorts according to
> its original unicode value.

A nice simplified view for discussion, but if we concentrate on real needs,
they will show the true dimension of the problem.
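
For reference, the simplified view quoted above could be sketched roughly as
follows. This is only an illustration in Python: the table contents and the
names COLLATION_MAP and simple_key are made up for the example and are not
taken from the engine or from any real collation.

```python
# Hypothetical single-weight collation table: code point -> replacement weight.
# The entries are illustrative assumptions, not a real collation.
COLLATION_MAP = {
    0x00E9: ord("e"),  # e with acute sorts as plain "e"
    0x00C9: ord("E"),  # E with acute sorts as plain "E"
    0x00F1: ord("n"),  # n with tilde sorts as plain "n"
}


def simple_key(text):
    """Map each character through the table; unmapped characters
    keep their original code point, as in the quoted description."""
    return [COLLATION_MAP.get(ord(ch), ord(ch)) for ch in text]


words = ["zebra", "\u00e9cole", "eau", "Zoo"]

binary_order = sorted(words)                  # plain code point order
mapped_order = sorted(words, key=simple_key)  # order through the map
```

With these sample words the mapped order places "école" next to "eau"
instead of after "zebra". But simple_key("é") and simple_key("e") are now
identical, so the accent no longer takes part in the comparison at all.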

I think that English's a+e (as in Encyclopædia and Hændel) is deprecated, so
people in the US and UK are happy with the 128 positions of the ASCII
table. But I find it surprising that a French speaker wants to assign a single
weight to a character. Does your weight consider the acute, grave and
circumflex accents? In Spanish there's only the acute accent plus the dieresis
(umlaut for Germans), the latter used in a few cases, and the tilde is only
accepted over the n, making another character, ñ. In Portuguese, the same
tilde can go over some vowels, as in the proper name João. On Czech sites I
see a lot of diacritical signs, and I can't tell whether they are genuine or
an artifact of my browser. Danish probably has a precedence rule when
comparing "o" and ø, "a" and å, etc.

Are you saying that a single comparison is enough to establish ordering? Or
that when you compare, you have already treated those combinations as if they
were different letters? I don't think most non-geeky users would be happy with
a script-kiddie sort based on binary values, or with a collation that ignores
accents, spaces and upper/lower case. Let's solve the ß vs. ss problem, too.
After all, dictionaries don't put words with those "subtle" differences
(diacritical marks) in random order; they always seem to follow a set of
rules, even if unknown to most native speakers. If I have a roster, I expect

instead of

as an ASCII sort would provide, and I still haven't considered accents.
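
To make the point about more than one comparison concrete, here is a rough
Python sketch of a multi-level sort key: base letters first, accents second,
case last. The function name multilevel_key and the sample names are
assumptions made for the illustration. It is not the Unicode Collation
Algorithm and not a real Spanish or Danish collation: it treats ñ as n plus
an accent, and ø and æ have no canonical decomposition, so they are not
handled at all.

```python
import unicodedata


def multilevel_key(text):
    """Three-level key: (base letters, accents, case). Illustrative only."""
    # NFD splits precomposed letters into base letter + combining marks.
    decomposed = unicodedata.normalize("NFD", text)
    # Level 1: drop combining marks and fold case (casefold turns sharp s into "ss").
    base = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    primary = base.casefold()
    # Level 2: accents are kept, case is still ignored.
    secondary = decomposed.casefold()
    # Level 3: the decomposed text itself, so case finally breaks ties.
    return (primary, secondary, decomposed)


# Made-up sample names, written with escapes to avoid encoding surprises.
names = ["Zapata", "\u00e1lvarez", "Alvarez", "alvarez", "\u00c1lvarez"]

binary_order = sorted(names)
collated_order = sorted(names, key=multilevel_key)
```

Sorted by code point, "Zapata" lands between "Alvarez" and "alvarez", and
the accented forms trail at the end. With the three-level key all the
Alvarez variants group together ahead of "Zapata", unaccented before
accented and upper case before lower case within each pair.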

What will happen to an indexed string field that's always stored in Unicode
(as proposed) but whose charset is declared as iso8859_1 and whose collation
is Spanish? Will the engine create two indices?