firebird-architect - Collations (was Re: UTF-8 vs UTF-16)

Subject	Collations (was Re: UTF-8 vs UTF-16)
Author	adem
Post date	2003-08-26T15:15:44Z

> Hi Peter,
>
> It's actually up to four arrays plus some extra maps.
>
> What you have in mind, is a simple one-level collation,
> the only one supported by early databases, for example
> Btrieve. There you simply have 256 sort values for 256
> characters and you're done.

Yes, but not necessarily 256 --for those who need more.

> Mmmhhh. Given the fact that most users don't care,
> perhaps we should fall back to that level of support...

It is not that I don't care (well, maybe it is, I
suppose), but an array (or a column in a table called
CHARSETS or something) seems to be able to replace the
algo you describe below, and give me the freedom and the
responsibility to specify *my own* collation order
--especially if I am dealing with less than widely
known languages.

And, since it would be a simple lookup array, it stands
a good chance that it will be faster.
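
A minimal sketch of the kind of lookup array I have in mind, in
Python. The alphabet string, the names WEIGHTS and sort_key, and the
rule for characters missing from the table are all assumptions made
for illustration; nothing here is existing Firebird behaviour.

```python
# Hypothetical user-supplied collation: the position of a character in this
# string is its sort weight. The alphabet is only an illustrative example.
ALPHABET_ORDER = "abcçdefgğhıijklmnoöprsştuüvyz"
WEIGHTS = {ch: i for i, ch in enumerate(ALPHABET_ORDER)}


def sort_key(text):
    """Map each character to a weight with a single table lookup."""
    fallback = len(WEIGHTS)  # assumption: unlisted characters sort last
    return [
        (WEIGHTS[ch], 0) if ch in WEIGHTS else (fallback, ord(ch))
        for ch in text
    ]


words = ["cam", "çam", "dal", "can"]
by_table = sorted(words, key=sort_key)   # uses the user-defined order
by_code_point = sorted(words)            # plain code point order
print(by_table)
print(by_code_point)
```

The two results differ only in where the word starting with "ç" ends
up, which is exactly the part I would like to control myself.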

> The last time Dave tried to initiate the unknowing
> was hours ago in:
> http://groups.yahoo.com/group/Firebird-Architect/message/4828

I have read it. And, ouch! It is a very good example
of how the database developer needs to be an expert
on linguistics... Is this really fair on the developers?

> As I just have some long tests running, I can give a try
> to explain it in the long form:

Thanks.

> Full four level collation to compare two strings.
>
> 1. Strip trailing blanks, as defined by the character set.
> 2. Do all collation defined contractions, e.g. contract
> {U+0064 U+017E} to {U+01C6} (LATIN SMALL LETTER DZ WITH CARON) for
> Croatian.
> 3. Do all collation defined expansions, e.g. expand
> {U+00F6} ("ö", LATIN SMALL LETTER O WITH DIAERESIS) to
> {[some special character almost equal to "o"] e} for
> German phonebook sort order.
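
The three preparation steps quoted above could be sketched roughly as
follows. This is a simplified illustration, not the actual algorithm:
the function name prepare and the two tables are hypothetical, only
the space character is treated as a blank, and the expansion uses a
plain "o" where the real rule uses a special near-"o" element.

```python
# Hypothetical per-collation tables (illustrative only).
CONTRACTIONS = {"d\u017e": "\u01c6"}  # d + ž (U+0064 U+017E) -> dž (U+01C6)
EXPANSIONS = {"\u00f6": "oe"}         # ö -> "oe"; simplification, see lead-in


def prepare(text, contractions, expansions):
    """Apply steps 1-3: strip trailing blanks, contract, then expand."""
    # Step 1: strip trailing blanks (assumption: only U+0020 is a blank).
    text = text.rstrip(" ")
    # Step 2: contractions, longest sequence first so overlaps resolve sanely.
    for seq in sorted(contractions, key=len, reverse=True):
        text = text.replace(seq, contractions[seq])
    # Step 3: expansions, one character at a time.
    return "".join(expansions.get(ch, ch) for ch in text)


croatian = prepare("d\u017eem  ", CONTRACTIONS, {})
phonebook = prepare("K\u00f6ln", {}, EXPANSIONS)
print(croatian, phonebook)
```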

But, if I were allowed to specify the collation order in
Unicode, would you still need to convert everything to
single byte? Wouldn't skipping that conversion be faster?

> In the most often used form, the four levels are:
> 1. Character weight
> 2. Accent weight
> 3. Case weight
> 4. Tie breaker weight
>
> So the above algorithm means:
> 1. Care for accents only when entire strings are equal
> ignoring accents
> 2. Care for case only when entire strings are equal
> ignoring case
> 3. Care for non-character distinctions ("-" vs "=") only
> when there is still a tie.
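
To check my understanding of the four levels, here is a rough Python
sketch. It is not the real algorithm: the function name
four_level_key is made up, weights are derived from Unicode
decomposition instead of collation tables, lowercase is assumed to
sort before uppercase, and non-alphanumeric characters are simply
ignored until the final tie breaker.

```python
import unicodedata


def four_level_key(text):
    """Build a (character, accent, case, tie breaker) comparison key."""
    primary, accents, case = [], [], []
    for ch in unicodedata.normalize("NFD", text):
        if unicodedata.combining(ch):
            # Level 2: attach the combining mark to the preceding letter.
            if accents:
                accents[-1] += ch
            continue
        if not ch.isalnum():
            continue  # punctuation only counts at level 4
        primary.append(ch.lower())  # level 1: base character
        accents.append("")          # level 2: no accent seen yet
        case.append(ch.isupper())   # level 3: False (lower) sorts first
    # Level 4: the raw string breaks any remaining tie.
    return (primary, accents, case, text)


words = ["Résumé", "resume", "Resume", "résumé", "resumf", "co-op", "co=op"]
ordered = sorted(words, key=four_level_key)
print(ordered)
```

Because tuples compare element by element, accents are consulted only
when all base characters are equal, case only when accents are also
equal, and the raw string only when everything else ties.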

Cheers,
Adem