firebird-architect - Collations (was Re: UTF-8 vs UTF-16)

Subject	Collations (was Re: UTF-8 vs UTF-16)
Author	peter_jacobi.rm
Post date	2003-08-26T10:22:31Z

Hi adem,

In Firebird-Architect@yahoogroups.com, "adem" wrote:

> Again, please forgive my ignorance, but deep down,
> isn't a collation order some form of an array where
> on one side is the charcode on the other is the sequence
> number of it? If so, what do you mean by collation
> compiler?

It's actually up to four arrays plus some extra maps.

What you have in mind is a simple one-level collation,
the only kind supported by early databases, for example
Btrieve. There you simply have 256 sort values for 256
characters and you're done.
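
A minimal sketch of such a one-level collation, in Python. The table
contents and the names SORT_VALUE and sort_key are made up for
illustration (identity order, with lowercase ASCII folded onto
uppercase); this is not Btrieve's actual table format.

```python
# Hypothetical one-level collation: exactly one sort value per byte value.
# Start from identity order, then give lowercase ASCII letters the same
# sort value as their uppercase counterparts.
SORT_VALUE = list(range(256))
for code in range(ord("a"), ord("z") + 1):
    SORT_VALUE[code] = SORT_VALUE[code - 32]


def sort_key(text):
    """Translate each byte to its sort value; comparing these lists compares the strings."""
    return [SORT_VALUE[b] for b in text]


words = [b"banana", b"Cherry", b"apple"]
ordered = sorted(words, key=sort_key)   # apple, banana, Cherry
binary = sorted(words)                  # Cherry, apple, banana (raw byte order)
```

With a single table there is no way to say "ignore case first, then
use it to break ties": b"apple" and b"Apple" either compare equal or
differ right from the first character.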

Mmmhhh. Given the fact that most users don't care,
perhaps we should fall back to that level of support...

The last time Dave tried to initiate the unknowing
was hours ago in:
http://groups.yahoo.com/group/Firebird-Architect/message/4828

As I have some long tests running just now, I can try
to explain it in the long form:

Full four-level collation to compare two strings:

1. Strip trailing blanks, as defined by the character set.
2. Do all collation-defined contractions, e.g. contract
{U+0064 U+017E} to {U+01C6} (LATIN SMALL LETTER DZ WITH CARON) for
Croatian.
3. Do all collation-defined expansions, e.g. expand
{U+00F6} ("ö", LATIN SMALL LETTER O WITH DIAERESIS) to
{[some special character almost equal to "o"] e} for
German phonebook sort order.
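
Steps 1 to 3 can be sketched in Python as below. The function name
prepare and both tables are hypothetical. The contraction maps "d"
followed by "ž" (U+017E) to U+01C6, and a plain "o" stands in for the
"special character almost equal to o". A real implementation would
match the longest sequence at each position instead of using
str.replace.

```python
# Hypothetical per-collation tables, for illustration only.
CROATIAN_CONTRACTIONS = {"d\u017e": "\u01c6"}      # d + z-caron -> single dz-caron
GERMAN_PHONEBOOK_EXPANSIONS = {"\u00f6": "oe"}     # o-diaeresis -> o + e (simplified)


def prepare(text, contractions, expansions, blank=" "):
    """Apply steps 1-3: strip trailing blanks, contract, then expand."""
    text = text.rstrip(blank)                        # step 1
    for sequence, single in contractions.items():    # step 2
        text = text.replace(sequence, single)
    for single, sequence in expansions.items():      # step 3
        text = text.replace(single, sequence)
    return text


croatian = prepare("d\u017eep  ", CROATIAN_CONTRACTIONS, {})
german = prepare("K\u00f6ln", {}, GERMAN_PHONEBOOK_EXPANSIONS)
```

Here croatian is three characters long (the contracted U+01C6 followed
by "ep") and german is "Koeln". The prepared strings are what the
level-by-level comparison then works on.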

Steps 4..7:
FOR N = 1 TO 4 DO BEGIN
  Translate both strings using the Nth level sort
  values. Note that not all characters may
  have an Nth level sort value - these are just
  ignored in this step.
  Compare these translated strings; if not equal, BREAK.
END

In the most often used form, the four levels are:
1. Character weight
2. Accent weight
3. Case weight
4. Tie breaker weight

So the above algorithm means:
1. Care for accents only when entire strings are equal
ignoring accents
2. Care for case only when entire strings are equal
ignoring case
3. Care for non-character distinctions ("-" vs "=") only
when there is still a tie.
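
To make the level loop and its consequences concrete, here is a small
Python sketch. The weight table is invented for this example and
covers only a handful of characters. None means "no sort value at this
level", so the character is ignored there. Steps 1 to 3 are left out.

```python
from functools import cmp_to_key


def letter(char_weight, accent=0, case=0):
    """Weights for a letter: (character, accent, case, tie breaker)."""
    return (char_weight, accent, case, None)


# Hypothetical four-level weights; real collations ship much larger tables.
WEIGHTS = {
    "c": letter(1), "e": letter(2), "l": letter(3),
    "o": letter(4), "p": letter(5), "r": letter(6),
    "R": letter(6, case=1),
    "\u00f4": letter(4, accent=1),      # o with circumflex
    "-": (None, None, None, 1),         # punctuation: tie breaker only
    "=": (None, None, None, 2),
}


def level_values(text, level):
    """Translate text to its sort values at one level, skipping characters that have none."""
    values = []
    for ch in text:
        weight = WEIGHTS[ch][level]
        if weight is not None:
            values.append(weight)
    return values


def compare(a, b):
    """Return -1, 0 or 1. A level is consulted only if all earlier levels tie."""
    for level in range(4):
        va, vb = level_values(a, level), level_values(b, level)
        if va != vb:
            return -1 if va < vb else 1
    return 0


words = ["rope", "r\u00f4le", "Role", "role"]
ordered = sorted(words, key=cmp_to_key(compare))
# ordered: role, Role, r\u00f4le, rope
```

"role" and "Role" differ only at the case level, "Role" sorts before
"rôle" because accents are checked before case, and "rôle" still sorts
before "rope" because the character level already decides. Likewise
compare("co-op", "co=op") is a tie until the fourth level.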

Regards,
Peter Jacobi