Subject | Collations (was Re: UTF-8 vs UTF-16) |
---|---|
Author | peter_jacobi.rm |
Post date | 2003-08-26T10:22:31Z |
Hi adem,
In Firebird-Architect@yahoogroups.com, "adem" wrote:
> Again, please forgive my ignorance, but deep down,
> isn't a collation order some form of an array where
> on one side is the charcode on the other is the sequence
> number of it? If so, what do you mean by collation
> compiler?
It's actually up to four arrays plus some extra maps.
What you have in mind, is a simple one-level collation,
the only one supported by early databases, for example
Btrieve. There you simply have 256 sort values for 256
characters and you're done.
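To make the one-level case concrete, here is a minimal sketch in Python. The table contents are my own illustration (byte order with ASCII lowercase folded onto uppercase), not the table of Btrieve or any other product, and the names SORT_VALUE and one_level_key are made up for this example.

```python
# Hypothetical 256-entry sort table for a single-byte character set.
# Start from plain byte order ...
SORT_VALUE = list(range(256))
# ... then, purely as an illustrative assumption, give ASCII lowercase
# letters the same sort value as their uppercase counterparts.
for code in range(ord("a"), ord("z") + 1):
    SORT_VALUE[code] = code - 32


def one_level_key(raw: bytes) -> bytes:
    """Translate every byte through the table; the result compares bytewise."""
    return bytes(SORT_VALUE[b] for b in raw)


words = [b"beta", b"Alpha", b"alpha", b"Beta"]
# Words that differ only in case get identical keys, so the stable sort
# keeps them in input order.
print(sorted(words, key=one_level_key))
```

One array lookup per character is all there is to it, which is why this scheme cannot say "a" and "A" are equal at first glance but different when nothing else decides.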
Mmmhhh. Given that most users don't care,
perhaps we should fall back to that level of support...
The last time Dave tried to initiate the uninitiated
was a few hours ago, in:
http://groups.yahoo.com/group/Firebird-Architect/message/4828
As I have some long tests running right now, I can try
to explain it in long form:
Full four-level collation to compare two strings:
1. Strip trailing blanks, as defined by the character set.
2. Do all collation-defined contractions, e.g. contract
{U+0064 U+017E} to {U+01C6} (LATIN SMALL LETTER DZ WITH CARON) for
Croatian.
3. Do all collation-defined expansions, e.g. expand
{U+00F6} ("ö", LATIN SMALL LETTER O WITH DIAERESIS) to
{[some special character almost equal to "o"] e} for
German phonebook sort order.
Steps 4..7:
FOR N = 1 TO 4 DO BEGIN
  Translate both strings using the Nth-level sort
  values. Note that not all characters have
  an Nth-level sort value; those are simply
  ignored in this step.
  Compare the translated strings; if not equal, BREAK.
END
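The steps above can be sketched in Python roughly as follows. Everything in the tables is invented for illustration: W, CONTRACTIONS and EXPANSIONS are hypothetical names with made-up weights, a plain "o" stands in for the "special character almost equal to o", and str.replace is a simplification of real contraction and expansion handling.

```python
from functools import cmp_to_key

# Illustrative tables only, not real collation data.
CONTRACTIONS = {"d\u017e": "\u01c6"}  # d + z-caron -> dz-caron (one sort unit)
EXPANSIONS = {"\u00f6": "oe"}         # o-diaeresis -> "o" + "e" (plain "o" as stand-in)

# unit: (character, accent, case, tie-breaker); None = no weight at that level
W = {
    "a": (1, 1, 1, None),
    "A": (1, 1, 2, None),
    "\u00e1": (1, 2, 1, None),  # a-acute
    "b": (2, 1, 1, None),
    "d": (3, 1, 1, None),
    "\u01c6": (4, 1, 1, None),  # dz-caron digraph
    "e": (5, 1, 1, None),
    "o": (6, 1, 1, None),
    "z": (7, 1, 1, None),
    "\u017e": (7, 2, 1, None),  # z-caron
    "-": (None, None, None, 1),
    "=": (None, None, None, 2),
}
NO_WEIGHT = (None, None, None, None)  # characters missing from W are ignored


def prepare(s):
    s = s.rstrip(" ")                       # step 1: strip trailing blanks
    for seq, unit in CONTRACTIONS.items():  # step 2: contractions
        s = s.replace(seq, unit)
    for ch, seq in EXPANSIONS.items():      # step 3: expansions
        s = s.replace(ch, seq)
    return s


def level_values(s, n):
    """Sort values of s at level n, skipping units without a weight there."""
    return [W.get(ch, NO_WEIGHT)[n] for ch in s
            if W.get(ch, NO_WEIGHT)[n] is not None]


def compare(a, b):
    a, b = prepare(a), prepare(b)
    for n in range(4):                      # steps 4..7: one pass per level
        ka, kb = level_values(a, n), level_values(b, n)
        if ka != kb:
            return -1 if ka < kb else 1
    return 0


print(compare("ab", "\u00e1a"))  # decided at level 1; the accent is never consulted
print(sorted(["A", "\u00e1", "b", "a"], key=cmp_to_key(compare)))
```

A real implementation would build one sort key per string instead of re-translating on every comparison; the loop form is kept here only because it mirrors the pseudocode.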
In the most commonly used form, the four levels are:
1. Character weight
2. Accent weight
3. Case weight
4. Tie breaker weight
So the above algorithm means:
1. Care for accents only when entire strings are equal
ignoring accents
2. Care for case only when entire strings are equal
ignoring case
3. Care for non-character distinctions ("-" vs "=") only
when there is still a tie.
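To see these three consequences on real words, here is a rough approximation using only Python's standard unicodedata module. This is not how a collation table works internally, just a sketch under my own assumptions: level 1 is the text with accents, case and punctuation removed, level 2 adds the combining accents back, level 3 is a per-letter uppercase flag, and the raw string serves as tie breaker. The name approx_key is made up.

```python
import unicodedata


def approx_key(s):
    decomposed = unicodedata.normalize("NFD", s)
    # Level 1: letters and digits only, accents and case ignored.
    base = "".join(c for c in decomposed if c.isalnum()).casefold()
    # Level 2: the same, but with combining accents kept.
    accents = "".join(
        c for c in decomposed if c.isalnum() or unicodedata.combining(c)
    ).casefold()
    # Level 3: lowercase (False) sorts before uppercase (True).
    case = tuple(c.isupper() for c in decomposed if c.isalnum())
    # Level 4: the raw string as a crude tie breaker.
    return (base, accents, case, s)


words = [
    "R\u00e9sum\u00e9",  # Resume with two e-acute, capitalized
    "resumes",
    "re-sume",
    "Resume",
    "r\u00e9sum\u00e9",  # resume with two e-acute
    "resume",
]
for word in sorted(words, key=approx_key):
    print(word)
```

Note that the accented forms still sort before "resumes": the accent is only looked at among strings that are equal without it, whereas a plain code point comparison would put every "ré..." after every "re...".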
Regards,
Peter Jacobi