Subject | Collations (was Re: UTF-8 vs UTF-16) |
---|---|
Author | adem |
Post date | 2003-08-26T15:15:44Z |
Hi Peter,

> It's actually up to four arrays plus some extra maps.
>
> What you have in mind is a simple one-level collation,
> the only one supported by early databases, for example
> Btrieve. There you simply have 256 sort values for 256
> characters and you're done.

Yes, but not necessarily 256 -- for those who need more.

> Mmmhhh. Given the fact that most users don't care,
> perhaps we should fall back to that level of support...

It is not that I don't care (well, maybe it is, I
suppose), but an array (or a column in a table called
CHARSETS or something) seems to be able to replace the
algo you describe below, and give me the freedom and the
responsibility to specify *my own* collation order
-- especially if I am dealing with less than widely
known languages.
And, since it would be a simple lookup array, it stands
a good chance of being faster.
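
As a rough illustration of the lookup-array idea, here is a minimal Python sketch of a one-level collation driven by a user-supplied order. The names `CUSTOM_ORDER`, `WEIGHTS` and `sort_key` are made up for this example, and the ordering is modeled on the Turkish alphabet purely for illustration; none of this is an existing Firebird structure.

```python
# Hypothetical user-supplied collation order: characters earlier in the
# string sort first. Modeled on the Turkish alphabet for illustration only.
CUSTOM_ORDER = "abcçdefgğhıijklmnoöprsştuüvyz"

# The lookup array: character -> sort weight.
WEIGHTS = {ch: i for i, ch in enumerate(CUSTOM_ORDER)}


def sort_key(text):
    """Map each character to its weight; unknown characters sort after
    all known ones, ordered by code point."""
    unknown = len(WEIGHTS)
    return [WEIGHTS.get(ch, unknown + ord(ch)) for ch in text]


words = ["şeker", "seker", "zil", "çay", "cay"]
print(sorted(words, key=sort_key))
# ['cay', 'çay', 'seker', 'şeker', 'zil']
```

A plain code-point sort of the same list would put "çay" and "şeker" after "zil". Note that this covers only the single-level case; it cannot express contractions, expansions, or "accents matter only on a tie" rules.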
> The last time Dave tried to initiate the unknowing
> was hours ago in:
> http://groups.yahoo.com/group/Firebird-Architect/message/4828

I have read it. And, ouch! It is a very good example
of how the database developer needs to be an expert
on linguistics... Is this really fair on the developers?

> As I just have some long tests running, I can give a try
> to explain it in the long form:

Thanks.

> Full four-level collation to compare two strings.
>
> 1. Strip trailing blanks, as defined by the character set.
> 2. Do all collation defined contractions, e.g. contract
> {U+0064 U+017E} to {U+01C6} (LATIN SMALL LETTER DZ WITH CARON) for
> Croatian.
> 3. Do all collation defined expansions, e.g. expand
> {U+00F6} ("ö", LATIN SMALL LETTER O WITH DIAERESIS) to
> {[some special character almost equal to "o"] e} for
> German phonebook sort order.

But, if I were allowed to specify the collation order in
Unicode, would you still need to convert everything to
single byte? Wouldn't skipping that conversion be faster?
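
For readers following the quoted steps 1 to 3, they can be sketched roughly as below. This is only an illustration under stated assumptions: the tables `CONTRACTIONS` and `EXPANSIONS` and the function `prepare` are hypothetical names, they contain just the two examples from the quote, and the expansion uses a plain "o" where the quoted description calls for a special character that is almost, but not exactly, equal to "o".

```python
# Hypothetical tailoring tables holding only the two quoted examples.
CONTRACTIONS = {"d\u017e": "\u01c6"}  # d + z-with-caron -> dz-with-caron digraph
EXPANSIONS = {"\u00f6": "oe"}         # o-with-diaeresis -> o + e (simplified)


def prepare(text):
    """Apply the three preprocessing steps before any weights are looked up."""
    # Step 1: strip trailing blanks (assuming the blank is U+0020).
    text = text.rstrip(" ")
    # Step 2: contractions, several code points become one collation unit.
    for sequence, unit in CONTRACTIONS.items():
        text = text.replace(sequence, unit)
    # Step 3: expansions, one code point becomes several collation units.
    for char, units in EXPANSIONS.items():
        text = text.replace(char, units)
    return text


print(prepare("G\u00f6tz   "))  # Goetz
```

Because of the simplification in step 3, "Götz" and "Goetz" become identical here; the special character in the quoted description is what would let a later level tell them apart.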
> In the most often used form, the four levels are:
> 1. Character weight
> 2. Accent weight
> 3. Case weight
> 4. Tie breaker weight
>
> So the above algorithm means:
> 1. Care for accents only when entire strings are equal
> ignoring accents
> 2. Care for case only when entire strings are equal
> ignoring case
> 3. Care for non-character distinctions ("-" vs "=") only
> when there is still a tie.

Cheers,
Adem
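
For reference, the four-level comparison quoted above can be sketched in a few lines of Python. This is a minimal sketch, not how any database implements it: the table `LEVEL_WEIGHTS`, its weight values, and the function `multilevel_key` are invented for the example, and it covers only eight characters.

```python
# Hypothetical weights: char -> (character, accent, case, tie-breaker).
LEVEL_WEIGHTS = {
    "a": (1, 0, 0, 0), "A": (1, 0, 1, 0),
    "á": (1, 1, 0, 0), "Á": (1, 1, 1, 0),
    "b": (2, 0, 0, 0), "B": (2, 0, 1, 0),
    "-": (0, 0, 0, 1), "=": (0, 0, 0, 2),
}


def multilevel_key(text):
    """Build one weight list per level. Tuples compare element by element,
    so level 2 is consulted only if the whole level-1 lists are equal,
    level 3 only if levels 1 and 2 are equal, and so on."""
    rows = [LEVEL_WEIGHTS[ch] for ch in text]
    return tuple([row[level] for row in rows] for level in range(4))


words = ["áb", "B", "ab", "Ab", "b"]
print(sorted(words, key=multilevel_key))
# ['ab', 'Ab', 'áb', 'b', 'B']
```

With these weights "áa" still sorts before "ab", because the accent is looked at only when the strings tie on character weight, and "a-b" sorts before "a=b" only through the tie-breaker level.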