Subject RE: [Firebird-Architect] Re: UTF-8 vs UTF-16
Author David Schnepper
> Now about comparing: testing the strings "école" (e-acute cole) and "ecole"
> (no accents) for equality should return FALSE (different). Of course,
> "ecole" (no accent) is misspelled, but that would not make an equality
> comparison succeed.
>

As Peter pointed out, the 4 level collation algorithms are designed
for tie breaking. There is a defined order between the following
strings:

ecole
école
Ecole
École
e-cole
E-cole
ec-ole
ec.ole
écolé
ec ole
e cole
ecoLE
éçólê
ÉÇÓLÊ


(Aside: these are not necessarily in the "correct" order -- I was
just showing the different variants that are possible with the same
sequence of base characters.)

In the dictionary, they will all sort near each other, but
there is a defined sequence they will appear in.
By primary ordering, all the above are equal,
by secondary ordering, the accented characters are grouped,
by 3rd ordering, the case differences are grouped
by 4th ordering, the punctuation differences are grouped.
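
To make the levels concrete, here is a rough Python sketch of a
4-level key. It is only an illustration under my own assumptions: the
name collation_key is made up, accents are found through Unicode
decomposition, and this is neither the Unicode Collation Algorithm nor
what ibCollate does internally.

```python
import unicodedata

def collation_key(s):
    """Hypothetical 4-level key: base letters, accents, case, punctuation."""
    primary, secondary, tertiary, quaternary = [], [], [], []
    for pos, ch in enumerate(s):
        if not ch.isalnum():
            # Spaces and punctuation are ignored until the 4th level.
            quaternary.append((pos, ch))
            continue
        decomposed = unicodedata.normalize("NFD", ch)
        base, marks = decomposed[0], decomposed[1:]
        primary.append(base.lower())    # level 1: base letter only
        secondary.append(marks)         # level 2: combining accents
        tertiary.append(base.isupper()) # level 3: case
    return (tuple(primary), tuple(secondary),
            tuple(tertiary), tuple(quaternary))

words = ["ecole", "école", "Ecole", "École", "e-cole", "E-cole",
         "ec-ole", "ec.ole", "écolé", "ec ole", "e cole", "ecoLE",
         "éçólê", "ÉÇÓLÊ"]

# Tuples compare element by element, so each later level only
# breaks ties left over by the levels before it.
for w in sorted(words, key=collation_key):
    print(w)
```

In this sketch all fourteen strings share one primary key, yet no two
of them have the same full key, so only a later level can separate
them.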

If it survives all that, the strings must be equal
<grin>

In your look-in-the-dictionary example:
-----------
David, here are some extracts from a French dictionary:

eau-de-vie (no accents, means brandy)
ébahi (e-acute bahi, means staggered)
...
écarter (e-acute carter, means to spread)
ecchymose (no accents, means bruise)
ecclésiastique (eccl e-acute siastique, means ecclesiastical)
écervelé (e-acute cervel e-acute, means scatty)
...

So you can clearly see that the sequence does not take accents into
account. This is true for all accents in french. These sentences are
based on everyday-life facts in french. But I'm sure I could get some
official word on it somewhere. I'll have a look.
------------
No two of these examples have the same primary characters, so the
accents never get a chance to break a tie. It's the same as sorting:

eau-de-vie
ebahi
ecarter
ecchymose
ecclesiastique
ecervele




> On the other hand, looking for all words starting with "eco" (no
> accents) should ideally return "école" (e-acute cole) as well as words
> starting with the no-accent 'e'.

Agreed. That's the purpose of the special entry in the collation
driver interface -- which is used to support starting_with
and "like 'abc%'". It gives a "partial key" for the string,
using primary collation only - so that the partial key can be
used for a range search against the index.
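
Roughly, the idea looks like the Python sketch below. The names
primary_key, build_index and starting_with are made up for
illustration, and a sorted list stands in for the index; the real
collation driver entry and index code look nothing like this.

```python
import bisect
import unicodedata

def primary_key(s):
    """Primary-level key: accents, case and punctuation are dropped."""
    decomposed = unicodedata.normalize("NFD", s)
    # Combining accents, spaces and punctuation all fail isalnum().
    return "".join(c.lower() for c in decomposed if c.isalnum())

def build_index(words):
    """Stand-in for an index: (primary key, word) pairs in key order."""
    return sorted((primary_key(w), w) for w in words)

def starting_with(index, prefix):
    """Range search using the partial key of the prefix."""
    partial = primary_key(prefix)
    # Lower bound: first entry whose key is >= the partial key.
    lo = bisect.bisect_left(index, (partial,))
    # Upper bound: the partial key followed by the highest code point.
    hi = bisect.bisect_left(index, (partial + "\U0010ffff",))
    return [word for _, word in index[lo:hi]]

index = build_index(["ecole", "école", "Ecole", "École", "écarter",
                     "ecchymose", "eau-de-vie", "ébahi"])
print(starting_with(index, "eco"))
```

A search for "eco" here returns all four of the ecole variants,
accented or not, and nothing else.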

>
> In a no-case sorting, 'e', 'é' (e-acute), 'E', 'É' (E-acute) should all
> be equivalent. While in a typical case-sensitive sort, 'e' and 'é'
> would be equal and AFTER 'E' and 'É' (equal too).

The first case is implemented as FR_FR_NOCASE_NOACCENT in ibCollate.
The 2nd case would be FR_FR_NOACCENT, but I don't have
an implementation for it.
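
As a rough Python sketch of the two behaviours (hypothetical helper
names, and not how ibCollate builds its keys), assuming accents can be
stripped through Unicode decomposition:

```python
import unicodedata

def strip_accents(s):
    """Drop combining marks after canonical decomposition."""
    decomposed = unicodedata.normalize("NFD", s)
    return "".join(c for c in decomposed if not unicodedata.combining(c))

def nocase_noaccent_key(s):
    """First case: accents and case are both ignored."""
    return strip_accents(s).casefold()

def noaccent_key(s):
    """Second case: accents ignored, case kept.

    Plain code point order already puts 'E' before 'e'.
    """
    return strip_accents(s)

letters = ["e", "é", "E", "É"]
print({nocase_noaccent_key(c) for c in letters})
print(sorted({noaccent_key(c) for c in letters}))
```

With the first key all four letters collapse to one value; with the
second, 'E' and 'É' share a key that sorts before the one shared by
'e' and 'é'.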

Dave


>
> --
> Best regards,
> Olivier Mascia