Subject Re: [Firebird-test] [FB-Tracker] Commented: (CORE-3738) Charset UTF8 with collate UNICODE: ORDER BY works wrong for polish characters
Author Ann Harrison
On Sat, Jan 21, 2012 at 12:52 PM, Mariusz Nogala <mnogala@...> wrote:

Quite possibly I'm responding to a month-old message that's been
answered elsewhere, but it appeared for me on Firebird-tests and
belongs to Firebird-support, so I'm responding in two places. The
message was a response to a bug report (CORE-3738) about the collation
of some Polish characters under the default UTF-8 collation.

>
> I have tried the solution with CREATE COLLATION,
> but it does not solve my problem.

You may not have followed all the steps in creating and using a collation.

> (moreover the attribute LOCALE does not work with my
> Firebird 2.5.1. I don't know why).

Some databases intuit a lot of information from the LOCALE of the
server, including character set and collation for string data, date
format, whether fractions are separated from integer parts of numbers
with a "," or a "." and inversely whether hundreds are separated by a
".' or a ",". Firebird does not. And that's a good thing, in my
opinion. When server determines all those things from its locale,
writing multi-lingual applications becomes a nightmare. My career in
databases spans the old days of "ASCII is good enough for me, it
should be good enough for you" and "If God had meant upper and lower
case to sort near each other, He wouldn't have put them so far apart
in the ASCII code set" through "Why don't Americans care about
rational sorting of upper and lower case?" and "My character set is
better than your character set" to the relative peace of Unicode and
UTF-8.

Then from time to time there's a collation problem. We all have
cultural biases - people who live at the edges of several cultures are
more aware of them than others. A bias that's hard to recognize is
"When I learned my alphabet (at the age of 4 or 5), I learned the one
true order of characters in the world. My kindergarten teacher would
never lie to me." Alas, no. Nearly every country orders words
differently - non-phonetic character sets being the most confusing to
those of us who learned them late in life. But even in phonetic
languages with Latin character sets, there are huge differences in the
"right" sorting of characters. In some Iberian langauges, "ll"
(double L) sorts after "y" and nowhere near a single lower case L.
And, in some cases, the same character ("a" with a circle over it in
Nordic languages for one) sorts in different places in different
languages.

>
> But I am not convinced.

I believe you. You want a classic Polish collation. That requires a
table - possibly quite complex - that matches each character's binary
representation with a weighted version that sorts naturally in the way
you expect. The weighted versions are often larger than the binary
representation of the character - often two bytes, sometimes three for
multi-level collations. The Unicode Standard Website probably has one
- start at Unicode.org. Then build that table into a collation object
that Firebird can use, then introduce it through the collations table,
then use that collation name as the default for the database, or for
each column that you want to have ordered that way.

> I steel think that something is wrong with collate UNICODE.

Unfortunately, the correct collation for Unicode is whatever the
Unicode standard organization says it is, however illogical it may
seem and however much it may violate your kindergarten teacher's
rules.

> The point is that it sorts single character texts properly and
> longer texts IN A DIFFERENT WAY. The sorting order should
> mainly depend on the first character in a column.

Sometimes it depends on the first two. Sometimes it depends on the
last accented character in the string (French is accent sensitive, but
considers accents from the back forward.)

> If accented characters follow the non-accented in single character texts,
> so why it does not happen when sorting longer texts?
>
> See the example:
>
> ORDER BY gives:
> a, ą, ab   (here 'ą' follows 'a', but not 'ab' - it makes no sense...)
>
> I am sure it should be:
> a, ab, ą   (because UNICODE collation is accent sensitive)
>

That is the effect of a secondary ordering. In the primary ordering a
and ą are equal, so the next letters are considered. Once the whole
string has been compared and the two are still equal, the secondary
characteristics are considered. In the secondary ordering a comes
before ą. To extend your example

a, ą, ab, ąb, abc, ac, ąc,

The correct Polish collation may consider a and ą to be different in
primary ordering, but the ordering in the Unicode collation does not.

Good luck,

Ann