Subject Fw: [Firebird-Architect] Re: Character Set Support
Author peter_jacobi.rm
Hi Daniel,

--- In Firebird-Architect@yahoogroups.com, Daniel Rail wrote:
> One scenario that is possible to have with some of my applications,
> and that is to sort english, spanish and french names, just to name
> a few, that are mixed together. Which would be the best collation to
> implement to support this kind of sort?

Conceptually, it doesn't depend on the language the database data is
in, but on the language of the user. The user should be presented
with a sort order matching his expectations.

If you don't want to speculate on the user's expectations, or if you
expect users from several different cultures, the Unicode default sort
order is called for. I assume charset ISO-8859-1 collate EN_US would
be the nearest FB equivalent.

> Would an accent insensitive
> collation be worth while? Those 3 languages are the most common
> that I deal with.

In my understanding, the demand for accent insensitive (and case
insensitive) collations is bogus. Unfortunately, some limitations
in FB make it difficult to get rid of this demand entirely.

Let me elaborate:

For sorting purposes, you usually don't want strings that differ
only in accents to come out in random relative order. But that's what
a true accent insensitive collation would mean. You want them
next to each other, and apart from all strings which have truly
distinct characters. For this purpose, the <language>_<country>
multi-level collations already do the job!
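
To illustrate the multi-level idea, here is a minimal sketch in Python.
It is not how the FB collation drivers are implemented; the function
name multilevel_key and the three levels (base letters, then accents,
then case) are assumptions made for the example, and the accent
ordering is a crude approximation based on Unicode decomposition.

```python
import unicodedata

def multilevel_key(s):
    """Hypothetical three-level sort key: base letters, accents, case."""
    nfd = unicodedata.normalize("NFD", s)
    # Level 1: letters only, with combining accents stripped and case folded.
    primary = "".join(c for c in nfd if not unicodedata.combining(c)).casefold()
    # Level 2: accents kept, case folded. Only consulted when level 1 ties.
    secondary = nfd.casefold()
    # Level 3: the original string, so case differences break remaining ties.
    return (primary, secondary, s)

names = ["Pérez", "Perez", "Peters", "Peña", "Pena", "perez"]
print(sorted(names, key=multilevel_key))
# ['Pena', 'Peña', 'Perez', 'perez', 'Pérez', 'Peters']
```

Strings that differ only in accents or case end up adjacent, in a
defined order, and still apart from "Peters". A plain code-point sort
would instead put "Peña" after "Peters".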

For searching purposes, it would be nice to re-use the same index, if
any, as for sorting. When an accent insensitive search is called for,
can the multi-level collations be used? All strings matching the
accent insensitive search criteria form an interval in the multi-level
collation, so a BETWEEN query would return them. What is missing is
a built-in method to query the collation for the interval borders
to use in this query. This must be faked within the app logic.
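
A rough sketch of faking the interval borders in app logic, again in
Python. Everything here is an assumption for illustration: the sorted
list stands in for the index, multilevel_key is the same hypothetical
three-level key as in any multi-level scheme, and the upper border
relies on no second-level key ever sorting above U+10FFFF.

```python
import bisect
import unicodedata

def multilevel_key(s):
    """Hypothetical three-level sort key: base letters, accents, case."""
    nfd = unicodedata.normalize("NFD", s)
    primary = "".join(c for c in nfd if not unicodedata.combining(c)).casefold()
    return (primary, nfd.casefold(), s)

def interval_borders(term):
    """Lowest and highest keys that share the term's first-level key."""
    primary = multilevel_key(term)[0]
    low = (primary,)                 # a prefix tuple sorts before any full key
    high = (primary, "\U0010FFFF")   # assumed to sort after any second level
    return low, high

def accent_insensitive_search(sorted_names, term):
    """Return the contiguous slice matching term, ignoring accents and case."""
    keys = [multilevel_key(n) for n in sorted_names]
    low, high = interval_borders(term)
    start = bisect.bisect_left(keys, low)
    end = bisect.bisect_right(keys, high)
    return sorted_names[start:end]

names = sorted(["Pérez", "Perez", "Peters", "Peña", "Pena", "perez"],
               key=multilevel_key)
print(accent_insensitive_search(names, "perez"))
# ['Perez', 'perez', 'Pérez']
```

In SQL terms, the two borders would play the role of the low and high
values of the BETWEEN, so the sorting index can serve the search too.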

Of course a brute-force, case insensitive, accent insensitive
collation would have the benefit of using only one byte per character,
keeping the indices smaller and giving a larger limit of indexable
column length.

Such a beast is included in my pjcolkit_ver_0_3 at
http://www.jodelpeter.de/i18n/fbarch/index.htm
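
For illustration only, and not taken from pjcolkit, a brute-force
one-byte-per-character key for ISO-8859-1 could be sketched in Python
as below. The names build_fold_table and fold_key are made up for the
example, and leaving characters without a one-byte base letter (such
as the sharp s) unchanged is an arbitrary choice.

```python
import unicodedata

def build_fold_table():
    """256-entry table mapping each Latin-1 byte to one folded byte."""
    table = bytearray(range(256))
    for b in range(256):
        ch = bytes([b]).decode("latin-1")
        # Strip accents via decomposition, then fold case by uppercasing.
        base = unicodedata.normalize("NFD", ch)[0].upper()
        encoded = base.encode("latin-1", errors="ignore")
        # Keep the byte unchanged when the result is not exactly one byte.
        if len(encoded) == 1:
            table[b] = encoded[0]
    return bytes(table)

FOLD_TABLE = build_fold_table()

def fold_key(s):
    """Sort key with exactly one byte per input character."""
    return s.encode("latin-1").translate(FOLD_TABLE)

print(fold_key("Pérez"), fold_key("perez"), fold_key("Peña"))
# b'PEREZ' b'PEREZ' b'PENA'
```

The key is as short as the string itself, but "Pérez" and "perez" now
compare equal, so their relative order in a sort is undefined.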

It would be a small but rewarding exercise to extend this to UTF-8, so
let's hope I have some time next weekend. (But the promised Thai
collation is months overdue; I seriously need a holiday to get more
FB stuff done.)

Regards,
Peter Jacobi



>
> --
> Best regards,
> Daniel Rail
> Senior Software Developer
> ACCRA Group Inc. (www.accra.ca)
> ACCRA Med Software Inc. (www.filopto.com)