firebird-architect - Re: UTF-8 vs UTF-16

Subject	Re: UTF-8 vs UTF-16
Author	peter_jacobi.rm
Post date	2003-08-27T07:42:12Z

Hi Dimitry, All,

(meta: it doesn't seem so, but there is an interesting
point of debate further down)

--- In Firebird-Architect@yahoogroups.com, "Dimitry Sibiryakov":

> Actually my question was a bit different: does sorting order of a
> character (accented or not - doesn't matter) with the same binary
> representation in UNICODE (can't say better, sorry) depend on
> language where it is used?

Language and country in fact; for example, German-language
sorting is different in Germany, Switzerland and Austria.

And in fact that won't go away with UNICODE; it can only be
expressed somewhat more cleanly.

Comparing is (at least conceptually) done by mapping
to byte strings which are then compared using
normal binary comparison. These byte strings are also
stored in the indices, so that walking the index only
involves straight binary comparisons.

The mapping from UNICODE string u to byte string s
depends on language and country (and in weird cases
like Germany and Hungary there are even two different
conventions in one country):

u -> s = sortkey (u, language, country);
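
As a rough sketch of that idea (Python; the folding tables and the
single "convention" argument standing in for language and country are
made up for illustration, real collation tables are far richer):

```python
# Toy folding tables, invented for illustration only.
# "dictionary" treats umlauts like the base vowel,
# "phonebook" expands them to vowel + e.
FOLD = {
    "dictionary": {"ä": "a", "ö": "o", "ü": "u", "ß": "ss"},
    "phonebook": {"ä": "ae", "ö": "oe", "ü": "ue", "ß": "ss"},
}

def sortkey(u, convention):
    """Map a Unicode string to a byte string that compares in collation order."""
    table = FOLD[convention]
    folded = "".join(table.get(ch, ch) for ch in u.lower())
    return folded.encode("utf-8")

words = ["Motte", "Möbel", "Mode"]

# Sorting only ever compares the byte strings, never the original text.
by_dictionary = sorted(words, key=lambda w: sortkey(w, "dictionary"))
by_phonebook = sorted(words, key=lambda w: sortkey(w, "phonebook"))

print(by_dictionary)  # ['Möbel', 'Mode', 'Motte']
print(by_phonebook)   # ['Mode', 'Möbel', 'Motte']
```

The same three strings come out in a different order depending only on
the convention, while the comparison itself stays a plain binary one.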

Getting the sort key for a string c in another
character set can be reduced to the UNICODE case

c -> s = sortkey (to_unicode (c), language, country);

So, apart from the mapping to UNICODE, there are no
differences between character sets for a given
language and country.
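
A small sketch of the reduction (Python; to_unicode and the sortkey
stand-in are hypothetical helpers, and the stand-in ignores language
and country entirely, it is only there to show the plumbing):

```python
def to_unicode(c, charset):
    """Decode a byte string in a legacy character set to a Unicode string."""
    return c.decode(charset)

def sortkey(u):
    # Stand-in for a real, locale-aware sort key: casefolded UTF-8 bytes.
    return u.casefold().encode("utf-8")

text = "Привет"
as_cp1251 = text.encode("cp1251")
as_koi8r = text.encode("koi8_r")

# The stored bytes differ between the two character sets ...
print(as_cp1251 != as_koi8r)

# ... but after the mapping to Unicode the sort keys are identical.
key_from_cp1251 = sortkey(to_unicode(as_cp1251, "cp1251"))
key_from_koi8r = sortkey(to_unicode(as_koi8r, "koi8_r"))
print(key_from_cp1251 == key_from_koi8r)
```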

But there is the issue of whether different character
sets should have different languages and locales
as defaults - i.e. what to use when no explicit
SQL COLLATE is given.

In the original Firebird architecture, the default
locale for every character set is the "C" locale, which
knows zilch about non-ASCII chars.
Sorting is done by codepoint order and uppercasing is
only done for 'a'-'z'.
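
To make the effect concrete, here is a sketch that imitates this
behaviour on CP1251 data (Python; c_upper is a made-up helper that
mimics the "C" locale rule, it is not Firebird code):

```python
def c_upper(raw):
    """Uppercase only the ASCII bytes 'a'..'z'; every other byte is left alone."""
    return bytes(b - 0x20 if 0x61 <= b <= 0x7A else b for b in raw)

mixed = "привет abc".encode("cp1251")
print(c_upper(mixed).decode("cp1251"))  # привет ABC

# Sorting by raw byte value: in CP1251 'ё' is 0xB8, below 'а' (0xE0),
# so it sorts in front of the whole alphabet.
words = ["яма", "ёж", "арбуз"]
by_bytes = sorted(words, key=lambda w: w.encode("cp1251"))
print(by_bytes)  # ['ёж', 'арбуз', 'яма']
```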

In my not so humble opinion this is a source of major
frustration.

Given how much sense this makes for Cyrillic, Nickolay
at least hacked the correct uppercasing into CP1251.

So what would be a better choice for default locale
behaviour? Should each character set have its own
different default locale? This would work sort-of for
CP1251 or CP1254 (Turkish, if I don't mix it up). Also CP1252
may be ceded to en-US. But which language gets the preference
for CP1250?

Doesn't seem right to me.

My personal vote goes to:

Default locale for every character set is
- sorting by UNICODE codepoint number
- uppercasing by default UNICODE uppercasing table

Benefits:
- Consistent sorting independent of character set.
- Key length doesn't increase relative to current situation
- Uppercasing is 'right'
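
A sketch of what this default would mean (Python; default_sortkey and
default_upper are hypothetical names, and Python's str.upper() is used
here as an approximation of the default UNICODE uppercasing table):

```python
def default_sortkey(raw, charset):
    """Order by UNICODE codepoint number, whatever the storage character set."""
    return tuple(ord(ch) for ch in raw.decode(charset))

def default_upper(raw, charset):
    """Uppercase via Unicode case mapping, then go back to the character set."""
    return raw.decode(charset).upper().encode(charset)

words = ["яма", "вода", "арбуз"]

def sort_in(charset):
    encoded = [w.encode(charset) for w in words]
    ordered = sorted(encoded, key=lambda raw: default_sortkey(raw, charset))
    return [raw.decode(charset) for raw in ordered]

def raw_sort_in(charset):
    # Current behaviour for comparison: plain byte order of the stored data.
    return [raw.decode(charset) for raw in sorted(w.encode(charset) for w in words)]

print(sort_in("cp1251"), sort_in("koi8_r"))          # same order for both
print(raw_sort_in("cp1251"), raw_sort_in("koi8_r"))  # order depends on charset
print(default_upper("привет".encode("koi8_r"), "koi8_r").decode("koi8_r"))
```

With byte order, KOI8-R puts "яма" before "вода"; with codepoint order
both character sets agree.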

Drawbacks:
- Compatibility? What will happen to existing databases?

For other roads to enlightenment, ...err simplification,
see my "CHAR, NCHAR defaults" posting in this thread.

Regards,
Peter Jacobi