Subject RE: [firebird-support] Unicode
Author David Johnson
> What will it take to make FireBird implement correct dictionary sort orders
> and upper/lower case mappings? There must be some fundamental problem since
> it hasn't been done yet or what?
>

1. A Java-like model that separates localization from character set

2. True multi-byte character support

Dictionary collation is not straightforward because language and culture
must be taken into account. In English, accents on characters are not
meaningful. In other languages, they are often not just accents, but
completely different characters (Å in Norwegian and Swedish). Swedes
expect accented characters after the end of the alphabet, while Germans
expect them inline. In German, "ss" and "ß" should be considered
identical. Spanish (Spain) and Spanish (Mexico) have different
collations. Etcetera.
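
To make the point concrete, here is a rough Python sketch of two
culture-specific sort keys applied to the same word list. The function
names (german_key, swedish_key) and the word list are made up for
illustration, and both keys are heavy simplifications of real collation
rules. Nothing here is Firebird code.

```python
import unicodedata

# Hypothetical, simplified "German dictionary" key: fold ß to ss and
# drop accents so that umlauts sort inline with their base letters.
def german_key(word):
    folded = word.lower().replace("ß", "ss")
    decomposed = unicodedata.normalize("NFD", folded)
    return "".join(c for c in decomposed if not unicodedata.combining(c))

# Hypothetical, simplified "Swedish" key: å, ä and ö are separate
# letters that come after z.
SWEDISH_ALPHABET = "abcdefghijklmnopqrstuvwxyzåäö"

def swedish_key(word):
    composed = unicodedata.normalize("NFC", word.lower())
    unknown = len(SWEDISH_ALPHABET)  # anything else sorts last
    return [SWEDISH_ALPHABET.index(c) if c in SWEDISH_ALPHABET else unknown
            for c in composed]

words = ["zebra", "ärm", "apa", "ost", "öl"]

print(sorted(words, key=german_key))
print(sorted(words, key=swedish_key))
```

The same five strings come out in two different orders, and neither
order is "wrong". That is why the collation has to be chosen separately
from the character set.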

In English, the collation problem is easy. Once you internationalize,
which is the intent of Unicode, it becomes more complex.

Track down the standards documents. Unicode Technical Standard #10 (the
Unicode Collation Algorithm) gives a lengthy description of how to build
collation algorithms for Unicode. The issue is not with any single
collation, but that every country and culture has its own expectations.

> I have no problems representing Unicode with widestrings in other
> applications so as I've said with the correct character set representation
> conversion it must be possible to use widestrings with FireBird.
>

You cannot fit all of Unicode into 2 bytes. In UTF-8, a two-byte
sequence carries only 11 bits of code point (the other 5 bits mark the
lead byte and the continuation byte). Covering the full Unicode range
takes 21 bits of code point, which is the minimum that a compliant
UTF-8 implementation has to handle.
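
For anyone who wants to check the numbers, this small Python sketch
prints how many bytes a few code points take in UTF-8 and in UTF-16
(the usual widestring encoding). The helper names are made up for
illustration. It relies on Python's built-in codecs, which follow the
current UTF-8 definition (RFC 3629, at most 4 bytes per character); the
earlier definition allowed sequences of up to 6 bytes.

```python
# Hypothetical helpers: encoded size of a single code point in bytes.
def utf8_length(code_point):
    return len(chr(code_point).encode("utf-8"))

def utf16_length(code_point):
    # "utf-16-le" avoids the 2-byte byte order mark that "utf-16" adds.
    return len(chr(code_point).encode("utf-16-le"))

# Boundary code points (surrogates U+D800..U+DFFF are left out because
# they cannot be encoded on their own).
samples = [0x7F, 0x80, 0x7FF, 0x800, 0xFFFF, 0x10000, 0x10FFFF]

for cp in samples:
    print(f"U+{cp:06X}: {cp.bit_length():2d} bits, "
          f"UTF-8 {utf8_length(cp)} bytes, UTF-16 {utf16_length(cp)} bytes")
```

Code points above U+FFFF need 4 bytes in both encodings, so a single
16-bit widestring element is not enough for them. In UTF-16 they are
stored as a surrogate pair.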

Full UTF-8 compliance demands features that do not exist in some
implementations (which extend the potential space requirements to 6
bytes per character). Since they are generally not needed, they are
often ignored. So far as I know, no one has established the code points
for Egyptian hieroglyphics.


> Well I am aware of all the pit-falls that the Delphi UI and the fonts (can)
> give when trying to display Unicode strings. The fact that Delphi's UI only
> supports 8 bit characters, and only to some extent multi byte character
> sets, does make development somewhat more difficult, but it can be made to
> work if you can live with the limit that you only can use one 8-bit codepage
> "per string" you want to display. Using Windows API functions, like
> DrawTextW, unicode/widestrings can be displayed directly without fiddling
> with character set and/or code pages.