Subject RE: [firebird-support] Unicode
Author Svend Meyland Nicolaisen
> -----Original Message-----
> From: David Johnson [mailto:d_johnson@...]
> Sent: 11. april 2005 15:28
> To:
> Subject: RE: [firebird-support] Unicode
> > What will it take to make FireBird implement correct dictionary sort
> > orders and upper/lower case mappings? There must be some fundamental
> > problem since it hasn't been done yet or what?
> >
> 1. A java-like model that separates localization from character set
> 2. True multi-byte character support
> Dictionary collation is not straightforward because language
> and culture must be taken into account. In English, accents
> on characters are not meaningful. In other languages, they
> are often not just accents, but completely different
> characters (Å in Norwegian and Swedish). Swiss expect
> accented characters after the end of the alphabet, while
> Germans expect them inline. In German, "ss" and "ß" should
> be considered identical. Spanish (Spain) and Spanish
> (Mexico) have different collations. Etcetera.
> In English, the collation problem is easy. Once you
> internationalize, which is the intent of Unicode, it becomes
> more complex.
> Track down the standards documents. There is a lengthy
> description of how to build collation algorithms for Unicode.
> The issue is not with any single collation, but that every
> country and culture has its own expectations.

Thank you for the explanation. :-)
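
To make the point about culture-dependent ordering concrete, here is a minimal Python sketch. It is only an illustration, not how FireBird or any real collation library works: `german_key` and `swedish_key` are hypothetical helpers, and the "German" rule (fold accents, treat "ß" as "ss") and the "Swedish" rule (å, ä, ö after z) are deliberate simplifications of what the Unicode Collation Algorithm describes.

```python
import unicodedata

def german_key(s):
    # Simplified "inline" rule: casefold (which maps ß to ss), then drop
    # combining marks so that ä sorts together with a.
    decomposed = unicodedata.normalize("NFD", s.casefold())
    return "".join(c for c in decomposed if not unicodedata.combining(c))

# Simplified Swedish rule: å, ä, ö are separate letters after z.
SWEDISH_ALPHABET = "abcdefghijklmnopqrstuvwxyzåäö"

def swedish_key(s):
    s = unicodedata.normalize("NFC", s.lower())
    # Characters outside the alphabet sort after it, in code point order.
    return [SWEDISH_ALPHABET.index(c) if c in SWEDISH_ALPHABET
            else len(SWEDISH_ALPHABET) + ord(c) for c in s]

# Normalize to NFC so the result does not depend on how this file was saved.
words = [unicodedata.normalize("NFC", w)
         for w in ["zon", "äpple", "apa", "år", "orm"]]

print(sorted(words))                   # plain code point order
print(sorted(words, key=german_key))   # accents sorted inline
print(sorted(words, key=swedish_key))  # å, ä, ö after z
```

The same five words come out in three different orders, so a single built-in sort order cannot serve every locale.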

> > I have no problems representing Unicode with widestrings in other
> > applications so as I've said with the correct character set
> > representation conversion it must be possible to use widestrings
> > with FireBird.
> >
> You can represent up to 14 bits of code points in 2 bytes (2
> bits are used to indicate that the first byte is one byte of
> a two byte pair).
> However, you cannot represent 21 bits of code points, which is the
> minimum representation for a minimum UTF-8 compliant
> character encoding.
> Full UTF-8 compliance demands features that do not exist in
> some implementations (which extend the potential space
> requirements to 6 bytes per character). Since they are
> generally not needed, they are often ignored. So far as I
> know, no one has established the code points for Egyptian
> hieroglyphics.

OK, I wasn’t aware that Unicode reached beyond 16 bits, but now that I have
read the latest specification I see that Unicode indeed comprises up to 21
bits, in the range 0x000000 to 0x10FFFF.

Widestrings are encoded in UCS-2 format, and if you are careful about which
API functions you use, you can also store UTF-16 in them. Widestrings can
therefore represent up to 16 bits of code points in UCS-2 format and the
full Unicode range in UTF-16. The Windows API can display widestrings
encoded in UTF-16 if UTF-16 support is enabled on the system and the
necessary fonts are available.
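
As a sketch of the difference between UCS-2 and UTF-16, the Python snippet below shows how a code point above 0xFFFF is split into two 16-bit units (a surrogate pair). `to_surrogate_pair` and `utf16_units` are hypothetical helper names; the arithmetic follows the UTF-16 definition, and U+1D11E (MUSICAL SYMBOL G CLEF) is just a convenient test character.

```python
def to_surrogate_pair(code_point):
    # Split a supplementary code point into a high and a low surrogate.
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("only code points above 0xFFFF need a surrogate pair")
    offset = code_point - 0x10000            # 20 bits remain
    high = 0xD800 + (offset >> 10)           # top 10 bits
    low = 0xDC00 + (offset & 0x3FF)          # bottom 10 bits
    return high, low

def utf16_units(text):
    # The 16-bit code units a widestring would hold for this text.
    data = text.encode("utf-16-le")
    return [int.from_bytes(data[i:i + 2], "little")
            for i in range(0, len(data), 2)]

print([hex(u) for u in utf16_units("\u00c5")])       # one unit, same as UCS-2
print([hex(u) for u in utf16_units("\U0001D11E")])   # two units
print([hex(u) for u in to_surrogate_pair(0x1D11E)])  # same pair, computed by hand
```

Code that assumes one 16-bit unit per character (the UCS-2 view) would count the second string as two characters, which is why the choice of API functions matters.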