Subject: RE: [Firebird-Architect] UTF-8 over UTF-16 WAS: Applications of Encoded Data Streams
Author: Svend Meyland Nicolaisen
> -----Original Message-----
> From: Firebird-Architect@yahoogroups.com
> [mailto:Firebird-Architect@yahoogroups.com] On Behalf Of Brad Pepers
> Sent: 3. maj 2005 19:42
> To: Firebird-Architect@yahoogroups.com
> Subject: Re: [Firebird-Architect] UTF-8 over UTF-16 WAS:
> Applications of Encoded Data Streams
>
> My third-party opinion would be to use Unicode everywhere
> internally and to pick one encoding to use for it (UTF-8 or
> UTF-16 being the most obvious choices). Then you will also
> need the ability for the client to specify the character set
> of the text it's expecting to use (which could include setting
> it to UTF-8 or UTF-16 which if it matched the encoding used
> internally would mean no encoding or decoding would be required).
> You should also allow specification of the character set for
> a database which would only be used as a filter of text
> received from the client to strip out characters it shouldn't
> be passing in. And finally you need to be able to specify
> the locale in order to do sorting and upper/lower case
> conversions properly.
>
> From this I can see a CharacterSet base class that defines
> conversion of a block of text from/to the internal Unicode
> encoding and returns the name of the character set. Make
> subclasses of this be loadable modules so you can add new
> character sets easily.  The Locale portion is more complex:
> the first design approximation is to have the same sort of
> class hierarchy as just mentioned for CharacterSet, with
> methods to compare two Unicode strings and order them for a
> given locale, and methods to convert a Unicode string to
> uppercase or lowercase.  I think that would all work, but
> there are some good Unicode documents on this whole issue,
> since it's more complicated than you might think.  A review
> of what the Unicode Consortium has to say on collation of
> Unicode strings would be a good starting point, and there may
> even be some code around to do a lot of this.  I'm using the
> ICU code from IBM internally in my product to handle a lot of
> I18N and L10N issues.
>
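The CharacterSet idea in the quoted message might look roughly like the
following Python sketch. Everything here is illustrative, not Firebird's
actual API: the class names and registry are made up, and Python's
built-in codecs stand in for a real engine's conversion tables.

```python
class CharacterSet:
    """Base class for converting between a client character set and
    the engine's internal Unicode representation (illustrative only)."""

    def name(self) -> str:
        raise NotImplementedError

    def to_unicode(self, raw: bytes) -> str:
        raise NotImplementedError

    def from_unicode(self, text: str) -> bytes:
        raise NotImplementedError


class CodecCharacterSet(CharacterSet):
    """Wraps one of Python's built-in codecs; an engine would instead
    load conversion tables from a pluggable module."""

    def __init__(self, codec_name: str):
        self._codec = codec_name

    def name(self) -> str:
        return self._codec

    def to_unicode(self, raw: bytes) -> str:
        return raw.decode(self._codec)

    def from_unicode(self, text: str) -> bytes:
        return text.encode(self._codec)


# A simple registry, so new character sets can be added without
# touching the engine core.
REGISTRY: dict[str, CharacterSet] = {}

def register(cs: CharacterSet) -> None:
    REGISTRY[cs.name()] = cs

register(CodecCharacterSet("utf-8"))
register(CodecCharacterSet("iso-8859-1"))

# A Latin-1 client needs conversion on the way in and out; a client
# that declares the same encoding the engine uses internally does not.
print(REGISTRY["iso-8859-1"].to_unicode(b"\xe6"))  # prints "æ"
```

The point of the registry is exactly the "loadable modules" idea above:
the engine core only ever talks to the CharacterSet interface.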
Yes, the Unicode Collation standard has a very good and detailed description
of how to implement collations for the entire Unicode character set for
different locales. One of the problems with Unicode collations is that the
sort key needs about four bytes per character (even more for some special
characters), even for strings that compress well in UTF-8 or UTF-16. This
means that a limit on the maximum size of indices can quickly become a
problem. I understand that Firebird 2 has an index size limit of 25% of the
page size in use. That is much better than the limit set by InterBase (256
characters?) but might not be good enough for Unicode sort keys.
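To make the size concern concrete, here is a back-of-the-envelope
calculation. The 25% figure is from the message above; the 4096-byte page
size is an assumed (common) value, and four bytes per character is the
rough sort-key cost mentioned earlier.

```python
# Rough arithmetic for the index-size concern above.
PAGE_SIZE = 4096                  # assumed page size, bytes
MAX_KEY_BYTES = PAGE_SIZE // 4    # 25%-of-page index key limit -> 1024

SORT_KEY_BYTES_PER_CHAR = 4       # approximate Unicode sort-key cost
max_collated_chars = MAX_KEY_BYTES // SORT_KEY_BYTES_PER_CHAR

print(max_collated_chars)         # prints 256
```

So with a 4 KB page, a fully weighted Unicode sort key would cap indexed
strings at roughly 256 characters, which is back at the old InterBase
limit mentioned above.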


/smn