firebird-architect - RE: [Firebird-Architect] Unicode collations and indices WAS: UTF-8 over UTF-16

Subject	RE: [Firebird-Architect] Unicode collations and indices WAS: UTF-8 over UTF-16
Author	Svend Meyland Nicolaisen
Post date	2005-05-03T21:34:14Z

> -----Original Message-----
> From: Firebird-Architect@yahoogroups.com
> [mailto:Firebird-Architect@yahoogroups.com] On Behalf Of Jim Starkey
> Sent: 3. maj 2005 22:43
> To: Firebird-Architect@yahoogroups.com
> Subject: Re: [Firebird-Architect] UTF-8 over UTF-16 WAS:
> Applications of Encoded Data Streams
>
> Svend Meyland Nicolaisen wrote:
>
> >Yes the Unicode Collation standard has a very good and detailed
> >description of how to implement collations for the entire Unicode
> >character set for different locales. One of the problems with Unicode
> >collations is that the sort key needs about four bytes per character
> >(for some special characters even more), even for strings that can be
> >compressed using UTF-8 or UTF-16.
> >This means that a limit on the maximum size of indices quickly can
> >become a problem. I understand that Firebird 2 has an index size limit
> >of 25% of the used page size. It is much better than the limit set by
> >InterBase (256 characters?) but might not be good enough for Unicode
> >sort keys.
> >
> >
> >
> Collations and encoding are separate problems except,
> perhaps, for the code that implements the collation. The
> fact that the Unicode guys have a universal, worldwide
> collation doesn't mean we have to use it.
> People seem quite happy with character set specific
> collations, but if somebody wanted to implement the universal
> collation, that would work, too.
>

Unicode does not have one universal collation but rather a collation for
(almost) each locale in the world. To be "Unicode compliant", however, you
need to implement the collations as defined in the Unicode standards.

I agree that a collation has nothing to do with the encoding of the
character set, but it does affect the demands placed on indices, because
the index has to store the sort key that the collation produces.
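
To put rough numbers on that, here is a small Python sketch of the
arithmetic. It assumes only the two figures already mentioned in this
thread (about four bytes of sort key per character, and an index key limit
of 25% of the page size); the function name and constants are made up for
illustration and are not taken from the Firebird source.

```python
# Back-of-the-envelope estimate only. Both constants are assumptions
# based on the figures quoted in this thread, not measured values.
KEY_BYTES_PER_CHAR = 4    # approximate Unicode sort-key cost per character
INDEX_LIMIT_DIVISOR = 4   # key limit assumed to be page_size / 4 (25%)


def max_indexable_chars(page_size: int) -> int:
    """Approximate longest string, in characters, whose sort key fits."""
    key_limit_bytes = page_size // INDEX_LIMIT_DIVISOR
    return key_limit_bytes // KEY_BYTES_PER_CHAR


if __name__ == "__main__":
    for page_size in (1024, 2048, 4096, 8192, 16384):
        print(page_size, max_indexable_chars(page_size))
```

Under these assumptions a 4096-byte page leaves room for roughly 256
characters in a single-column key, and fewer if some characters need more
than four bytes.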

If your native character set is based on Unicode, then I would find it
strange not to use a compliant Unicode collation that takes all the
necessary code points into account. Why use Unicode and not be truly
Unicode compliant?

/Svend