Subject Re: [Firebird-Architect] UTF-8 and Compression
Author Ann W. Harrison
Olivier Mascia wrote:

> So a simplified view of what a collation table would be is a simple map
> of some unicode values to some other binary value. To sort according to
> the collation, the sort of the characters is done based on their
> unicode value passed through that map. Any character not considered by
> the collation is not in the map and sorts according to its original
> unicode value.

Actually, the collation problem will continue to be more complicated
than that because most serious collations are multi-level. There
isn't a simple byte-for-byte transformation that makes the sort "work".
The way the current collations work is to assign a byte value to the
"base" character (e.g. 'A'), then append bits that give values to
different accents, upper vs. lower case, and the following white space or
punctuation.

Here is Dave Schnepper's explanation of the issue:

The InterBase collation orders for ISO8859 (such as SV_SV) follow a full
linguistic (e.g. dictionary) collation order. In such a collation order,
spaces (and other punctuation marks) are of 4th-level importance.

1st order: A is different from B
2nd order: A is different from A-accent-grave
3rd order: A is different from a
4th order: The type of punctuation mark is important.

For instance, a dictionary sort gives:
Redwing
Red wing
Red-wing
Redwood
Red wood
Red worm

If spaces (& other punctuation) are treated as a first order difference,
the list becomes sorted as
Redwing
Redwood
Red wing
Red wood
Red worm
Red-wing

Which may be desirable, but isn't a dictionary sort.
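
To make the difference concrete, here is a rough Python sketch of the two
orderings above. It is not the InterBase implementation: the function names
dictionary_key and flat_key, the choice to rank lower case before upper case,
and the weights that put a space before a hyphen and both after the letters
are all assumptions picked only to reproduce the two lists.

```python
import unicodedata

def dictionary_key(text):
    """Hypothetical four-level key; earlier levels dominate later ones."""
    primary, secondary, tertiary, quaternary = [], [], [], []
    for ch in unicodedata.normalize("NFD", text):
        if unicodedata.combining(ch):
            # Level 2: an accent, tied to the base letter just before it.
            secondary.append((len(primary), ord(ch)))
        elif ch.isalpha():
            primary.append(ch.lower())      # Level 1: base letter
            tertiary.append(ch.isupper())   # Level 3: case (lower first, assumed)
        else:
            # Level 4: space, punctuation and anything else that is not a
            # letter, recorded by its position among the letters.
            quaternary.append((len(primary), ord(ch)))
    return (primary, secondary, tertiary, quaternary)

def flat_key(text):
    """Single-level key: punctuation is a first-order difference.

    Assumed weights: every letter sorts before a space, and a space
    sorts before a hyphen.
    """
    def weight(ch):
        if ch.isalpha():
            return (0, ch.lower())
        return (1, ord(ch))
    return [weight(ch) for ch in text]

words = ["Red worm", "Redwood", "Red-wing", "Red wing", "Red wood", "Redwing"]
dictionary_order = sorted(words, key=dictionary_key)
flat_order = sorted(words, key=flat_key)
```

With these assumptions, dictionary_order comes out as the first list and
flat_order as the second. The point of the sketch is that the space and the
hyphen never reach the first-level comparison in dictionary_key; they only
break ties once letters, accents and case have all compared equal.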


Regards,


Ann