Subject | Re: [Firebird-Architect] UTF-8 and Compression |
---|---|
Author | Ann W. Harrison |
Post date | 2005-03-01T18:49:48Z |
Olivier Mascia wrote:
> So a simplified view of what a collation table would be is a simple map
> of some unicode values to some other binary value. To sort according to
> the collation, the sort of the characters is done based on their
> unicode value passed through that map. Any character not considered by
> the collation is not in the map and sorts according to its original
> unicode value.

Actually, the collation problem will continue to be more complicated
than that because most serious collations are multi-level. There
isn't a simple byte for byte transformation that makes the sort "work".
The way the current collations work is to assign a byte value to the
"base" character (e.g. 'A'), then append bits that give a value to
accents, upper vs. lower case, and any following white space or
punctuation.
Here is Dave Schnepper's explanation of the issue:
The InterBase collation orders for ISO8859 (such as SV_SV) follow a full
linguistic (e.g. dictionary) collation order. In such a collation order
spaces (and other punctuation marks) are of 4th-level importance.
1st order: A is different from B
2nd order: A is different from A-accent-grave
3rd order: A is different from a
4th order: The type of punctuation mark is important.
For instance:
Redwing
Red wing
Red-wing
Redwood
Red wood
Red worm
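The four levels can be sketched as a tuple-valued sort key in which each level only breaks ties left by the one before it. This is a minimal illustration, not how the InterBase collation drivers are implemented: `dictionary_key` and `PUNCT_WEIGHT` are hypothetical names, the punctuation weights are assumed, and the choice to sort lower case before upper case is also an assumption, since the text above does not say which comes first.

```python
import unicodedata

# Hypothetical level-4 weights (an assumption): a space sorts before a hyphen,
# and a word with no punctuation at all sorts before both.
PUNCT_WEIGHT = {" ": 1, "-": 2}

def dictionary_key(text):
    """Build a four-level sort key; earlier levels dominate later ones."""
    base, accents, case, punct = [], [], [], []
    for ch in unicodedata.normalize("NFD", text):
        if ch in PUNCT_WEIGHT:
            # Level 4: where the mark occurs and which mark it is.
            punct.append((len(base), PUNCT_WEIGHT[ch]))
        elif unicodedata.combining(ch):
            # Level 2: an accent attached to the preceding base letter.
            accents.append((len(base), ord(ch)))
        else:
            # Level 1: the base letter, ignoring case.
            base.append(ch.lower())
            # Level 3: case (assumed here: lower case before upper case).
            case.append(0 if ch.islower() else 1)
    return (base, accents, case, punct)

words = ["Red worm", "Red-wing", "Redwood", "Red wing", "Redwing", "Red wood"]
dictionary_order = sorted(words, key=dictionary_key)
```

With these assumed weights, `dictionary_order` comes out in the same order as the list above, because "Redwing", "Red wing" and "Red-wing" are identical on the first three levels and differ only on the fourth.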
If spaces (and other punctuation) are treated as a first-order difference,
the list becomes sorted as:
Redwing
Redwood
Red wing
Red wood
Red worm
Red-wing
Which may be desirable, but isn't a dictionary sort.
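For contrast, here is a small sketch of the single-level case, where punctuation takes part in the first-order comparison. The weights are an assumption chosen to match the list above (letters first, then space, then hyphen); `first_order_key` and `PUNCT_WEIGHT` are hypothetical names, not part of any InterBase or Firebird API.

```python
# Assumed weights: every letter sorts before a space, and a space before a hyphen.
PUNCT_WEIGHT = {" ": 1, "-": 2}

def first_order_key(text):
    """Single-level key: punctuation competes directly with letters."""
    key = []
    for ch in text:
        if ch in PUNCT_WEIGHT:
            # Punctuation gets a weight class that sorts after all letters.
            key.append((1, PUNCT_WEIGHT[ch]))
        else:
            # Letters compare case-insensitively by code point.
            key.append((0, ord(ch.lower())))
    return key

words = ["Red worm", "Red-wing", "Redwood", "Red wing", "Redwing", "Red wood"]
first_order = sorted(words, key=first_order_key)
```

Under these assumptions `first_order` matches the second list: the fourth character decides the comparison, so "Redwing" and "Redwood" are pulled away from "Red wing" and "Red wood".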
Regards,
Ann