Subject Re: [Firebird-Architect] A Fresh Look at Collations
Author Jim Starkey
Sergey Mereutsa wrote:
> PR> Perhaps not as easy as fixed length characters, but for that the
> PR> only real solution is to move to utf-32. utf-16 (as different from ucs-2)
> PR> is not a solution, as utf-16 has both 2 byte and 4 byte sequences (and
> PR> support for the 4 byte sequence is required in China).
>
> This was the point - you must count all bytes to know how many
> _characters_ are in the string, and you cannot say whether an address like
> string[myIndex] is valid or not _without_ walking (in the worst case)
> all _bytes_ of this string.
>
That's not true. A point in a utf8 string is valid unless

(string[myIndex] & 0xC0) == 0x80

or, probably better,

utf8Lengths[string[myIndex]] == 0

(table utf8Lengths available on request). But if you want a substring
starting with the n-th character, you will have to start counting from
the front.
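
As a minimal sketch of the two points above, in C: the boundary test is the mask check quoted earlier, and finding the n-th character walks the bytes from the front. The function names are made up for illustration, and utf8_length is only a hypothetical stand-in for the utf8Lengths table (which is not reproduced here); it assumes the usual UTF-8 lead-byte ranges.

```c
#include <assert.h>
#include <stddef.h>

/* Nonzero if byte index i starts a character, i.e. the byte
   is not a 10xxxxxx continuation byte. No scanning needed. */
static int utf8_is_boundary(const unsigned char *s, size_t i)
{
    return (s[i] & 0xC0) != 0x80;
}

/* Hypothetical stand-in for a utf8Lengths lookup: sequence length
   implied by a lead byte, 0 for bytes that cannot start a character. */
static int utf8_length(unsigned char b)
{
    if (b < 0x80) return 1;   /* ASCII */
    if (b < 0xC2) return 0;   /* continuation bytes, overlong leads C0/C1 */
    if (b < 0xE0) return 2;
    if (b < 0xF0) return 3;
    if (b < 0xF5) return 4;
    return 0;                 /* F5..FF never appear in UTF-8 */
}

/* Byte offset of the n-th character (0-based), or len if the string
   has fewer than n + 1 characters. This is the part that has to
   count from the front. */
static size_t utf8_offset_of_char(const unsigned char *s, size_t len, size_t n)
{
    assert(s != NULL);
    for (size_t i = 0; i < len; i++) {
        if (utf8_is_boundary(s, i)) {
            if (n == 0)
                return i;
            n--;
        }
    }
    return len;
}
```

For the four bytes 'a', 0xC3, 0xA9, 'z' (the string "a", e-acute, "z"), index 2 is not a boundary, and the third character starts at byte offset 3.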

> PR> In my perception most newly written C/C++ code uses 32 bits to represent
> PR> single characters and utf-8 to represent strings. I'm not sure what the
> PR> current state of play is in the Java and .net worlds. Anyone in this list
> PR> with a view?
>
> C/C++ (at least GCC) define char[] as a byte array. Some (non-standard)
> classes like UTF8String allow you to manipulate strings in UTF-8.
> Since UTF-8 is safe from the ASCII point of view, you can use UTF-8 for
> your sources.
>
I think you mean that ASCII is a proper subset of utf8.
> Java initially used UTF-8 as the encoding for sources (if I remember
> correctly).
> PHP does not care about source encoding - it is the programmer's
> responsibility, but you must use the mb_* prefixed functions when you are
> working with multibyte characters.
>
Incorrect. Java originally used 16 bit Unicode. Some classes, however,
use utf16. This can be confusing, but for most of the non-Asian
world, 16 bit Unicode is a subset of utf16.
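
To make the difference concrete, here is a minimal sketch in C of encoding one code point as utf16. The function name utf16_encode is made up for illustration and is not from any library. Code points below 0x10000 come out as a single 16 bit unit with the same value, which is why 16 bit Unicode looks like a subset; anything above needs a surrogate pair.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Encodes one code point as UTF-16 into out[0..1].
   Returns the number of 16 bit units written (1 or 2), or 0 if
   cp is a surrogate value or lies beyond U+10FFFF. */
static int utf16_encode(uint32_t cp, uint16_t out[2])
{
    assert(out != NULL);
    if (cp >= 0xD800 && cp <= 0xDFFF)
        return 0;                       /* reserved for surrogates */
    if (cp > 0x10FFFF)
        return 0;                       /* outside Unicode */
    if (cp < 0x10000) {
        out[0] = (uint16_t)cp;          /* same value as 16 bit Unicode */
        return 1;
    }
    cp -= 0x10000;                      /* 20 bits remain */
    out[0] = (uint16_t)(0xD800 | (cp >> 10));    /* high surrogate */
    out[1] = (uint16_t)(0xDC00 | (cp & 0x3FF));  /* low surrogate */
    return 2;
}
```

For example, U+0416 (Cyrillic Zhe) encodes as the single unit 0x0416, while U+20000 (a CJK ideograph outside the 16 bit range) encodes as the pair 0xD840 0xDC00.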
> PR> However, we are getting away from collations, which is related to
> PR> encodings but a different topic really.
>
> Yes, but it raises another question - which letter must come first, if we
> order strings alphabetically - Hebrew "alef", Greek "alpha", Latin "A", Russian
> "A", Bulgarian "A" or Ukrainian "A" and why?
>
There is a Default Unicode Collation Element Table (DUCET) that defines,
well, the default Unicode collation. But it is only the default. A
collation designer can implement any rules he or she wants.

The hard part of collation is not how base characters are collated but
how to handle accents, case, ligatures (characters that collate as two),
and character sequences that collate as a single character. Default
collations order by base characters first, then by accents if the base
characters are equal, then by case if the accents are also equal. To make
things exciting, the default collation compares accents in string
order while French compares accents in reverse string order. And then
there are the countries that go out of their way to define a national
collating sequence arbitrarily different from their least-favored
neighbors. It should be obvious at this point that a universally
acceptable default collation sequence is folly.
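
The base, accent, case ordering can be sketched in C as a multi-level compare. This is an illustration only, not the Unicode algorithm: it assumes each character has already been mapped to exactly one element with three weights, so ligatures and multi-character sequences are left out, and the CollationElement type, the function names, and any weight values are made up.

```c
#include <assert.h>
#include <stddef.h>

typedef struct {
    unsigned primary;    /* base character */
    unsigned secondary;  /* accent, 0 = none */
    unsigned tertiary;   /* case, 0 = lower, 1 = upper */
} CollationElement;

enum { LEVEL_PRIMARY, LEVEL_SECONDARY, LEVEL_TERTIARY };

static unsigned weight_at(const CollationElement *e, int level)
{
    if (level == LEVEL_PRIMARY)   return e->primary;
    if (level == LEVEL_SECONDARY) return e->secondary;
    return e->tertiary;
}

/* Compares one level across both strings. With backward set, the
   elements are walked from the end of each string (the French rule). */
static int compare_level(const CollationElement *a, size_t na,
                         const CollationElement *b, size_t nb,
                         int level, int backward)
{
    size_t n = na < nb ? na : nb;
    for (size_t k = 0; k < n; k++) {
        size_t ia = backward ? na - 1 - k : k;
        size_t ib = backward ? nb - 1 - k : k;
        unsigned wa = weight_at(&a[ia], level);
        unsigned wb = weight_at(&b[ib], level);
        if (wa != wb)
            return wa < wb ? -1 : 1;
    }
    if (na != nb)
        return na < nb ? -1 : 1;   /* shorter string sorts first */
    return 0;
}

/* Base characters first, then accents, then case. Returns <0, 0, >0. */
static int collate_compare(const CollationElement *a, size_t na,
                           const CollationElement *b, size_t nb,
                           int french_accents)
{
    assert(a != NULL && b != NULL);
    int r = compare_level(a, na, b, nb, LEVEL_PRIMARY, 0);
    if (r == 0)
        r = compare_level(a, na, b, nb, LEVEL_SECONDARY, french_accents);
    if (r == 0)
        r = compare_level(a, na, b, nb, LEVEL_TERTIARY, 0);
    return r;
}
```

With elements for "cote" carrying an acute accent on the final e, and "cote" carrying a circumflex on the o, the first sorts before the second with french_accents set to 0 (the first accent difference from the front decides) and after it with french_accents set to 1 (the first accent difference from the back decides).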

>
> P.S. Personally, I prefer to work with UTF-8 in its native form (with
> some exceptions, where speed comes first) - you do not care
> about endianness and you can send it over the network without any change.
>
>


--
Jim Starkey
Founder, NimbusDB, Inc.
978 526-1376