Subject | Re: [Firebird-Architect] A Fresh Look at Collations |
---|---|
Author | Jim Starkey |
Post date | 2010-06-21T19:38:54Z |
Sergey Mereutsa wrote:
> PR> Perhaps not as easy as fixed length characters, but for that the
> PR> only real solution is to move to utf-32. utf-16 (as different from ucs-16)
> PR> is not a solution, as utf-16 has both 2 byte and 4 byte sequences (and
> PR> support for the 4 byte sequence is required in China).
>
> This was the point - you must count all bytes to know how many
> _characters_ are in the string and you can not say if address like
> string[myIndex] is valid or not _without_ walking (in the worst case)
> all _bytes_ of this string.
>
That's not true. A point in a utf8 string is valid unless
(string[myIndex] & 0xC0) == 0x80
or, probably better,
utf8Lengths[string[myIndex]] == 0
(table utf8Lengths available on request). But if you want a substring
starting with the n-th character, you will have to start counting from
the front.
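
A minimal sketch of both points, assuming well-formed UTF-8 and taking "character" to mean one encoded code point. The function names utf8_is_char_start and utf8_offset_of_char are hypothetical, chosen for illustration, and this uses the bit test rather than the utf8Lengths table, whose contents are not given here:

```c
#include <stddef.h>

/* Returns 1 if byte i begins a character, 0 if it is a continuation
   byte (bit pattern 10xxxxxx). This is the constant-time validity
   check for an arbitrary byte index. */
int utf8_is_char_start(const unsigned char *s, size_t i)
{
    return (s[i] & 0xC0) != 0x80;
}

/* Returns the byte offset of the n-th character (0-based), or -1 if
   the string holds fewer than n + 1 characters. There is no shortcut:
   the scan has to start counting from the front. */
long utf8_offset_of_char(const unsigned char *s, size_t len, size_t n)
{
    size_t seen = 0;
    size_t i;

    for (i = 0; i < len; i++) {
        if (utf8_is_char_start(s, i)) {
            if (seen == n)
                return (long)i;
            seen++;
        }
    }
    return -1;
}
```

For the four bytes 'a', 0xC3, 0xA9, 'z' (an "a", an e-acute, and a "z"), index 2 is rejected as a continuation byte, and the third character starts at byte offset 3.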
> PR> In my perception most newly written C/C++ code uses 32 bits to represent
> PR> single characters and utf-8 to represent strings. I'm not sure what the
> PR> current state of play is in the Java and .net worlds. Anyone in this list
> PR> with a view?
>
> C/C++ (at least GCC) define char[] as byte array. Some (non-standard)
> classes like UTF8String allow you to manipulate strings in UTF-8.
> Since UTF-8 is safe from ASCII point of view - you can use UTF-8 for
> your sources.
>
I think you mean that ASCII is a proper subset of utf8.
> Java initially used UTF-8 as encoding for sources (if I remember
> correctly).
> PHP does not take care of source encoding - it is the programmer's
> responsibility, but you must use mb_* prefixed functions when you are
> working with multibyte characters.
>
Incorrect. Java originally used 16 bit Unicode. Some classes, however,
use utf16. This should be confusing, but for most of the non-Asian
world, 16 bit Unicode is a subset of utf16.
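
To illustrate the subset relationship, here is a small sketch of UTF-16 encoding for a single code point. The function name utf16_encode is hypothetical. Code points below U+10000 come out as one 16-bit unit with the same value as in 16 bit Unicode; only code points above that range need a surrogate pair:

```c
#include <stddef.h>
#include <stdint.h>

/* Encodes one code point as UTF-16 into out[]. Returns the number of
   16-bit units written (1 or 2), or 0 if cp cannot be encoded. */
size_t utf16_encode(uint32_t cp, uint16_t out[2])
{
    if (cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF))
        return 0;                       /* out of range or a surrogate */

    if (cp < 0x10000) {
        out[0] = (uint16_t)cp;          /* identical to 16 bit Unicode */
        return 1;
    }

    cp -= 0x10000;                      /* 20 bits remain */
    out[0] = (uint16_t)(0xD800 | (cp >> 10));    /* high surrogate */
    out[1] = (uint16_t)(0xDC00 | (cp & 0x3FF));  /* low surrogate */
    return 2;
}
```

U+00E9 encodes as the single unit 0x00E9, while U+20000 (a CJK ideograph outside the 16-bit range) encodes as the pair 0xD840 0xDC00.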
> PR> However, we are getting away from collations, which is related to
> PR> encodings but a different topic really.
>
> Yes, but it raises another question - which letter must come first, if we
> order strings alphabetically - Hebrew "alef", Greek "alpha", Latin "A", Russian
> "A", Bulgarian "A" or Ukrainian "A" and why?
>
There is a Default Unicode Collation Element Table (DUCET) that defines,
well, the default Unicode collation. But it is only the default. A
collation designer can implement any rules he or she wants.
The hard part of collation is not how base characters are collated but
how to handle accents, case, ligatures (characters that collate as two),
and character sequences that collate as a single character. Default
collations order by base characters first, accents if the base
characters are equal, then if the accents are also equal, case. To make
things exciting, the default collation specifies accents in the string
order while French specifies accents in reverse string order. And then
there are the countries that go out of their way to define a national
collating sequence arbitrarily different from their least-favored
neighbors. It should be obvious at this point that a universally
acceptable default collation sequence is folly.
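
The three-level ordering described above can be sketched as a toy comparison. This is not the Unicode collation algorithm: the coll_elem type, the coll_compare function, and the weights are all invented for illustration, and each character is assumed to map to exactly one element (so ligatures and multi-character sequences are left out).

```c
#include <stddef.h>

/* Hypothetical per-character weights: base letter, accent, case. */
typedef struct {
    int primary;
    int secondary;
    int tertiary;
} coll_elem;

/* Returns -1, 0, or 1. If backward_accents is nonzero, the accent
   level is compared from the end of the string, as for French. */
int coll_compare(const coll_elem *a, size_t na,
                 const coll_elem *b, size_t nb, int backward_accents)
{
    size_t n = na < nb ? na : nb;
    size_t i;

    /* Level 1: base characters, in string order. */
    for (i = 0; i < n; i++)
        if (a[i].primary != b[i].primary)
            return a[i].primary < b[i].primary ? -1 : 1;
    if (na != nb)
        return na < nb ? -1 : 1;        /* a prefix sorts first */

    /* Level 2: accents, forward or in reverse string order. */
    for (i = 0; i < n; i++) {
        size_t k = backward_accents ? n - 1 - i : i;
        if (a[k].secondary != b[k].secondary)
            return a[k].secondary < b[k].secondary ? -1 : 1;
    }

    /* Level 3: case, only when base characters and accents all match. */
    for (i = 0; i < n; i++)
        if (a[i].tertiary != b[i].tertiary)
            return a[i].tertiary < b[i].tertiary ? -1 : 1;

    return 0;
}
```

With weights where the accent on the final "e" of "coté" is 1 and the accent on the "o" of "côte" is 2, forward accent order puts "coté" before "côte", while reverse accent order puts "côte" first, because the comparison reaches the final "e" before it reaches the "o".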
> --
> P.S. Personally, I prefer to work with UTF-8 in its native form (with
> some exceptions, where speed comes first) - you do not care
> about endianness and you can send it over network without any change.
>
>
Jim Starkey
Founder, NimbusDB, Inc.
978 526-1376