Subject | Re: Re[2]: [Firebird-Architect] A Fresh Look at Collations
---|---
Author | Paul Ruizendaal |
Post date | 2010-06-21T15:42:55Z |
> AWH> Why do you say UTF-8 is slow?
>
> Because you cannot count string length, for example, without walking
> all of it - each char in UTF-8 (if we speak about its native
> representation) can be from 1 to 6 bytes long. If I misunderstood
> your point - sorry.
>
> Sergey.
Perhaps you have UTF-FSS in mind, which allows 1 to 6 bytes per character,
although it was often implemented as 1 to 3 bytes (and as a fixed 3 bytes in FB).
UTF-8 is 1 to 4 bytes per character. Indeed, Greek and Cyrillic letters require
2 bytes (instead of 1 under a dedicated code page), and many Asian scripts require
3 bytes instead of the 2 used by UCS-2. With current memory and disk sizes, I don't
think this is a big problem anymore. Interestingly, when a text mixes ASCII and
non-ASCII (e.g. an HTML document with ASCII markup and Japanese content), the
UTF-8 version tends to be smaller than the UCS-2 equivalent.
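As a rough illustration with made-up numbers: a page of 800 ASCII markup
characters plus 200 Japanese characters takes 800 × 1 + 200 × 3 = 1400 bytes
in UTF-8, but 800 × 2 + 200 × 2 = 2000 bytes in UTF-16.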
Counting string length in bytes takes just as long in ASCII as it does in
UTF-8: in both cases the string is walked until the terminating zero is found.
Counting characters ("code points") is indeed a bit harder: only bytes below
0x80 and lead bytes of 0xC0 and above start a new character, while bytes in
the range 0x80-0xBF are continuation bytes. Perhaps not as easy as fixed-length
characters, but for that the only real solution is to move to UTF-32. UTF-16
(as distinct from UCS-2) is not a solution, as UTF-16 has both 2-byte and
4-byte (surrogate pair) sequences, and support for the 4-byte sequences is
required in China (GB18030).
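For what it's worth, a minimal sketch in C of counting code points that way
(the function name is mine, and it assumes a valid, null-terminated UTF-8
string):

```c
#include <stddef.h>

/* Count code points in a null-terminated UTF-8 string by counting only
   the bytes that start a character: ASCII bytes (< 0x80) and multi-byte
   lead bytes (>= 0xC0). Continuation bytes (0x80..0xBF) are skipped.
   Assumes the input is valid UTF-8. */
size_t utf8_codepoints(const char *s)
{
    size_t count = 0;
    for (const unsigned char *p = (const unsigned char *)s; *p; p++) {
        if (*p < 0x80 || *p >= 0xC0)
            count++;
    }
    return count;
}
```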
In my perception most newly written C/C++ code uses 32 bits to represent
single characters and UTF-8 to represent strings. I'm not sure what the
current state of play is in the Java and .NET worlds. Anyone on this list
with a view?
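That pattern might look roughly like the sketch below in C (the function name
and the lack of validation are my own choices, not anything discussed in this
thread): decoding one UTF-8 sequence from a byte string into a 32-bit code
point.

```c
#include <stddef.h>
#include <stdint.h>

/* Decode one UTF-8 sequence starting at s into a 32-bit code point.
   Returns the number of bytes consumed (1..4), or 0 for an invalid lead
   byte. Continuation bytes and overlong forms are not validated - this
   is a sketch, not production code. */
size_t utf8_decode(const char *s, uint32_t *out)
{
    const unsigned char *p = (const unsigned char *)s;
    if (p[0] < 0x80) {                      /* 1 byte: plain ASCII */
        *out = p[0];
        return 1;
    }
    if ((p[0] & 0xE0) == 0xC0) {            /* 2 bytes: 110xxxxx 10xxxxxx */
        *out = ((uint32_t)(p[0] & 0x1F) << 6) | (p[1] & 0x3F);
        return 2;
    }
    if ((p[0] & 0xF0) == 0xE0) {            /* 3 bytes: 1110xxxx + 2 continuation bytes */
        *out = ((uint32_t)(p[0] & 0x0F) << 12) |
               ((uint32_t)(p[1] & 0x3F) << 6)  |
               (p[2] & 0x3F);
        return 3;
    }
    if ((p[0] & 0xF8) == 0xF0) {            /* 4 bytes: 11110xxx + 3 continuation bytes */
        *out = ((uint32_t)(p[0] & 0x07) << 18) |
               ((uint32_t)(p[1] & 0x3F) << 12) |
               ((uint32_t)(p[2] & 0x3F) << 6)  |
               (p[3] & 0x3F);
        return 4;
    }
    return 0;                               /* stray continuation byte or invalid lead */
}
```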
However, we are getting away from collations, which are related to encodings
but are really a different topic.
Regards,
Paul