Subject Re[4]: [Firebird-Architect] A Fresh Look at Collations
Author Sergey Mereutsa
Hello Paul,

PR> Perhaps you have utf-fss in mind, which is 1..6 bytes although it was
PR> often implemented as 1..3 bytes (and fixed 3 bytes in FB).
No, I mean UTF-8 in its native representation.

PR> Utf-8 is 1 to 4 bytes.

It does not matter whether it is 2, 3 or 4 (I said "up to 6 bytes"
because 5- and 6-byte sequences are possible: those lead bytes are
reserved, but not currently used) - it is still a variable number of
bytes per character.
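
To make the "variable number of bytes" point concrete, here is a small Python sketch (my own illustration; the sample characters and the helper name utf8_len are arbitrary choices). It only shows how many bytes UTF-8 uses for a few code points:

```python
# How many bytes UTF-8 needs for a few sample characters.
# The samples and the helper name are illustrative choices only.
samples = {
    "A": "LATIN CAPITAL LETTER A",            # U+0041
    "\u0410": "CYRILLIC CAPITAL LETTER A",    # U+0410
    "\u20ac": "EURO SIGN",                    # U+20AC
    "\U00020000": "CJK ideograph in plane 2", # U+20000
}


def utf8_len(ch: str) -> int:
    """Return the number of bytes in the UTF-8 encoding of one character."""
    return len(ch.encode("utf-8"))


for ch, name in samples.items():
    print(f"U+{ord(ch):04X} {name}: {utf8_len(ch)} byte(s)")
```

The four samples take 1, 2, 3 and 4 bytes respectively, so the width of a character cannot be known without looking at its lead byte.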

PR> Perhaps not as easy as fixed length characters, but for that the
PR> only real solution is to move to utf-32. utf-16 (as different from ucs-16)
PR> is not a solution, as utf-16 has both 2 byte and 4 byte sequences (and
PR> support for the 4 byte sequence is required in China).

This was the point - you must count all the bytes to know how many
_characters_ are in the string, and you cannot say whether an address
like string[myIndex] is valid or not _without_ walking (in the worst
case) all the _bytes_ of this string.
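
A minimal sketch of that walk in Python, assuming the input is already valid UTF-8 (the function names char_count and byte_offset are hypothetical, chosen only for this illustration). Both functions have to scan the bytes from the start, because a character boundary can only be recognised by inspecting each byte:

```python
def is_continuation(b: int) -> bool:
    """UTF-8 continuation bytes have the bit pattern 10xxxxxx."""
    return (b & 0xC0) == 0x80


def char_count(data: bytes) -> int:
    """Count characters by counting every byte that starts a sequence."""
    return sum(1 for b in data if not is_continuation(b))


def byte_offset(data: bytes, index: int) -> int:
    """Return the byte offset of character number `index` (0-based).

    Raises IndexError when the string has fewer characters, which is
    only known after the bytes have been walked.
    """
    seen = -1
    for pos, b in enumerate(data):
        if not is_continuation(b):
            seen += 1
            if seen == index:
                return pos
    raise IndexError("character index out of range")


data = "A\u0410\u20ac".encode("utf-8")  # 1 + 2 + 3 = 6 bytes
print(len(data), char_count(data))      # 6 bytes, 3 characters
print(byte_offset(data, 2))             # third character starts at byte 3
```

With a fixed-width encoding both answers would be simple arithmetic on the byte length; here they cost a scan.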

PR> In my perception most newly written C/C++ code uses 32 bits to represent
PR> single characters and utf-8 to represent strings. I'm not sure what the
PR> current state of play is in the Java and .net worlds. Anyone in this list
PR> with a view?

C/C++ (at least GCC) defines char[] as a byte array. Some (non-standard)
classes like UTF8String let you manipulate strings in UTF-8.
Since UTF-8 is safe from the ASCII point of view, you can use UTF-8 for
your sources.
Java initially used UTF-8 as the encoding for sources (if I remember
correctly).
PHP does not care about source encoding - that is the programmer's
responsibility - but you must use the mb_*-prefixed functions when you
are working with multibyte characters.

PR> However, we are getting away from collations, which is related to
PR> encodings but a different topic really.

Yes, but it raises another question - which letter must come first if we
order strings alphabetically: Hebrew "alef", Greek "alpha", Latin "A",
Russian "A", Bulgarian "A" or Ukrainian "A", and why?

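For reference, a small Python sketch of what plain binary ordering gives for these letters (the dictionary and variable names are just for illustration). It is not a linguistic collation, only the order of the code points. Note also that the Russian, Bulgarian and Ukrainian "A" are one and the same code point, U+0410, so the encoding alone cannot tell them apart:

```python
# Order the letters by Unicode code point and by raw UTF-8 bytes.
# This is binary ordering only, not a language-aware collation.
letters = {
    "hebrew alef": "\u05d0",
    "greek alpha": "\u0391",
    "latin A": "A",
    "cyrillic A": "\u0410",  # shared by Russian, Bulgarian and Ukrainian
}

by_code_point = sorted(letters, key=lambda name: ord(letters[name]))
by_utf8_bytes = sorted(letters, key=lambda name: letters[name].encode("utf-8"))

print(by_code_point)
print(by_utf8_bytes)
```

Both lists come out as latin A, greek alpha, cyrillic A, hebrew alef, since comparing UTF-8 bytes preserves code point order. Whether that order is "right" for any particular language is exactly the collation question.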

P.S. Personally, I prefer to work with UTF-8 in its native form (with
some exceptions, where speed comes first) - you do not have to care
about endianness and you can send it over the network without any change.
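
A short Python sketch of the endianness point (the sample text is an arbitrary choice). UTF-8 is defined as a sequence of bytes, so there is only one serialisation, while UTF-16 code units can be written in two byte orders:

```python
# Same two characters, three serialisations.
text = "A\u0410"  # Latin A followed by Cyrillic A

utf8 = text.encode("utf-8")          # one possible byte sequence
utf16_le = text.encode("utf-16-le")  # little-endian 16-bit units
utf16_be = text.encode("utf-16-be")  # big-endian 16-bit units

print(utf8.hex(" "))
print(utf16_le.hex(" "))
print(utf16_be.hex(" "))
```

The UTF-16 forms contain the same code units with the bytes of each unit swapped, so the receiver has to know (or detect) the byte order; the UTF-8 bytes need no such agreement.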

--
Best regards,
Sergey mailto:serj@...