Subject | Re: [Firebird-Architect] A Fresh Look at Collations |
---|---|
Author | Ann W. Harrison |
Post date | 2010-06-21T15:24:03Z |
Sergey Mereutsa wrote:
> Because you can not count string length, for example, without walking
> it all - because each char in UTF8 (if we speak about it native
> representation) can be from 1 to 6 bytes length.
I think UTF-8 characters never go beyond 4 bytes, but your point about
counting the characters in a string is true. On the other hand, strings
in SQL are either CHAR, in which case their length is the declared field
length, or VARCHAR, in which the number of significant characters is
stored in the field itself. In NimbusDB - Jim's current project -
strings always start with their byte length, which might seem like a
problem for variable length characters, particularly if you want to
start an equality comparison by comparing the lengths of the strings.
Two identical strings will have the same byte length and the same
character length, so it doesn't matter.
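
To make the two points above concrete, here is a minimal sketch in Python. It is not NimbusDB or Firebird code. The function names are made up for illustration, and the comparison is plain byte-wise equality, which is an assumption: under a collation that treats different byte sequences as equal, the length shortcut would not apply.

```python
def utf8_char_count(data: bytes) -> int:
    """Count characters (code points) in valid UTF-8 by walking every byte.

    Continuation bytes have the bit pattern 10xxxxxx, so each byte that is
    NOT a continuation byte starts a new character.
    """
    return sum(1 for b in data if (b & 0xC0) != 0x80)


def bytewise_equal(a: bytes, b: bytes) -> bool:
    """Equality test that begins by comparing byte lengths.

    Identical strings necessarily have identical byte lengths, so a length
    mismatch settles the answer without looking at the contents.
    """
    if len(a) != len(b):
        return False
    return a == b


word = "héllo".encode("utf-8")
print(len(word))              # byte length, known without a scan
print(utf8_char_count(word))  # character length, needs the full walk
print(bytewise_equal(word, "hello".encode("utf-8")))
```

The byte length is available up front, while the character count needs a pass over the whole string. The equality check only ever needs the byte length.
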
The advantages of UTF_8 over a fixed two-byte character representation
are first that UTF_8 can represent all glyphs in all human languages,
and second that the most commonly used characters - spaces and numbers -
are a single byte. I read somewhere recently that Japanese strings
are shorter in UTF_8 than in a fixed two-byte character set designed
specifically to represent Japanese - though of course, I've lost the
citation.
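
As a rough illustration of the size argument, the sketch below compares UTF-8 byte counts with a hypothetical fixed two-bytes-per-character encoding. The helper names and sample strings are assumptions chosen for the example, not measurements from any real database.

```python
def utf8_size(text: str) -> int:
    """Number of bytes the text occupies when encoded as UTF-8."""
    return len(text.encode("utf-8"))


def fixed_two_byte_size(text: str) -> int:
    """Size under a hypothetical encoding that spends exactly two bytes
    on every character (and so cannot go past 65,536 distinct characters)."""
    return 2 * len(text)


for sample in ["Room 42", "2010-06-21 15:24"]:
    # Spaces, digits and ASCII letters take one byte each in UTF-8.
    print(sample, utf8_size(sample), fixed_two_byte_size(sample))

# A character outside the two-byte range still fits in UTF-8, in 4 bytes.
print(utf8_size("\U0001F600"))
```

Text made mostly of spaces, digits, and ASCII letters comes out at half the size in UTF-8. Characters outside the ASCII range take two to four bytes each.
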
Cheers,
Ann