Subject Re: [Firebird-Architect] A Fresh Look at Collations
Author Jim Starkey
Sergey Mereutsa wrote:
> Hello Jim,
>
> JS> 1. The database engine itself is strictly utf8 only. Character set
>
> I'm not an expert, but in my experience, working in pure UTF8 is not
> a good idea - it is slow. Maybe a binary Unicode format would be better?
> Or is speed not a goal at all?
>
> P.S. All our texts are in UTF8 (because 2 languages are required by
> default - Romanian and Russian). Sometimes it is a pain.
>
>
>
It's a multi-national world, hence Unicode. And utf8 because it's
denser and doesn't suffer the endian problems of either utf16 or
32-bit Unicode.

I don't think performance is a significant issue. Sorts and indexes get
resolved to byte streams. Depending on the character set, utf8 strings
are longer than the same text in most national character sets, but by
less than a factor of two, since punctuation, digits, and spaces are all
single byte characters. Finally, turning utf8 into Unicode code points
is cheap:

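static inline uint getUnicodeChar(const char*& p)
{
    // utf8Lengths and utf8Values are lookup tables indexed by the lead
    // byte (defined elsewhere): the sequence length in bytes, and the
    // data bits carried by the lead byte itself.
    UCHAR c = *p++;
    int len = utf8Lengths[c];
    uint code = utf8Values[c];

    // Each continuation byte contributes its low six bits.
    for (; len > 1; --len)
        code = (code << 6) | (*p++ & 0x3f);

    return code;
}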
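
For anyone who wants to try the function, here is a minimal sketch of how
the two tables might be filled in. The table contents, the initUtf8Tables
and decodeSample names, and the choice to map invalid lead bytes to U+FFFD
are all assumptions for illustration, not the engine's actual code. Like
the function above, it does no validation of continuation bytes.

```cpp
#include <cassert>

typedef unsigned char UCHAR;
typedef unsigned int uint;

// Hypothetical stand-ins for the lookup tables, indexed by lead byte.
static int  utf8Lengths[256];   // total bytes in the sequence
static uint utf8Values[256];    // data bits carried by the lead byte

// Assumed initialization: derive both tables from the lead-byte bit pattern.
static void initUtf8Tables()
{
    for (int c = 0; c < 256; ++c)
    {
        if (c < 0x80)      { utf8Lengths[c] = 1; utf8Values[c] = c; }         // 0xxxxxxx
        else if (c < 0xC0) { utf8Lengths[c] = 1; utf8Values[c] = 0xFFFD; }    // stray continuation byte
        else if (c < 0xE0) { utf8Lengths[c] = 2; utf8Values[c] = c & 0x1F; }  // 110xxxxx
        else if (c < 0xF0) { utf8Lengths[c] = 3; utf8Values[c] = c & 0x0F; }  // 1110xxxx
        else if (c < 0xF8) { utf8Lengths[c] = 4; utf8Values[c] = c & 0x07; }  // 11110xxx
        else               { utf8Lengths[c] = 1; utf8Values[c] = 0xFFFD; }    // never a valid lead byte
    }
}

// Same decoding loop as in the message above.
static inline uint getUnicodeChar(const char*& p)
{
    UCHAR c = *p++;
    int len = utf8Lengths[c];
    uint code = utf8Values[c];

    for (; len > 1; --len)
        code = (code << 6) | (*p++ & 0x3f);

    return code;
}

// Decodes one sequence of each length: 'A', U+00E9, U+20AC, U+1F600.
static void decodeSample()
{
    initUtf8Tables();
    const char* p = "A\xC3\xA9\xE2\x82\xAC\xF0\x9F\x98\x80";
    assert(getUnicodeChar(p) == 0x41);
    assert(getUnicodeChar(p) == 0xE9);
    assert(getUnicodeChar(p) == 0x20AC);
    assert(getUnicodeChar(p) == 0x1F600);
    assert(*p == '\0');   // the pointer has advanced past all ten bytes
}
```

The per-character work is two table lookups plus one shift-and-mask per
continuation byte, which is the reason for the "cheap" claim.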

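The "less than a factor of two" estimate is easy to check on a sample.
The sketch below is only an illustration: utf8CodePoints and sizeSample
are hypothetical names, and the Russian sample string is my own. In a
single-byte national character set such as Windows-1251, the byte count
equals the character count, so the ratio of utf8 bytes to code points is
the growth factor.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>

// Counts code points by counting every byte that is not a
// 10xxxxxx continuation byte.
static std::size_t utf8CodePoints(const char* s)
{
    std::size_t n = 0;
    for (; *s; ++s)
        if ((static_cast<unsigned char>(*s) & 0xC0) != 0x80)
            ++n;
    return n;
}

static void sizeSample()
{
    // "Privet, mir 123" written in Cyrillic, spelled out as utf8 bytes:
    // 9 Cyrillic letters (2 bytes each) plus 6 ASCII characters.
    const char* s =
        "\xD0\x9F\xD1\x80\xD0\xB8\xD0\xB2\xD0\xB5\xD1\x82, "
        "\xD0\xBC\xD0\xB8\xD1\x80 123";

    assert(std::strlen(s) == 24);      // utf8 bytes
    assert(utf8CodePoints(s) == 15);   // bytes in a single-byte charset
}
```

Here the utf8 form is 24 bytes against 15, a factor of 1.6, because the
comma, spaces, and digits stay at one byte each.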

--
Jim Starkey
Founder, NimbusDB, Inc.
978 526-1376


