Subject | Re: [Firebird-Architect] Writing UTF16 to the database |
---|---|
Author | Jim Starkey |
Post date | 2005-03-01T15:11:43Z |
Adriano dos Santos Fernandes wrote:
>The "brilliant" proposal has missing some things.Let's take these one at time.
>
>For *all* current FB users (those who don't depend on the broken UNICODE_FSS),
>it will cause these problems:
>
>1) Slow (because conversions will always be made).
>2) Break all UDFs that use strings.
>3) Break all external tables.
>4) Doesn't simplify the engine, because conversions between charsets are
>done just like conversions between numbers and strings.
>5) Waste of disk space and memory.
>
>(2) and (3) will make users say: "Where did this garbage come from?"
>
Let's take these one at a time.
1) Slow: Only the tiniest fraction of machine cycles is spent moving
and comparing user data strings. Performance is limited by more
frequent, more expensive operations. For sorting and indexing, which
are high-frequency operations, all character strings are transformed
into naturally collating byte sequences anyway, so there's no change
there. For assignments, UTF-8 can use a few more bytes. For equality
comparisons, the difference is insignificant. For signed comparisons,
the only necessary analysis is at the point of inequality. And all of
this assumes that both operands are of the same type. In the current
system, if operands are of different character sets, both must be
converted to a common character set. With universal internal UTF-8,
that conversion case doesn't exist.
There will be cases where a uniform non-ASCII character set may be
marginally more efficient than UTF-8. When mixed character sets are
used, internal UTF-8 will be significantly faster.
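
To make that concrete, here is a minimal sketch, illustrative only and
not engine code (the function name is invented), of why raw UTF-8 needs
no decoding for equality or signed comparisons: the answer is decided at
the first differing byte, and UTF-8 was designed so that byte order
matches code point order.

    #include <cstring>
    #include <algorithm>

    // Compare two raw UTF-8 strings byte by byte.  Because UTF-8 byte
    // order matches Unicode code point order, the result is known at the
    // first differing byte; no decoding and no conversion to a common
    // character set is needed.  (Linguistic ordering still goes through a
    // collation, which builds the naturally collating key mentioned above.)
    int compareUtf8(const unsigned char* a, int aLength,
                    const unsigned char* b, int bLength)
    {
        int n = std::min(aLength, bLength);
        int diff = std::memcmp(a, b, n);
        if (diff != 0)
            return diff;            // point of inequality decides the result
        return aLength - bLength;   // equal prefix: shorter string sorts first
    }
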
2) UDFs: If we can handle "UNICODE_FSS" now, I think we can find a way
to handle UTF-8 without breaking anything.
3) External Tables: I can't see a problem, though data will have to be
converted to UTF-8 when it enters the engine and vice versa.
4) Engine complexity: Having encapsulated the mover into a C++ class, I
promise you the code will be much, much simpler. It will not be
necessary to carry around character set information, intermediate
conversions will not be necessary, and when an explicit collation is
present, it is represented by a simple object pointer resolved at
compile time.
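
As a rough sketch of what I have in mind (the class and field names are
invented for illustration and are not the actual engine classes), the
descriptor carries at most a collation pointer, and the mover never has
to ask what character set a value is in:

    #include <cstring>
    #include <algorithm>

    // Invented names, for illustration only.  With one internal encoding
    // a string descriptor needs no character set id, just an optional
    // collation object that the compiler resolved once at compile time.
    class Collation
    {
    public:
        virtual ~Collation() {}
        virtual int compare(int l1, const unsigned char* s1,
                            int l2, const unsigned char* s2) const = 0;
    };

    struct StringValue
    {
        int                  length;      // byte length of the UTF-8 data
        const unsigned char* data;        // always UTF-8 inside the engine
        const Collation*     collation;   // NULL unless an explicit collation applies
    };

    int compareValues(const StringValue& a, const StringValue& b)
    {
        if (a.collation)    // just a pointer stored in the compiled node
            return a.collation->compare(a.length, a.data, b.length, b.data);

        int n = std::min(a.length, b.length);
        int diff = std::memcmp(a.data, b.data, n);
        return diff ? diff : a.length - b.length;
    }
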
5) Memory utilization: What is the basis for the concern? Is it the
odd character in Latin character sets that requires an extra byte?
Can you estimate a "bloat factor" for typical records?
--
Jim Starkey
Netfrastructure, Inc.
978 526-1376