Subject: Re: [firebird-support] Writing UTF16 to the database
Author: Ann W. Harrison
Brad Pepers wrote:
>
> 1. Depending on the Unicode format used internally (UTF-8/16/32), this
> could have a high penalty on string sizes. It's not too bad with UTF-8,
> but it will have at least a slight overhead compared to a string
> stored in an exact encoding (for example, accented characters will take 2
> bytes with UTF-8 as compared to 1 with an encoding supporting the
> accented character directly).

Right. There's no extra cost for ASCII, and some extra cost for accented
Latin text - say a 15% increase in string size, since most
characters in most languages that use the Latin alphabet are unaccented.
However, non-Latin phonetic alphabets - Greek, Cyrillic, Arabic -
currently use a one-byte representation, so the string size would double.
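
Those ratios are easy to check. Here's a minimal sketch using Python's
standard codecs - the codec names are Python's, standing in for
Firebird character sets, which name things differently:

samples = [
    ("ASCII",    "database",     "ascii"),    # no UTF-8 overhead at all
    ("Latin",    "café déjà vu", "latin-1"),  # accented letters double
    ("Cyrillic", "привет",       "koi8_r"),   # every letter doubles
]

for label, text, single_byte_codec in samples:
    native = len(text.encode(single_byte_codec))
    utf8 = len(text.encode("utf-8"))
    print(f"{label:8s} {native:2d} bytes native, {utf8:2d} bytes as UTF-8")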

Some of that cost could be offset with a better string compression
algorithm for data going to disk. And if all strings used the same
character representation, we could probably afford a serious compression
algorithm, since there would be only one to write and maintain.
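
As a rough illustration - zlib here stands in for whatever on-disk
compression the engine might adopt, not Firebird's actual record
compression, which I believe is run-length based - repetitive Cyrillic
text doubles in size as UTF-8, but a general-purpose compressor claws
most of that back:

import zlib

text = "привет мир " * 50
utf8 = text.encode("utf-8")                 # 1000 bytes vs. 550 in KOI8-R
print(len(utf8), len(zlib.compress(utf8)))  # compressed: a few dozen bytes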

>
> 2. Also, doing this will likely require a string class that
> uses Unicode internally, and this would make the class a heavier
> implementation than just a simple wrapper around a char*. I
> suspect once such a string class is created, it should be used in all
> areas of Firebird rather than having two separate string classes
> depending on whether you currently think you need Unicode support or
> not, but this would have to be thought out more.

My thinking at the moment is that having a dozen different character
representations possible in the database makes all string functions
horribly difficult. Some string functions are collation dependent and
will be hard regardless (e.g. upper/lower), but some (substring, strlen,
finding the offset of a given character, etc.) could be done once in the
engine, rather than once per collation.
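
For instance, the representation-independent functions could be written
once against UTF-8 bytes, along these lines (Python for brevity, and
purely illustrative - the engine's real code would be C++, and this
does no bounds checking):

def utf8_strlen(buf: bytes) -> int:
    # Count code points by counting bytes that are not UTF-8
    # continuation bytes (continuation bytes look like 0b10xxxxxx).
    return sum((b & 0xC0) != 0x80 for b in buf)

def utf8_substring(buf: bytes, start: int, length: int) -> bytes:
    # Record the byte offset of each code-point boundary, then slice.
    bounds = [i for i, b in enumerate(buf) if (b & 0xC0) != 0x80]
    bounds.append(len(buf))
    return buf[bounds[start]:bounds[start + length]]

s = "naïve日本".encode("utf-8")
print(utf8_strlen(s))                    # 7 code points in 12 bytes
print(utf8_substring(s, 2, 3).decode())  # 'ïve'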

At the time the current internationalization was designed, there was a
requirement to match all the quirks of various other products' support
for international characters. I'm not sure providing a dozen or more
storage formats is the best way to do that today. One result is
that there are several implementations of collations, which seems like a
waste of effort.
>
> 3. You would need a new system to convert to/from Unicode from the
> supported character sets, so it would largely replace the existing
> collation sequences and such, which is a large job, I'm sure.

Each current character set includes a conversion to and from Unicode,
which is a starting place. The conversion is currently used for
comparisons. It could as easily be used as a conversion between stored
and external formats.
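
In other words, Unicode becomes the pivot. A minimal sketch with
Python's codecs (cp1252 and utf-8 standing in for Firebird character
sets such as WIN1252 and UTF8):

stored = "Grüße".encode("cp1252")      # pretend on-disk representation
as_unicode = stored.decode("cp1252")   # stored -> Unicode; this direction
                                       # already exists, for comparisons
external = as_unicode.encode("utf-8")  # Unicode -> the client's format
print(external)                        # b'Gr\xc3\xbc\xc3\x9fe'
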
>
> 4. Comparison and collation with Unicode is not a trivial problem to
> solve. I'm not sure if there is any code out there that already does
> part of this, but it's quite complicated, though with potential benefits
> if it's done right, such as having caseless comparisons and being able
> to handle any mix of character sets properly.

Equality comparisons are reasonably straightforward, as are functions
that break a string based on length or the existence of an explicit
pattern. Of course, we currently need different implementations for
different character representations. In this case, having one stored
representation would be simpler.
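
A toy example of why - the two encodings stand in for any two of the
dozen representations:

# The same word stored two different ways fails a bytewise equality
# test even though the strings are equal...
a = "straße".encode("utf-8")
b = "straße".encode("latin-1")
print(a == b)                                    # False

# ...but once both sides are in one representation, equality really is
# a plain comparison, implemented once.
print(a.decode("utf-8") == b.decode("latin-1"))  # True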

For the rest, we're already struggling. Having lots of representations
makes it harder to handle collation-sensitive operations consistently
and correctly - or at least it's harder for me to think about the
problem; how the people who actually write the code cope, I do not know.
>
> 5. If the key of an index is a string, how is it handled now? I was
> under the impression it was stored in a binary format that is expected
> to compare properly using a straight binary comparison.

String data is stored in the character representation defined for the
column.

Indexed fields are transformed before being stored in an index key.
The code that implements the character set / collation pair defines a
transformation from the character representation to a representation
that sorts correctly bytewise. The transformed value is stored as the
index key. That transformation usually increases the size of characters
to two or three bytes, depending on the complexity of the ordering.

The code for the character set / collation pair also defines comparisons
for less, greater, and equal that produce (in theory) the same results
as the index.
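
The same idea exists in miniature in the C library's strxfrm, exposed
in Python as locale.strxfrm: it maps a string to a key whose plain
comparison matches the locale's collation order. (The locale name below
is an assumption - it has to be installed on your system.)

import functools
import locale

locale.setlocale(locale.LC_COLLATE, "de_DE.UTF-8")

words = ["Zebra", "Äpfel", "Apfel", "Öl"]

by_key = sorted(words, key=locale.strxfrm)  # compare transformed keys
by_cmp = sorted(words, key=functools.cmp_to_key(locale.strcoll))

# The two orderings must agree, just as stored index keys must agree
# with the engine's less/greater/equal functions.
print(by_key == by_cmp, by_key)  # True ['Apfel', 'Äpfel', 'Öl', 'Zebra']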


Regards,


Ann