firebird-support - Re: [firebird-support] Writing UTF16 to the database

Subject	Re: [firebird-support] Writing UTF16 to the database
Author	Scott Morgan
Post date	2005-02-22T23:40:44Z

Brad Pepers wrote:

> Doing so would have many benefits with at least some hurdles:
>
>1. Depending on the Unicode format used internally (UTF-8/16/32) this
>could have a high penalty on string sizes.
>

I think UTF-16 is the best compromise; hell, disk space is cheap, so even
UTF-32 is reasonable. But overall it doesn't really matter. What's
important is getting the data in and out in a useful form, and the speed
of the engine.
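
As a rough illustration of the size penalty in point 1, the sketch below
encodes the same text in each Unicode form and counts the bytes. The sample
strings and the function name encoded_sizes are made up for this example;
nothing here is taken from Firebird itself.

```python
# Illustrative samples only; not taken from any Firebird data.
samples = {
    "ascii": "hello",
    "latin": "héllo",
    "cjk": "日本語",
}


def encoded_sizes(text):
    """Return the byte count of text in UTF-8, UTF-16 and UTF-32 (no BOM)."""
    return {
        "utf-8": len(text.encode("utf-8")),
        "utf-16": len(text.encode("utf-16-le")),
        "utf-32": len(text.encode("utf-32-le")),
    }


if __name__ == "__main__":
    for name, text in samples.items():
        print(name, encoded_sizes(text))
```

Plain ASCII doubles in size under UTF-16 and quadruples under UTF-32, while
CJK text is actually smaller in UTF-16 than in UTF-8, so the penalty depends
heavily on what is being stored.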

>2. Also doing this will likely require that there is a string class that
>uses Unicode internally and this would make the class a heavier
>implementation
>

A Unicode string class isn't really that much 'heavier' than a normal
string class. The hardest bit is handling multi-byte situations, but then we
already have that with many of the existing encodings.
There is a UCS2 (UTF-16) implementation in the source
(src/intl/lc_unicode_ucs2.c) which is used internally for...

>3. You would need a new system to convert from/to Unicode from supported
>character sets so it would largely replace the existing collation
>sequences and such which is a large job I'm sure.
>
>

When transcoding from one set to another, if there isn't a direct route
available, the text is transcoded to UCS2 and then to the target encoding.

http://www.ibphoenix.com/main.nfs?a=ibphoenix&l=;PAGES;NAME='ibp_collation'
(section titled 'Two Conversion Objects')
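
The pivot idea can be sketched in a few lines. This is only an illustration
of the two-step route, not the engine's code: Python's own str type stands in
for the UCS2 intermediate, and the function name transcode is hypothetical.

```python
def transcode(data, source, target):
    """Re-encode bytes from source to target via a Unicode pivot.

    Assumption: Python's str plays the role of the UCS2 intermediate
    form; the real engine uses its own conversion objects.
    """
    text = data.decode(source)   # step 1: source encoding -> Unicode
    return text.encode(target)   # step 2: Unicode -> target encoding


if __name__ == "__main__":
    latin1_bytes = "café".encode("iso-8859-1")
    print(transcode(latin1_bytes, "iso-8859-1", "utf-8"))
```

With a pivot, each character set needs only a converter to and from Unicode,
rather than one converter per pair of character sets.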

>4. Comparison and collation with Unicode is not a trivial problem to
>solve.
>
>

Collation is a PITA no matter what (it's worrying how many devs I've met
who are totally ignorant of the various cultural differences in text
handling, not least of which is sort order). But there is an open
source project that can, at the very least, help.

http://icu.sourceforge.net/

IIRC the dev team are aware of this project and plan to use it.
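
A small sketch of why this matters: sorting by raw code point, which is what
a naive comparison does, gives an order few readers would expect. The word
list and the function name code_point_order are made up for this example, and
it deliberately does not use ICU.

```python
def code_point_order(words):
    """Sort words by raw code point, with no cultural collation rules."""
    return sorted(words)


if __name__ == "__main__":
    words = ["zebra", "Äpfel", "apple", "Zoo"]
    # Uppercase ASCII sorts before lowercase, and 'Ä' (U+00C4) sorts
    # after every ASCII letter, so "Äpfel" ends up last.
    print(code_point_order(words))
```

A German reader would expect "Äpfel" next to "apple", which is the kind of
rule a proper collation library has to supply.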

>5. If the key of an index is a string, how is it handled now? I was
>under the impression it was stored in a binary format that is expected
>to compare properly using a straight binary comparison.
>

Although there are many ways you can encode certain glyphs in the
various Unicode systems, there are standards to normalise them, which
should enable binary matching.

http://www.unicode.org/reports/tr15/
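
A minimal sketch of the idea, using Python's standard unicodedata module:
the same visible character can be stored as one code point or as a base
letter plus a combining mark, and normalising both to one form makes the
bytes identical. The function name normalised_key and the choice of NFC plus
UTF-8 are assumptions for illustration, not how Firebird builds index keys.

```python
import unicodedata

composed = "\u00e9"      # 'é' as a single code point
decomposed = "e\u0301"   # 'e' followed by a combining acute accent


def normalised_key(text, form="NFC"):
    """Return bytes that compare equal for canonically equivalent strings."""
    return unicodedata.normalize(form, text).encode("utf-8")


if __name__ == "__main__":
    print(composed == decomposed)                                   # raw strings differ
    print(normalised_key(composed) == normalised_key(decomposed))   # keys match
```

Normalisation only makes equivalent strings byte-identical; it does not by
itself give a culturally correct sort order.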

Scott