firebird-support - Re: cannot transliterate between character between character sets

Subject	Re: cannot transliterate between character between character sets
Author	peter_jacobi.rm
Post date	2004-04-28T14:47:03Z

Hi Terry, All,

Unicode support in Firebird is a mixed bag and gives
developers interesting times (in the sense of the Chinese
proverb).

Terry Johnson <terry@s...> wrote:

> How about UCS4 data? How do you go about entering that? Is that still
> managed as a string entry?

There is no direct support for non-BMP characters (code points
above U+FFFF), so if you are asking about UCS-4 because of a
dire need to support Linear B or Byzantine Musical Notation
(or, more likely, GB18030), you are somewhat out of luck.
(Don't stop reading yet.)
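
To check whether your data contains non-BMP characters at all,
a minimal sketch in Python (the helper name is made up for
illustration, it is not part of any Firebird tool):

```python
def non_bmp_chars(text):
    """Return the characters of text that lie outside the BMP (above U+FFFF)."""
    return [ch for ch in text if ord(ch) > 0xFFFF]

# U+10000 is a Linear B syllable, U+1D000 a Byzantine musical symbol.
sample = "A\u00e9\U00010000\U0001D000"
print([hex(ord(ch)) for ch in non_bmp_chars(sample)])  # ['0x10000', '0x1d000']
```

If this list is always empty for your data, the non-BMP problem
does not affect you.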

Also, the automatic character set conversion done by Firebird
itself turns from a feature into a bug whenever a conversion
is called for that changes the byte length of the string.
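
To illustrate what "changes the byte length" means, here is a
small Python sketch. It only shows the effect on the byte count
and makes no claim about where exactly Firebird goes wrong. The
function name is an assumption for this example.

```python
def byte_lengths(text):
    """Return (characters, bytes in ISO8859_1, bytes in UTF-8) for text."""
    return (
        len(text),
        len(text.encode("iso8859_1")),
        len(text.encode("utf-8")),
    )

# "Gruesse" spelled with u-umlaut and sharp s: 5 characters.
print(byte_lengths("Gr\u00fc\u00dfe"))  # (5, 5, 7)
```

The same five characters need 5 bytes in one charset and 7 in
the other, so a buffer sized for the first is too small for the
second.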

So you are essentially left with two models:

a) Have some fixed database character set and use it also
as your connection charset

b) Use all the funny charsets you need in the database, connect
using charset NONE, and do the necessary charset conversions
in your software (or in a middle layer, like a .NET provider).
This requires FB 1.5.1.
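
Model b) can be sketched like this in Python. It assumes your
driver hands you the raw bytes of a column when connected with
charset NONE, and that you know each column's charset yourself.
The mapping table and both helper names are hypothetical, and
treating UNICODE_FSS as plain UTF-8 is an approximation.

```python
# Hypothetical mapping from Firebird charset names to Python codec names.
FB_TO_CODEC = {
    "ISO8859_1": "iso8859_1",
    "WIN1251": "cp1251",
    "WIN1252": "cp1252",
    "UNICODE_FSS": "utf-8",  # approximation, fine for BMP text
}

def decode_column(raw, fb_charset):
    """Turn raw column bytes into a string, using the column's charset."""
    return raw.decode(FB_TO_CODEC[fb_charset])

def encode_column(text, fb_charset):
    """Turn a string into the bytes to store in a column of that charset."""
    return text.encode(FB_TO_CODEC[fb_charset])

raw = encode_column("\u041f\u0440\u0438\u0432\u0435\u0442", "WIN1251")
print(len(raw), decode_column(raw, "WIN1251") == "\u041f\u0440\u0438\u0432\u0435\u0442")
```

The point is that the server never converts anything: every
conversion is explicit and happens on the client side.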

So, again to the question of storing your UCS-4 character data:

I see these options:

A) Use CHAR(4*N) CHARACTER SET OCTETS (FB's BITSTRING) to store
the UCS-4 unchanged. Preferably store it big endian. Not pretty.

B) Store as UTF16BE, using the fbintl DLL from pjcolkit:
http://www.jodelpeter.de/i18n/fbarch/index.htm
Untested

C) Store UTF-8 in fields declared charset NONE. Ugly, but works.

D) Use UNICODE_FSS, but if you really have non-BMP chars,
it is better to store not the UTF-8 form but CESU-8:
http://www.unicode.org/reports/tr26/
Should work. Sort of. Feedback welcome.
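
For option D), Python has no built-in CESU-8 codec, so here is
a minimal sketch of the conversion as I read TR26: each non-BMP
character is split into its UTF-16 surrogate pair, and each
surrogate is then written as a 3-byte UTF-8-style sequence.
Both function names are my own, and this is untested against
Firebird.

```python
def cesu8_encode(text):
    """Encode text as CESU-8: non-BMP chars become two 3-byte sequences."""
    out = bytearray()
    for ch in text:
        cp = ord(ch)
        if cp > 0xFFFF:
            cp -= 0x10000
            high = 0xD800 + (cp >> 10)    # high (lead) surrogate
            low = 0xDC00 + (cp & 0x3FF)   # low (trail) surrogate
            for unit in (high, low):
                out += chr(unit).encode("utf-8", "surrogatepass")
        else:
            out += ch.encode("utf-8")
    return bytes(out)

def cesu8_decode(data):
    """Decode CESU-8 bytes, joining surrogate pairs back into one char."""
    with_surrogates = data.decode("utf-8", "surrogatepass")
    return with_surrogates.encode("utf-16-le", "surrogatepass").decode("utf-16-le")

print(cesu8_encode("\U00010000").hex())  # eda080edb080
```

No encoded unit is longer than 3 bytes, which is presumably why
this form fits UNICODE_FSS better than 4-byte UTF-8.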

As you can see from the list, it's a rather awkward
choice, but pragmatically speaking each of these options
will work, only none of them gets an award for clean design.
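
To get a feeling for the storage cost of options A) to C), a
small Python sketch that prints the byte forms for a sample
string. The function name and labels are only for illustration.

```python
def storage_forms(text):
    """Return the bytes that options A, B and C would store for text."""
    return {
        "A: UCS-4 big endian in OCTETS": text.encode("utf-32-be"),
        "B: UTF-16BE": text.encode("utf-16-be"),
        "C: UTF-8 in charset NONE": text.encode("utf-8"),
    }

# One ASCII letter plus one non-BMP character (U+10000).
for label, raw in storage_forms("A\U00010000").items():
    print(label, len(raw), raw.hex())
```

For this sample the lengths are 8, 6 and 5 bytes. Only option A)
has a fixed size of 4 bytes per character, which is where the
4*N in the column declaration comes from.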

Also note that options A) and B) defeat the RLE compression
of stored data and may be inefficient for this reason.

Regards,
Peter Jacobi