firebird-support - Re: [firebird-support] Re: Firebird and Unicode queries

Subject	Re: [firebird-support] Re: Firebird and Unicode queries
Author	Helen Borrie
Post date	2005-02-10T07:11:06Z

At 08:52 PM 9/02/2005 -0600, you wrote:

>The UNICODE_FSS is a different unicode than UTF-8 and UTF-16. You may
>be able to cast your strings to the FSS flavor of unicode.

In fact, it is a variant of the same encoding family that UTF-8 belongs to,
where, in the original definition, a character can take between 1 and 6
bytes. The "FSS" bit stands for "File System Safe", as I recall. Anyway,
where it differs from the broader UTF-8 implementation is that Firebird
reserves up to 3 bytes of storage for every character, whereas UTF-8 allows
each character to be represented as its exact number of bytes. In the
widestring handling in some systems, this enables compression in storage.

>I asked pretty much this same question about a month ago, and was
>referred to the firebird-architecture list for an update. Unless Ann or
>Helen has an update, so far as I know, the status is still that there
>are known issues with multibyte character sets, and the ETR is unknown
>but definitely after FB2.0 release.

It's true that Adriano is addressing the entire issue of MBCS, not
specifically Unicode. The objective of that - almost certainly post-Fb 2.0
- is to make the whole area of developing custom character sets and
collations much more transparent and extensible.

But don't confuse that objective with the situation regarding Unicode_FSS
support in the engine. It would be nice to say that "whatever you want to
do with Unicode, you just have to do....this....and magic will
happen". Firebird's Unicode support is not comprehensive, in the way that
some programming language implementations might claim to be. However, to
the extent that it is supported, it does work. The same applies to other
MBCS. There will be a better framework in future for extending MBCS
support but it is far from "unusable", as your advice implies.

Advising someone to store any MBCS as character set NONE will cause more
problems than it will solve. You are comfortable with having databases
bound tightly to a single application programming interface and you are
willing to forgo engine support in favour of something that works for you
in the here and now. That's *your* solution, but it's far from being *the*
solution.
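
A small Python sketch of why that binding matters. It does not talk to
Firebird; it simply treats a stored string as raw bytes, which is roughly
what happens when no conversion is performed in either direction, and then
reads those bytes back under a different client encoding. The function name
round_trip is hypothetical, and the code pages are picked only as examples.

```python
# Illustrative only: simulates "store bytes untouched, decode on the client".

def round_trip(text: str, writer_encoding: str, reader_encoding: str) -> str:
    """Encode text as the writing application would, keep the bytes as-is,
    then decode them as the reading application would."""
    stored = text.encode(writer_encoding)  # no conversion on the way in
    return stored.decode(reader_encoding, errors="replace")  # nor on the way out

original = "caf\u00e9"  # "cafe" with an acute accent on the e

# Same application, same code page: the text survives.
print(round_trip(original, "cp1252", "cp1252"))

# A second client using a different code page sees different characters.
print(round_trip(original, "cp1252", "cp850"))
print(round_trip(original, "utf-8", "cp1252"))
```

Only the first call returns the original text. The stored bytes carry no
record of which encoding produced them, so every client has to know it in
advance.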

In this list and, to a greater extent, in Firebird-Architect, Peter Jacobi
outlined some of the issues that need to be considered in the Unicode
context and in the wider context of MBCS. To describe them as non-trivial
is an understatement. It's also fair to say that standards - such as they
are - are not well established and don't look like settling down any time soon.

>The best choice, so far, seems to be to use no character set in defining
>the table, then save the characters in your application's native
>character set. Since no conversion is performed either inbound or
>outbound, you should be good to go ... but I would empirically test
>this.

I won't argue against empirically testing *everything*, including what
works and doesn't work in the actual locale the users will work in, using
their keyboards and settings. But "no character set" is the worst choice
in all environments except those invented by Americans, for Americans.

./hb