Subject RE: [firebird-support] Unicode
Author David Johnson
On Mon, 2005-04-11 at 11:17 +1000, Helen Borrie wrote:
>
> At 11:13 PM 10/04/2005 +0200, you wrote:
>
> > > Recent reviews on this and the Firebird-architect list have
> > > concluded that the Unicode FSS support is based on an early
> > > version of the unicode specification that does not match the
> > > final documentation, and does not reflect any supported standard.
> >
> >Can somebody confirm that the newest release of FireBird doesn't support the
> >final specification of Unicode?
>
> What is "the final specification of Unicode"? Unicode has numerous
> specifications...and Firebird doesn't support any of them
> directly. Firebird's UNICODE_FSS is a superset of what is currently called
> utf-8, but it's not utf-8. It doesn't support characters having variable
> byte lengths. All characters in unicode_fss are 3 bytes.
>
>

More properly it is a re-encoding of UTF, based on a preliminary draft
of the UTF-8 specification, that was intended to be UNIX friendly.
See posting by David Schnepper:
http://groups.yahoo.com/group/Firebird-Architect/message/4815


> > > However, I have been able to verify that Firebird will
> > > correctly store and retrieve unicode UTF-8 characters from
> > > Java using the jaybird JDBC driver, provided the character
> > > set of the database/column is set to none. The caveats are
> > > that your column size is a byte size, not a character size,
> > > and that only byte order collation is meaningful from the
> > > database. I used arabic, chinese, tibetan, english, and
> > > french strings to verify this.
> >
> >Are Unicode represented as UTF-8 i FireBird or am I confused?
>

Firebird's Unicode is not UTF-8. The FSS encoding is based on an early
UTF-8 draft, intended to be friendly to UNIX file systems. But it is a
distinctly different creature from the current incarnation of UTF-8.

For example, a UTF-8 character may occupy from 1 to 4 bytes (up to 6
under the original definition, before RFC 3629 restricted it). A UNICODE
FSS character will always occupy precisely three bytes. As far as I can
tell, Unicode FSS is understood by very few tools. UTF-8, on the other
hand, is understood by many tools. Most current generation *NIX
operating systems are built around UTF-8 (not ASCII).
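
To make the variable-width point concrete, here is a minimal Python
sketch (not Firebird code) that prints how many bytes one character from
each of several scripts occupies in UTF-8. The sample characters and the
helper name utf8_length are my own choices for illustration.

```python
# Illustrative only: sample characters chosen for this sketch,
# written as escapes so the file itself stays pure ASCII.
samples = {
    "English": "A",            # U+0041
    "French": "\u00e9",        # U+00E9, e with acute accent
    "Arabic": "\u0639",        # U+0639, letter ain
    "Tibetan": "\u0f40",       # U+0F40, letter ka
    "Chinese": "\u4e2d",       # U+4E2D
    "Music": "\U0001d11e",     # U+1D11E, outside the Basic Multilingual Plane
}


def utf8_length(ch: str) -> int:
    """Return the number of bytes one character occupies in UTF-8."""
    return len(ch.encode("utf-8"))


for script, ch in samples.items():
    print(f"{script:8} U+{ord(ch):04X} -> {utf8_length(ch)} byte(s)")
```

The width depends on the character: plain ASCII stays at one byte, while
characters from other scripts take more.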

I believe that conversion code could be written easily (or the existing
drivers fixed). But I also believe that there are a number of
architectural issues to be resolved.

I am a proponent of making the underlying datastore all UTF-8, so code
page conversions become the realm of the session. ASCII will still be
stored as single byte characters, but you can support English, Arabic,
Chinese, etc. without changing your core database engine code. The
sticking points mostly require separation of some key things that we
have lumped together in the past:

1. Separation of character set from localization
2. Separation of string byte length from string character length
3. Recognition that Anglo-centric computing is not a long-term solution
for most new markets
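
A rough Python sketch of the first two separations, assuming the stored
bytes are UTF-8 and the column is sized in bytes. The function names
(fits_in_column, truncate_to_bytes, to_session_encoding) are hypothetical
helpers for illustration, not part of Firebird or any driver.

```python
def fits_in_column(text: str, column_bytes: int) -> bool:
    """True if the UTF-8 encoding of text fits a column sized in bytes."""
    return len(text.encode("utf-8")) <= column_bytes


def truncate_to_bytes(text: str, column_bytes: int) -> str:
    """Trim text so its UTF-8 form fits, never splitting a character."""
    raw = text.encode("utf-8")[:column_bytes]
    # The slice may end inside a multi-byte sequence; errors="ignore"
    # drops that trailing partial character instead of raising.
    return raw.decode("utf-8", errors="ignore")


def to_session_encoding(stored: bytes, session_charset: str) -> bytes:
    """Convert stored UTF-8 bytes to the code page a session asked for.

    Characters the target code page cannot represent become '?'.
    """
    return stored.decode("utf-8").encode(session_charset, errors="replace")


greeting = "\u4f60\u597d, world"        # two Chinese characters plus ASCII
print(len(greeting))                    # character length: 9
print(len(greeting.encode("utf-8")))    # byte length: 13
print(fits_in_column(greeting, 12))     # False: 13 bytes do not fit in 12
```

The store only ever sees UTF-8 bytes; anything code-page specific happens
at the session boundary, which is the separation argued for above.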

> What goes for Java doesn't go for Delphi. Java supports Unicode
> internally, Delphi doesn't. Widestring is the wrong place to go in Delphi
> for Unicode, other than utf-2. What you need is a data access interface
> that understands the difference between the character length of a string
> and its length in bytes. Borland's generic interfaces - and those that
> mimic them - can't make the distinction. IBO (and some others) do make
> that distinction.
>

Dead on, Helen. This is exactly why, when I first started trying to
write an internationalized app (i.e., one that operated inside one
organization that crossed many linguistic and cultural boundaries, and
had to support all of the regions concurrently in real time), I quickly
shifted to the Java world.

There are C libraries for handling UTF-8. Java handles Unicode natively
(its strings are UTF-16 internally, with built-in conversion to and from
UTF-8), and I believe that Python does as well (I use the Java
implementation of Python, so it doesn't count).

> Sorry, no answers, no reduction in confusion, just some pointers to places
> you could go (maybe...)
>
> ./heLen

The Intl branch of Firebird has extensive UTF-8 support. However, I
don't know how close they are to a releasable product. Helen or Ann
probably know more about their status than I do.

Hope this helps, and good luck!