firebird-support - RE: [firebird-support] Unicode

Subject	RE: [firebird-support] Unicode
Author	Helen Borrie
Post date	2005-04-11T01:17:09Z

At 11:13 PM 10/04/2005 +0200, you wrote:

> > Recent reviews on this and the Firebird-architect list have
> > concluded that the Unicode FSS support is based on an early
> > version of the unicode specification that does not match the
> > final documentation, and does not reflect any supported standard.
>
>Can somebody confirm that the newest release of Firebird doesn't support the
>final specification of Unicode?

What is "the final specification of Unicode"? Unicode has numerous
specifications...and Firebird doesn't support any of them
directly. Firebird's UNICODE_FSS is a superset of what is currently called
utf-8, but it's not utf-8. It doesn't support characters having variable
byte lengths. All characters in unicode_fss are 3 bytes.

> > I had poor luck saving unicode from Delphi. Others may have
> > had more success. The Delphi (UCS2) widestring does not
> > support enough code points to handle the full Unicode specification.
>
>Well, as far as I know, wide strings in Delphi represent Unicode just fine.
>Wide strings contain 32-bit characters and Unicode is a 32-bit character
>set.

Not 32-bit, but 16-bit. Here's what the Delphi help says about widestring:

"The WideString type represents a dynamically allocated string of 16-bit
Unicode characters."

Out of the box, Delphi's MBCS support handles at most a two-byte character
set, which rules out using widestrings for unicode_fss.
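
To make the 16-bit point concrete, here is a minimal sketch (in Python rather
than Delphi, purely for illustration) that counts how many 16-bit units a
character needs. It uses UTF-16 as a stand-in for a 16-bit wide string, which
is an assumption: under strict UCS-2 there are no surrogate pairs, so the last
sample could not be represented at all. The helper name utf16_code_units is
made up for this example.

```python
# Count 16-bit units versus code points, using UTF-16 as a stand-in for a
# 16-bit wide-string representation (an assumption made for illustration).

def utf16_code_units(text: str) -> int:
    """Number of 16-bit units needed to hold text when encoded as UTF-16."""
    return len(text.encode("utf-16-le")) // 2

samples = {
    "Latin 'A' (U+0041)": "A",
    "CJK ideograph (U+4E2D)": "\u4e2d",
    "musical symbol (U+1D11E)": "\U0001D11E",  # outside the 16-bit range
}

for label, ch in samples.items():
    print(f"{label}: code points={len(ch)}, 16-bit units={utf16_code_units(ch)}")
```

The first two samples fit in one 16-bit unit each; the third needs two, which
is the kind of code point a pure two-byte representation has no room for.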

> > However, I have been able to verify that Firebird will
> > correctly store and retrieve unicode UTF-8 characters from
> > Java using the jaybird JDBC driver, provided the character
> > set of the database/column is set to none. The caveats are
> > that your column size is a byte size, not a character size,
> > and that only byte order collation is meaningful from the
> > database. I used arabic, chinese, tibetan, english, and
> > french strings to verify this.
>
>Is Unicode represented as UTF-8 in Firebird, or am I confused?

You *are* confused, but you have a right to be. There are few people
around here trying to use unicode_fss with Delphi who are not confused (me
included). Delphi giveth with one hand and taketh away with the other.

Firebird's unicode is not "all-singing, all-dancing Unicode". Beyond the
bottom line that each stored character is exactly three bytes, there is
nothing that is language-specific or locale-specific: no dictionary sort
orders, upper/lower case mappings, etc. The collation sequence is strictly
by the numerical values of the 3-byte words that represent the
characters. However, with both client and database set up with
unicode_fss, whatever characters you throw at it will be stored and
retrieved as 3-byte words.
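
As a rough illustration of what "collation strictly by numerical value"
means in practice, the sketch below sorts a few words by their encoded bytes
and then by a simple case-insensitive key. It is Python, not Firebird, and it
uses UTF-8 bytes as a stand-in for the stored form; the function names and
sample words are made up for the example.

```python
# Compare ordering by raw encoded value with a simple case-insensitive sort.
# UTF-8 bytes stand in for the stored form here (an assumption).

words = ["apple", "Zebra", "\u00e9cole", "zoo"]

def binary_order(items):
    """Sort strictly by the numeric values of the encoded bytes."""
    return sorted(items, key=lambda s: s.encode("utf-8"))

def caseless_order(items):
    """Sort ignoring case only; this is still not a dictionary order."""
    return sorted(items, key=str.casefold)

print(binary_order(words))    # uppercase sorts ahead of all lowercase
print(caseless_order(words))  # case is ignored, accents are not
```

In the binary ordering "Zebra" comes before "apple" because uppercase letters
have lower numeric values. Even the case-insensitive version leaves the
accented word after "zoo"; getting it next to "e" would need a real
language-aware collation, which is exactly what is missing here.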

What goes for Java doesn't go for Delphi. Java supports Unicode
internally, Delphi doesn't. Widestring is the wrong place to go in Delphi
for Unicode, other than UCS-2. What you need is a data access interface
that understands the difference between the character length of a string
and its length in bytes. Borland's generic interfaces - and those that
mimic them - can't make the distinction. IBO (and some others) do make
that distinction.
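
The character-length versus byte-length distinction is easy to see in a few
lines. This is only a sketch in Python, using UTF-8 as a stand-in because
Python has no unicode_fss codec; the function names and the byte limit are
hypothetical and have nothing to do with any particular data access layer.

```python
def char_and_byte_lengths(text: str, encoding: str = "utf-8"):
    """Return (number of characters, number of encoded bytes)."""
    return len(text), len(text.encode(encoding))

def fits_byte_limit(text: str, byte_limit: int, encoding: str = "utf-8") -> bool:
    """True if the encoded text fits in a buffer of byte_limit bytes."""
    return len(text.encode(encoding)) <= byte_limit

for sample in ["hello", "h\u00e9llo", "\u4e2d\u6587"]:
    chars, nbytes = char_and_byte_lengths(sample)
    print(f"{sample!r}: {chars} characters, {nbytes} bytes")
```

An interface that treats the two numbers as the same thing will either
truncate multi-byte text or reject strings that would actually fit, which is
the failure described above.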

If your data access interface has this capability, AND the database
connection is set up with unicode_fss as the client language AND the
default character set of the database is unicode_fss then you are half-way
there.

(I started to write a whole lot of stuff about some experiments I was doing
with UNICODE_FSS and Delphi with IB Objects a few months ago...but deleted
it. I got distracted from this exercise at the point where unicode_fss
data in Simplified Chinese was displaying properly in the IBO data-aware
controls and I think I was sending a valid search string across the API...)

I have a project on the drawing board to write a complete sample app that
demonstrates how unicode_fss (albeit with its limitations) can be made to
work with Delphi and IBO. I haven't arrived at a point where I have a
potted-up set of answers. I had breakthroughs at two points, before this
project went on the back-burner. I'll get back to it one day soon.

One UI tip you might not be aware of is that you do need to set your
data-aware controls to display non-ASCII character sets. It's a given that
you must use a font that supports Unicode, of course, and the font must be
capable of supporting the particular language set you want to
display. Beyond that, you have to set the Charset property of that font to
the appropriate language locale. If you're working with a single language,
you can make these settings statically. If you are trying to use Unicode
as a way to support multiple languages (e.g. both Chinese and English) in a
single session, there will be a lot of run-time pushing and pulling to do
and - depending on how your applications are intended to work - you may
assist yourself by actually identifying on your record structures *which*
language that particular record is storing.
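
One way to picture the "tag each record with its language" idea is sketched
below, in Python for brevity. The numeric values are the standard Windows GDI
character-set identifiers that a font's Charset property takes; everything
else (the TextRecord structure, the language tags, the lookup table and the
fallback choice) is hypothetical and only meant to show the shape of the
approach.

```python
from dataclasses import dataclass

# Windows GDI character-set identifiers, as used by a font's Charset property.
ANSI_CHARSET = 0
DEFAULT_CHARSET = 1
SHIFTJIS_CHARSET = 128
GB2312_CHARSET = 134
CHINESEBIG5_CHARSET = 136
ARABIC_CHARSET = 178

# Hypothetical language tags chosen for this sketch.
CHARSET_BY_LANGUAGE = {
    "en": ANSI_CHARSET,
    "fr": ANSI_CHARSET,
    "zh-Hans": GB2312_CHARSET,
    "zh-Hant": CHINESEBIG5_CHARSET,
    "ja": SHIFTJIS_CHARSET,
    "ar": ARABIC_CHARSET,
}

@dataclass
class TextRecord:
    """A row that carries its own language tag alongside the text."""
    record_id: int
    language: str
    text: str

def charset_for(record: TextRecord) -> int:
    """Pick the font charset for a record, falling back to DEFAULT_CHARSET."""
    return CHARSET_BY_LANGUAGE.get(record.language, DEFAULT_CHARSET)

row = TextRecord(1, "zh-Hans", "\u4e2d\u6587")
print(row.language, charset_for(row))
```

With the tag stored on the record, the run-time switching becomes a lookup
when the row is displayed rather than guesswork based on the text itself.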

In some editions of Delphi, extended international character support
apparently comes as an optional extra, too.

You might like to try the TNT components as part of your exploration.

Sorry, no answers, no reduction in confusion, just some pointers to places
you could go (maybe...)

./heLen