Subject RE: [firebird-support] Unicode
Author Svend Meyland Nicolaisen
> -----Original Message-----
> From: Helen Borrie [mailto:helebor@...]
> Sent: 11. april 2005 03:17
> To:
> Subject: RE: [firebird-support] Unicode
> At 11:13 PM 10/04/2005 +0200, you wrote:
> > > Recent reviews on this and the Firebird-architect list have
> > > concluded that the Unicode FSS support is based on an early version
> > > of the unicode specification that does not match the final
> > > documentation, and does not reflect any supported standard.
> >
> >Can somebody confirm that the newest release of FireBird doesn't
> >support the final specification of Unicode?
> What is "the final specification of Unicode"? Unicode has
> numerous specifications...and Firebird doesn't support any of
> them directly.

By "final specification" I meant "latest specification version".

> Firebird's UNICODE_FSS is a superset of what
> is currently called utf-8, but it's not utf-8. It doesn't
> support characters having variable byte lengths. All
> characters in unicode_fss are 3 bytes.

OK, so UNICODE_FSS should be encoded as UTF-8 with the exception that all
characters are 3 bytes long.

Does this mean that the character 0x007F is encoded as 0x00, 0x00, 0x7F and
the character 0x0100 is encoded as 0x00, 0x02, 0x80?

The question might seem trivial but I'm not very familiar with either
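
For comparison, here is a small Python sketch of how standard UTF-8 encodes
the two code points asked about above, plus one Chinese character. It shows
plain UTF-8 only; whether UNICODE_FSS stores these same bytes or pads them
to three bytes is exactly the open question, and this sketch does not answer
it. The helper name utf8_hex is made up for illustration.

```python
def utf8_hex(code_point: int) -> str:
    """Return the standard UTF-8 bytes of one code point as spaced hex."""
    return " ".join(f"0x{b:02X}" for b in chr(code_point).encode("utf-8"))


# Standard UTF-8 is variable length: 1, 2 and 3 bytes for these three.
for cp in (0x007F, 0x0100, 0x4E2D):
    print(f"U+{cp:04X} -> {utf8_hex(cp)}")
```

In standard UTF-8 the character 0x007F is a single byte and 0x0100 is two
bytes, so a fixed three-byte form would have to differ from it.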

> > > I had poor luck saving unicode from Delphi. Others may have had
> > > more success. The Delphi (UCS2) widestring does not support enough
> > > code points to handle the full Unicode specification.
> >
> >Well as far as I know wide strings in Delphi do represent
> >Unicode just fine. Wide strings contain 32 bit characters and Unicode
> >is a 32 bit character set.
> Not 32-bit, but 16-bit. Here's what the Delphi help says
> about widestring:
> "The WideString type represents a dynamically allocated
> string of 16-bit Unicode characters."

Yes, 16 bits of course. When will I learn to go to bed when I'm too tired? ;-)

> Out of the box, Delphi's MBCS support supports at most a
> two-byte character set, which rules out using widestrings for
> unicode_fss.

Well, it should be possible to use widestrings with Firebird as long as you
convert the data when reading from and writing to Firebird. As widestrings
contain 16-bit characters, they must have enough code points to represent
the complete Unicode character set.

I am writing my own "data access library" which uses the Firebird API
directly, so I can make all the conversions that I need.
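
A minimal sketch of the kind of conversion I have in mind, written in Python
rather than Delphi for brevity. It assumes the widestring buffer holds
UTF-16 little-endian data and that the database side wants UTF-8 bytes; the
function names are hypothetical and not part of any Firebird or Delphi API.
One caveat it makes visible: a code point above U+FFFF takes two 16-bit
units (a surrogate pair), so "one 16-bit unit per character" only holds
inside the Basic Multilingual Plane.

```python
def widestring_to_utf8(utf16_le: bytes) -> bytes:
    """Convert a UTF-16LE buffer (as a widestring would hold) to UTF-8."""
    return utf16_le.decode("utf-16-le").encode("utf-8")


def utf8_to_widestring(utf8: bytes) -> bytes:
    """Convert UTF-8 bytes read from the database back to UTF-16LE."""
    return utf8.decode("utf-8").encode("utf-16-le")


def utf16_units(text: str) -> int:
    """Count the 16-bit code units needed to hold text as UTF-16."""
    return len(text.encode("utf-16-le")) // 2


wide = "\u4e2d\u6587".encode("utf-16-le")   # two Chinese characters
stored = widestring_to_utf8(wide)           # what would be written
assert utf8_to_widestring(stored) == wide   # the round trip is lossless

print(utf16_units("\u4e2d\u6587"))   # two characters, two 16-bit units
print(utf16_units("\U0001D11E"))     # one character, two 16-bit units
```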

> > > However, I have been able to verify that Firebird will
> > > correctly store and retrieve unicode UTF-8 characters from
> > > Java using the jaybird JDBC driver, provided the character
> > > set of the database/column is set to none. The caveats are
> > > that your column size is a byte size, not a character size,
> > > and that only byte order collation is meaningful from the
> > > database. I used arabic, chinese, tibetan, english, and
> > > french strings to verify this.
> >
> >Is Unicode represented as UTF-8 in Firebird, or am I confused?
> You *are* confused, but you have a right to be. There are few people
> around here trying to use unicode_fss with Delphi who are not confused
> (me included). Delphi giveth with one hand and taketh away with the other.
> Firebird's unicode is not "all-singing, all-dancing Unicode". Beyond the
> bottom line that each stored character is exactly three bytes, there is
> nothing that is language-specific or locale-specific: no dictionary sort
> orders, upper/lower case mappings, etc. The collation sequence is strictly
> by the numerical values of the 3-byte words that represent the
> characters. However, with both client and database set up with
> unicode_fss, whatever characters you throw at it will be stored and
> retrieved as 3-byte words.

What would it take to make Firebird implement correct dictionary sort orders
and upper/lower case mappings? Is there some fundamental problem, given that
it hasn't been done yet?

Where do I find documentation on how to implement collations in Firebird
and/or how strings are handled in Firebird?
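
To illustrate the difference between collating by byte values and something
closer to a dictionary order, here is a rough Python sketch. The byte-order
key sorts by the encoded bytes, which is the behaviour described in the
quoted reply. The "rough dictionary" key simply strips accents and folds
case; it is a crude stand-in that I made up for the comparison, not a real
locale-aware collation and not anything Firebird provides.

```python
import unicodedata


def byte_order_key(text: str) -> bytes:
    """Sort key that compares strings by their encoded byte values."""
    return text.encode("utf-8")


def rough_dictionary_key(text: str) -> str:
    """Crude approximation of dictionary order: drop accents, fold case."""
    decomposed = unicodedata.normalize("NFD", text)
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return stripped.casefold()


words = ["zebra", "\u00e9clair", "Apple", "apple"]

# Byte order puts the accented word last, after "zebra".
print(sorted(words, key=byte_order_key))

# The rough dictionary key places it between "apple" and "zebra".
print(sorted(words, key=rough_dictionary_key))
```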

> What goes for Java doesn't go for Delphi. Java supports Unicode
> internally, Delphi doesn't. Widestring is the wrong place to go in Delphi
> for Unicode, other than utf-2. What you need is a data access interface
> that understands the difference between the character length of a string
> and its length in bytes. Borland's generic interfaces - and those that
> mimic them - can't make the distinction. IBO (and some others) do make
> that distinction.
> If your data access interface has this capability, AND the database
> connection is set up with unicode_fss as the client language AND the
> default character set of the database is unicode_fss then you are half-way
> there.

I have no problems representing Unicode with widestrings in other
applications so, as I've said, with the correct conversion between character
set representations it must be possible to use widestrings with Firebird.
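
The distinction between character length and byte length mentioned in the
quoted text can be sketched as follows. This is only an illustration in
Python; the function names are hypothetical, and the three-bytes-per-
character figure for a unicode_fss column is taken from the quoted reply,
not verified by me.

```python
def lengths(text: str) -> dict:
    """Report the length of text in characters and in encoded bytes."""
    return {
        "characters": len(text),
        "utf8_bytes": len(text.encode("utf-8")),
        "utf16_bytes": len(text.encode("utf-16-le")),
    }


def buffer_bytes_for(char_count: int, bytes_per_char: int = 3) -> int:
    """Buffer size for a column of char_count characters.

    bytes_per_char=3 is an assumption taken from the quoted description
    of unicode_fss.
    """
    return char_count * bytes_per_char


print(lengths("abc"))            # same count in characters and UTF-8 bytes
print(lengths("\u4e2d\u6587"))   # 2 characters, but more bytes
print(buffer_bytes_for(10))      # bytes to reserve for a 10-character column
```

A data access layer that only sees the byte count would misjudge how many
characters fit in a column, which is the problem the quoted text describes.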

> (I started to write a whole lot of stuff about some experiments I was doing
> with UNICODE_FSS and Delphi with IB Objects a few months ago...but deleted
> it. I got distracted from this exercise at the point where unicode_fss
> data in Simplified Chinese was displaying properly in the IBO data-aware
> controls and I think I was sending a valid search string across the API...)
> I have a project on the drawing board to write a complete sample app that
> demonstrates how unicode_fss (albeit with its limitations) can be made to
> work with Delphi and IBO. I haven't arrived at a point where I have a
> potted-up set of answers. I had breakthroughs at two points, before this
> project went on the back-burner. I'll get back to it one day soon.
> One UI tip you might not be aware of is that you do need to set your
> data-aware controls to display non-ascii character sets. It's a given that
> you must use a font that supports Unicode, of course, and the font must be
> capable of supporting the particular language set you want to
> display. Beyond that, you have to set the Charset property of that font to
> the appropriate language locale. If you're working with a single language,
> you can make these settings statically. If you are trying to use Unicode
> as a way to support multiple languages (e.g. both Chinese and English) in a
> single session, there will be a lot of run-time pushing and pulling to do
> and - depending on how your applications are intended to work - you may
> assist yourself by actually identifying on your record structures *which*
> language that particular record is storing.
> In some editions of Delphi, extended international character support
> apparently comes as an optional extra, too.

Well, I am aware of all the pitfalls that the Delphi UI and fonts can
present when trying to display Unicode strings. The fact that Delphi's UI
only supports 8-bit characters, and multi-byte character sets only to some
extent, does make development somewhat more difficult, but it can be made to
work if you can live with the limit of one 8-bit code page per string you
want to display. Using Windows API functions like DrawTextW, Unicode
widestrings can be displayed directly without fiddling with character sets
and/or code pages.
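
The "one 8-bit code page per string" limit can be sketched in Python, using
its built-in cp1251 (Cyrillic) and cp1252 (Western European) codecs as
stand-ins for the Windows code pages. The helper fits_code_page is a
made-up name; the point is only that a string mixing scripts cannot be
converted to any single 8-bit code page.

```python
def fits_code_page(text: str, code_page: str) -> bool:
    """Return True if every character of text exists in the code page."""
    try:
        text.encode(code_page)
        return True
    except UnicodeEncodeError:
        return False


russian = "\u041f\u0440\u0438\u0432\u0435\u0442"   # a Russian word
mixed = russian + " \u4e16\u754c"                  # Russian plus Chinese

print(fits_code_page(russian, "cp1251"))   # Cyrillic text, Cyrillic code page
print(fits_code_page(russian, "cp1252"))   # Cyrillic text, Western code page
print(fits_code_page(mixed, "cp1251"))     # mixed scripts, one code page
```

Only the first call succeeds; the mixed string would have to be drawn as
Unicode (for example through DrawTextW) or split into per-code-page runs.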