Subject Re: [firebird-support] Re: Firebird and Unicode queries
Author Helen Borrie
At 12:47 PM 10/02/2005 +0000, you wrote:


>Hi all,
>
>First thanks for all suggestions. I hope that this discussion is
>useful for others too...
>
>But I'm still confused. I've been playing around with all options,
>and never had success with Unicode, the way I see it..
>Here are some points I try to solve:
>
>1. Setting the default charset to UNICODE_FSS looks to me like a big
>waste of resources (space, speed, indexing, etc.). Usually a table
>needs only a few fields with MBCS characters. I'd prefer to define
>only those few columns with this charset.

In that case, it gets down to extra effort for columns that don't match the
default charset. (I hope you understand that the default charset only
affects character types and text blobs...)

>2. Using CHARSET NONE is problematic, I think, as I must provide open
>(in this case ODBC) access to the data - only INSERTS and some
>SELECTS are handled by the client GUI, data-mining should be done
>with other application.

I don't quite understand what you are getting at. ODBC access would be
done using a Firebird driver, which interfaces the Firebird API with the
ODBC layer.

>3. Setting the connection type to UNICODE_FSS also looks like a
>problem - how does this affect the data passed into non-Unicode
>fields? Any overhead?

Forget "overhead". Please try to get your head around the purpose of
lc_ctype (see my reply to Scott Morgan). The lc_ctype has to match the
default charset of the database. Period.
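In isql, for instance, the client character set (lc_ctype) can be declared before connecting. A sketch only - the database path and credentials here are placeholders:

```sql
/* Declare the client character set before connecting.
   Path, user and password are placeholders. */
SET NAMES UNICODE_FSS;
CONNECT 'localhost:C:\data\mydb.fdb' USER 'SYSDBA' PASSWORD 'masterkey';
```

Other interfaces (ODBC, the API's DPB) have their own way of passing the same lc_ctype setting.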

>4. The best solution would be, if I can somehow pass and retrieve
>Unicode data on the fly, using only SQL syntax for this. Helen
>proposed the use of "where UField = _UNICODE_FSS 'carrots'" - I
>haven't yet tested this.

If you are going to store mixed character sets in the same database, you'll
need either that (introducer) syntax or CASTs for all columns whose
charsets differ from the database's default charset. This isn't
applicable just to MBCS - it applies in all cases where you are mixing
charsets.
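For illustration, assuming a hypothetical table TBL with a column UFIELD declared as VARCHAR(100) CHARACTER SET UNICODE_FSS, in a database whose default charset is something else:

```sql
/* Introducer syntax for a literal search:
   the _UNICODE_FSS prefix tells the engine how to
   interpret the bytes of the string literal. */
SELECT * FROM TBL WHERE UFIELD = _UNICODE_FSS 'carrots';

/* CAST for a parameterised search: the parameter is
   explicitly converted to the column's charset. */
SELECT * FROM TBL
  WHERE UFIELD = CAST(? AS VARCHAR(100) CHARACTER SET UNICODE_FSS);
```

The table and column names are made up; the syntax is what matters.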

>5. How is the translation of UTF-16 (the Windows default, for
>example) and UTF-8 to UNICODE_FSS handled? Does the engine take care
>of this?

Yes. See the reply to Scott Morgan.

>The UTF-8/16 formats are not fixed size!! The size may vary, so there
>must be some piece of code somewhere to handle the translation to the
>fixed 3-byte format.

There is. But it's a pretty distant likelihood that UNICODE_FSS will have
a mapping for any UTF characters wider than 3 bytes. Unicode wouldn't be
the ideal choice for the charset of the system locale. It would normally
make better sense to choose an OEM character set. If you have an
application that needs to cater for all possible character images on
earth, you will definitely have a problem or two!
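The variable widths are easy to see from the encodings themselves. A minimal Python sketch (Python standing in for whatever the client is written in) showing that characters in the Basic Multilingual Plane never need more than 3 bytes of UTF-8 - which is why a 3-byte format such as UNICODE_FSS can hold them:

```python
# UTF-8 is variable-width: 1 to 4 bytes per character.
# BMP characters (U+0000..U+FFFF) encode to at most 3 bytes.
samples = {
    "A": 1,            # ASCII, U+0041
    "\u00e9": 2,       # e-acute, U+00E9
    "\u4e2d": 3,       # CJK ideograph, U+4E2D
    "\U0001f600": 4,   # emoji, U+1F600 -- outside the BMP
}
for ch, expected in samples.items():
    assert len(ch.encode("utf-8")) == expected

# UTF-16 is variable-width too: BMP characters take 2 bytes,
# anything above U+FFFF needs a 4-byte surrogate pair.
assert len("\u4e2d".encode("utf-16-le")) == 2
assert len("\U0001f600".encode("utf-16-le")) == 4
```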

>6. I see very strange behaviour now, having the following setup:
>- Few columns defined as UNICODE_FSS
>- Client connects with UNICODE_FSS in the connection string, via ODBC
>- Data is inserted as-is from the GUI (supporting Unicode, under
>Windows)
>When I try to do a select with Unicode characters, I get ALL rows
>with unicode values... I saw similar description of this behavior in
>other forums/sites..
>What I do wrong?

First, with UNICODE_FSS in the connection string and the default charset in
the database something else, you have a mismatch.

Second, all search criteria will be passed in the default charset unless
you use an introducer (for literal searches) or cast (for parameterised
searches). So, if your default charset is NONE, and the characters passed
from the UI are anything but those represented by ascii codes < 128, a
garbage string will be passed. Throwing an illicit UNICODE_FSS lc_ctype
into that mix would simply make things more colourful.

>7. Firebird seems to be somehow removing Unicode information from
>English or other local language.. Then, when returned back from the
>DB, it comes back as ASCII - the localized language is one-byte
>encoded..

If you pass a string of single-byte characters into a charset NONE column,
they will be stored as ASCII. ASCII and ANSI characters up to decimal 127,
and the Unicode characters equivalent to them, are the same. Windows (at
least; I don't know for sure about Linux) will knock them down to
single-byte regardless. If any of those strings are then to be stored in a
column defined with a single-byte charset, that's how Firebird stores
them. If they are to be stored in an MBCS column, and the bridging is
correct, then Firebird stores them as 3-byte Unicode.
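The point about codes below 128 can be checked directly - here in Python, purely for illustration:

```python
# For code points below 128, ASCII, the Windows ANSI code page
# and UTF-8 all produce the same single byte per character, so
# such strings pass through a charset NONE column unchanged.
s = "carrots"
assert s.encode("ascii") == s.encode("cp1252") == s.encode("utf-8") == b"carrots"

# Above 127 the encodings diverge: cp1252 stores one byte for
# e-acute, UTF-8 stores two -- mixing them through charset NONE
# without an introducer or cast yields a garbage string.
assert "\u00e9".encode("cp1252") == b"\xe9"
assert "\u00e9".encode("utf-8") == b"\xc3\xa9"
```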

From Firebird 1.5.1 forward, a change was made so that a string passed
across a charset NONE connection to a column in the default charset of the
database gets bridged automatically. See "Character set NONE data now
accepted 'as-is'" on p. 2 of the v.1.5.1 release notes (which you can pick
up on the Firebird 1.5 main downloads page at www.ibphoenix.com). This can
be of help in some mixed-charset databases.

./hb