Subject Re: [firebird-support] Re: Firebird and Unicode queries
Author David Johnson
On Wed, 2005-02-09 at 18:40, Helen Borrie wrote:
>
>
> At 08:38 PM 9/02/2005 +0000, you wrote:
>
> >Is this really so??? I decided to use FB, based on the statement it
> >supports Unicode.. And it does have a UNICODE_FSS character type..
> >Do you mean it is not really supported?
>
> Of course it is supported. However, there are no collations, other than
> strictly binary, which means that there are no dictionary sort orders or
> upper/lowercase mappings, other than for the characters in the range of the
> first 128 characters (US ASCII equivalents).
>
> Also, since UNICODE_FSS stores every character as exactly 3 bytes, the
> maximum size of indexes (up to and including Fb 1.5.x) is one-third or
> less of that for single-byte character sets. In practice, this means you
> can't index string fields larger than ~84 characters - less if you use
> multi-segment indexes.
>
>
> >--- In firebird-support@yahoogroups.com, David Johnson wrote:
> > >
> > > To store utf-8 or utf-16, you should declare the column with no
> > > character set.
>
> This is not the right advice, although it might be a workaround that makes
> sense to Java programmers who are used to working with / prefer to take
> advantage of an application language that can manipulate unicode according
> to some idiomatic Java conventions. It will lock you into storing data
> that neither the database engine nor other application languages can make
> any sense of. It will make most string expressions unavailable. It goes
> right against the principle of storing data that is independent of the
> application languages through which it might be accessed.

UTF-8 and UTF-16 are standards that are independent of language -
programming, database, or natural. If you want to store data from
dissimilar languages in the same columns in the same database instance,
it is necessary to have a character encoding that supports all of these
at the same time.
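
To make that concrete, here is a minimal Java sketch (the sample text is
arbitrary): one UTF-8 byte stream can carry Chinese, Russian, Swedish and
English at the same time, and round-trips losslessly.

    import java.io.UnsupportedEncodingException;

    public class MixedScripts {
        public static void main(String[] args) throws UnsupportedEncodingException {
            // one string mixing Chinese, Russian, Swedish and English
            String mixed = "\u4E2D\u6587 \u0440\u0443\u0441\u0441\u043A\u0438\u0439"
                    + " sm\u00F6rg\u00E5s english";
            byte[] utf8 = mixed.getBytes("UTF-8");    // encode once
            String back = new String(utf8, "UTF-8");  // decode once
            System.out.println(back.equals(mixed));   // true - nothing lost
        }
    }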

In Unicode, _all_ collation and sorting is idiomatic to a locale
(although the first 128 code points do correspond to the old US-ASCII
set, and 128 through 255 to the ISO-8859-1 definitions). Java provides
easy access to all of those locale rules in its standard library, but the
UTF standards are not limited to the Java world.
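
For instance, java.text.Collator hands you those per-locale rules
directly. A minimal sketch (the three sample letters are chosen to show
off the Swedish ordering I come back to below):

    import java.text.Collator;
    import java.util.Arrays;
    import java.util.Locale;

    public class CollationDemo {
        public static void main(String[] args) {
            String[] letters = { "\u00C5", "Z", "A" };      // Å, Z, A

            // Swedish rules: Å is a distinct letter sorting after Z
            Arrays.sort(letters, Collator.getInstance(new Locale("sv", "SE")));
            System.out.println(Arrays.toString(letters));   // [A, Z, Å]

            // English rules: Å collates as an accented A, right after A
            Arrays.sort(letters, Collator.getInstance(Locale.ENGLISH));
            System.out.println(Arrays.toString(letters));   // [A, Å, Z]
        }
    }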

You can't get much more independent, in the modern world, than to store
data in a UTF flavor (too bad there is more than one).
>
> If you want to store unicode characters, declare the columns as
> UNICODE_FSS. If you use character set NONE (or any other single-byte
> character set) the database engine has no way at all to know that each
> character is represented by a 3-byte word.

In UTF-8, a character occupies anywhere from 1 to 4 bytes (the original
specification allowed for up to 6); in UTF-16 it occupies 2 or 4 bytes.
Either way, the byte count per character is variable.
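
A small sketch of that variability (UTF-16BE is used here only to keep
the byte-order mark out of the counts):

    import java.io.UnsupportedEncodingException;

    public class ByteCounts {
        public static void main(String[] args) throws UnsupportedEncodingException {
            // an ASCII letter, Å, a Chinese ideograph, and a
            // supplementary-plane character (U+20000)
            String[] samples = { "A", "\u00C5", "\u4E2D", "\uD840\uDC00" };
            for (String s : samples) {
                System.out.println("U+" + Integer.toHexString(s.codePointAt(0))
                        + "  UTF-8: " + s.getBytes("UTF-8").length       // 1, 2, 3, 4
                        + "  UTF-16: " + s.getBytes("UTF-16BE").length); // 2, 2, 2, 4
            }
        }
    }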

>
> Applications should take care of ensuring that the interface sees the right
> character images - the database engine doesn't deliver character images -
> and they should ensure that search criteria passed to the database are
> recognised as UNICODE_FSS.

But here is where it gets sticky ... If your application is used by a
multinational firm that stores data in the same tables in all of Chinese
ideograms, Swedish, Norwegian, Russian and English, then your character
encoding (not font) must support all of those languages at once.

The A with a ring on top - Angstrom to English speakers - is just a funny
looking "A" in English, but in Norwegian and Swedish it is a distinct
letter that sorts after Z: the last letter of the alphabet in Norwegian,
and the first of the three extra letters following Z in Swedish. In those
languages it is a semantically distinct character that changes the
pronunciation and meaning of the words it appears in.

The Unicode code point 197 (U+00C5) always denotes that Angstrom
character, regardless of language. The display of the character is the
responsibility of the GUI/text rendering engine, and the collation is the
responsibility of the localized sort engine. Furthermore, it cannot
possibly be confused with part of a Chinese ideogram in Big5, a Greek
letter, or one of the old DOS box-drawing characters - all things the
single byte value 197 can mean under other code pages.
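
A sketch of that in Java (java.text.Normalizer arrived later, in Java 6,
so treat this part as illustrative):

    import java.text.Normalizer;

    public class CodePoint197 {
        public static void main(String[] args) {
            char aRing = (char) 197;      // U+00C5 in any Unicode environment
            System.out.println(aRing);    // prints Å

            // even the separate ANGSTROM SIGN code point (U+212B)
            // canonically normalizes to the very same letter
            String n = Normalizer.normalize("\u212B", Normalizer.Form.NFC);
            System.out.println(n.equals("\u00C5"));   // true
        }
    }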

More importantly, there is no fiddling with code pages. You can safely
mix data from different languages in a common table, and know that all
systems should (allegedly - and that is a big caveat, since the Unicode
standards are still relatively immature) behave correctly with regard to
the data.

>
> If you want *all* of your string fields to be stored as unicode, you should
> make UNICODE_FSS the default character set of the database and always use
> UNICODE_FSS as the lc_ctype of the client connection.
>
> If the database default charset is not UNICODE_FSS correctly paired with a
> matching client lc_ctype, but columns are arbitrarily declared as
> unicode_fss then the client work to achieve proper character recognition
> by the engine is more complicated, viz.

This is a good point.

>
> -- in non-parameterised statements, you'll have to prefix each string with
> the introducer _UNICODE_FSS, e.g.
>
> where UField = _UNICODE_FSS 'carrots'
>
> -- in parameterised statements, you will have to use a cast expression in
> the input string, e.g.
>
> where UField = cast(? as varchar(n) character set UNICODE_FSS)
>
> (where (n) is the declared size of UField in the table definition).
>
> ./hb
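
Putting the lc_ctype pairing and the cast together in JDBC terms - this
is only a sketch: the Jaybird driver class and URL syntax are as I recall
them, and UTable with its VARCHAR(30) UNICODE_FSS column UField is a
made-up stand-in:

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.PreparedStatement;
    import java.sql.ResultSet;

    public class UnicodeFssQuery {
        public static void main(String[] args) throws Exception {
            Class.forName("org.firebirdsql.jdbc.FBDriver");

            // lc_ctype in the URL sets the client connection character set;
            // host, path and credentials below are placeholders
            Connection con = DriverManager.getConnection(
                "jdbc:firebirdsql:localhost/3050:/data/test.fdb?lc_ctype=UNICODE_FSS",
                "sysdba", "masterkey");
            try {
                // cast the parameter so the engine sees it as UNICODE_FSS
                PreparedStatement ps = con.prepareStatement(
                    "select UField from UTable"
                    + " where UField = cast(? as varchar(30) character set UNICODE_FSS)");
                ps.setString(1, "carrots");
                ResultSet rs = ps.executeQuery();
                while (rs.next()) {
                    System.out.println(rs.getString(1));
                }
                rs.close();
                ps.close();
            } finally {
                con.close();
            }
        }
    }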