firebird-support - Re: [firebird-support] Re: Firebird and Unicode queries

Subject	Re: [firebird-support] Re: Firebird and Unicode queries
Author	Lester Caine
Post date	2005-02-10T05:46:51Z

David Johnson wrote:

>>>--- In firebird-support@yahoogroups.com, David Johnson wrote:
>>>
>>>>To store utf-8 or utf-16, you should declare the column with no
>>>>character set.
>>
>>This is not the right advice, albeit it might be a workaround that makes

<SNIP>

> UTF-8 and UTF-16 are standards that are independent of language -
> programming, database, or natural. If you want to store data from
> dissimilar languages in the same columns in the same database instance,
> it is necessary to have a character encoding that supports all of these
> at the same time.

David - Helen's point is that using NONE to store UTF-8 is
the wrong answer.
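
A minimal sketch of why that matters, in plain Python rather than anything Firebird-specific. It assumes that a column with no character set is handled as opaque bytes, so lengths, truncation and the interpretation of those bytes are left to whoever reads them. The names (stored, survives_truncation) are made up for the illustration.

```python
# Illustration only: plain Python bytes standing in for a column that has no
# character set. Nothing here calls Firebird; all names are hypothetical.

text = "caf\u00e9 \u65e5\u672c"   # "café" + space + two kanji: 7 characters
stored = text.encode("utf-8")     # the bytes a UTF-8 client would send

char_count = len(text)            # counted in characters
byte_count = len(stored)          # counted in bytes, which is all a byte store sees

# A client that assumes a single-byte character set misreads the same bytes.
misread = stored[:5].decode("latin-1")


def survives_truncation(data: bytes, n: int) -> bool:
    """Return True if cutting data at n bytes still leaves valid UTF-8."""
    try:
        data[:n].decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


print(char_count, byte_count)          # 7 12
print(repr(misread))                   # 'cafÃ©'
print(survives_truncation(stored, 4))  # False: the cut lands inside the accented e
print(survives_truncation(stored, 5))  # True
```

The store itself never complains; the damage only shows up when something counts, cuts or displays the value as characters.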

> In UTF-8 and UTF-16, the byte count is variable from 1 (or 2) to at
> least 6 bytes

Which is the crux of the problem when trying to manage the data within a
database field. In the good old days you could look at the binary data
and character 'x' would be at position 'x' on ALL records. UNICODE_FSS
maintains that link at the expense of 24 bits per character rather than
8, but then string matching is consistent, and SUBSTRING is a simple
count of characters that does not need 'context'.
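
The position argument can be sketched in a few lines of Python. This is an illustration only, not Firebird internals: FIXED_WIDTH = 3 simply mirrors the 24 bits per character mentioned above, and both helper functions are hypothetical names.

```python
# Illustration only: compares where each character starts in variable-width
# UTF-8 with where it would start if every character had a fixed 3-byte slot.

FIXED_WIDTH = 3  # assumption: 24 bits reserved per character


def utf8_char_offsets(text: str) -> list[int]:
    """Byte offset at which each character starts in the UTF-8 encoding."""
    offsets = []
    pos = 0
    for ch in text:
        offsets.append(pos)
        pos += len(ch.encode("utf-8"))
    return offsets


def fixed_slot_offsets(text: str, width: int = FIXED_WIDTH) -> list[int]:
    """Byte offset of each character when every character gets `width` bytes."""
    return [i * width for i in range(len(text))]


for sample in ["abc", "\u00c5bc", "\u65e5\u672c\u8a9e"]:
    print(utf8_char_offsets(sample), fixed_slot_offsets(sample))
# [0, 1, 2] [0, 3, 6]
# [0, 2, 3] [0, 3, 6]
# [0, 3, 6] [0, 3, 6]
```

With fixed slots, character n is always at byte n * 3, whatever the record holds. With variable-width storage, the offset of character n depends on every character before it, which is the 'context' a SUBSTRING would need.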

> The A with a circle on top (Angstrom to english speakers) is just an "A"
> to english speakers, but it is a distinct letter between A and B in
> norwegian and a distinct letter following about two places after Z in
> swedish (or maybe it's the other way around). In those languages, it is

And you are still thinking on the small scale. I've been building up an
archive of world data, and I have hit this problem as well. Add in
multiple other languages that do not use the ASCII characters at all.
What SHOULD the default ordering be, and how do you manage it in the
indexes - especially when you add 'full text search' ;)
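
To show how quickly "the default ordering" stops being obvious, here is a small Python sketch that sorts the same five words three ways. It is an illustration only: the Swedish alphabet is written out by hand (å, ä, ö after z), real collations carry many more rules, and the function names are hypothetical.

```python
import unicodedata

# Assumption: a hand-written Swedish alphabet, lowercase only, for illustration.
SWEDISH = unicodedata.normalize("NFC", "abcdefghijklmnopqrstuvwxyz\u00e5\u00e4\u00f6")


def alphabet_key(alphabet: str):
    """Build a sort key that ranks letters by their place in `alphabet`."""
    rank = {ch: i for i, ch in enumerate(alphabet)}

    def key(word: str) -> list[int]:
        normalized = unicodedata.normalize("NFC", word.lower())
        # Letters outside the alphabet sort after everything else.
        return [rank.get(ch, len(alphabet)) for ch in normalized]

    return key


def strip_accents_key(word: str) -> str:
    """English-style view: an accented letter sorts as its base letter."""
    decomposed = unicodedata.normalize("NFD", word.lower())
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


words = [unicodedata.normalize("NFC", w) for w in ["zon", "år", "ark", "öl", "ärm"]]

print(sorted(words))                             # raw code point order
print(sorted(words, key=alphabet_key(SWEDISH)))  # Swedish alphabet order
print(sorted(words, key=strip_accents_key))      # accents ignored
# ['ark', 'zon', 'ärm', 'år', 'öl']
# ['ark', 'zon', 'år', 'ärm', 'öl']
# ['år', 'ark', 'ärm', 'öl', 'zon']
```

Three defensible rules give three different orders, and an index can only be built on one of them at a time.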

>>If you want *all* of your string fields to be stored as unicode, you should
>>make UNICODE_FSS the default character set of the database and always use
>>UNICODE_FSS as the lc_ctype of the client connection.

This is still a halfway house, but may be the best we can do in
reality. While you can store UTF-8/16 easily enough, fully updating all
character functions to handle it is (to my mind) a large amount of
effort. THEN one asks the question: does providing that functionality
impinge on performance when only a simple 7-bit set is required?

Those of us who are used to working in one dimension - English - still
have great difficulty contemplating the problems of multidimensional
translations, and are perhaps a little jealous of those who can handle
it so easily. But hopefully these problems are being addressed by the
work being done on INTL?

--
Lester Caine
-----------------------------
L.S.Caine Electronic Services