firebird-architect - Re: [Firebird-Architect] UTF-8 and Compression

Subject	Re: [Firebird-Architect] UTF-8 and Compression
Author	Pavel Cisar
Post date	2005-03-01T08:43:46Z

Dimitry Sibiryakov wrote:

> On 1 Mar 2005 at 8:22, Pavel Cisar wrote:
>
>
>>Pardon my ignorance, but doesn't the decoupling of the character set
>>from collation cause problems? I mean that collation would handle only
>>a subset of UTF-8 for particular charset.
>
> If we have a universal character set we can also have a universal
> collation and special collations for some languages. I mean a base
> collation class that handles everything (including Czech and Chinese)
> and some derived classes for languages with special characters
> ordering/processing.

Well, but that's an implementation detail for handling this internally
without errors, not a real solution to the end user's problem, though.

>>would break that, wouldn't be possible to store for example Czech and
>>Chinese characters in field and then ask for Czech collation which
>>wouldn't handle Chinese with odd results? Well, Czech&Chinese example
>>is a little bit stretched, but you see the point.
>
> Because Czech and Chinese characters don't cross they can be
> handled independently. Do you see a problem if Czech and Chinese
> strings are sorted properly but all Chinese is placed after all
> Czech?
> This case is not different from the current situation when Russian (and
> I guess Czech) characters are sorted after all Latin.

It depends. It *could* be either correct or wrong, depending on the
application's needs. Storing data in the wrong charset (i.e. other than
the one designed for and expected by the application) is an error. If I
declare a field to store data in a specified charset, I would expect
anything else to raise an error. If I wanted a field to store multiple
charsets, I would declare it as UNICODE_FSS. Internally, it could all be
handled in a single charset (UTF-8), but the distinction is still needed
at the logical level. It's a handy data validation tool for developers.
I have nothing against the idea of handling everything in UTF-8
internally; in fact I think that's A Good Idea(tm). But from Jim's
description (especially about the *new API*) I had the impression that
he suggests making charset conversion validation somewhat *relaxed*
(i.e. taking a shortcut from the client charset straight to the
underlying UTF-8, which is possible thanks to charset/collation
decoupling). The problem would manifest itself to the user as unexpected
ordering (or whatever) under the collation used, but in fact it's a
charset validation/conversion issue.
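
The distinction between logical charset and internal storage can be
sketched roughly as follows. This is only an illustration in Python, not
Firebird code: store_value is a hypothetical helper, and Python's
ISO 8859-2 codec stands in for whatever charset tables the engine would
actually use. The value is always stored as UTF-8, but it is first
checked against the charset the field was declared with.

```python
# Hypothetical sketch: store_value is an illustrative name, not a Firebird API.
# Python's built-in codecs stand in for the engine's charset definitions.

def store_value(text: str, declared_charset: str) -> bytes:
    """Validate text against the field's declared (logical) charset,
    then return the UTF-8 bytes that would be stored internally."""
    try:
        # Every character must be representable in the declared charset.
        text.encode(declared_charset)
    except UnicodeEncodeError as exc:
        raise ValueError(
            f"character {text[exc.start]!r} is not in charset {declared_charset}"
        ) from exc
    # Internal representation is UTF-8 regardless of the declared charset.
    return text.encode("utf-8")


czech = "Příliš žluťoučký kůň"
mixed = czech + " 中文"

stored = store_value(czech, "iso8859_2")  # accepted: fits the declared charset
store_value(mixed, "utf-8")               # accepted: field declared as multi-charset
try:
    store_value(mixed, "iso8859_2")       # rejected at the logical level
except ValueError as err:
    print(err)
```

With the "relaxed" shortcut, the last call would succeed silently, and
the mismatch would only surface later as odd ordering under a Czech
collation.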

Best regards
--Pavel