firebird-architect - Re: [Firebird-Architect] UTF-8 and Compression

Subject	Re: [Firebird-Architect] UTF-8 and Compression
Author	Olivier Mascia
Post date	2005-03-01T08:37:45Z

On 1 March 2005, at 08:22, Pavel Cisar wrote:

> Pardon my ignorance, but doesn't the decoupling of the character set
> from collation cause problems? I mean that collation would handle only
> a subset of UTF-8 for particular charset. As long as collation is bound
> to charset, it's clear that field cannot contain characters that
> collation can't handle. Of course, when decoupling is only internal and
> still exists on logical level, there is no problem, but if new API
> would break that, wouldn't be possible to store for example Czech and
> Chinese characters in field and then ask for Czech collation which
> wouldn't handle Chinese with odd results? Well, Czech&Chinese example
> is a little bit stretched, but you see the point. Actually, it's
> another incarnation of problem with collations for UNICODE_FSS charset,
> so it might be already solved in the process to deliver them.

I see the point, Pavel. We must clearly distinguish between
transliteration and collation.
We can convert any charset to its Unicode representation and have this
Unicode representation encoded in UTF-8. The reverse is not true, of
course. Attempting to map an arbitrary UTF-8 string to ISO8859_X or
GB18030 can fail. The Unicode standard contains lots of recommendations
and recipes (good and bad) on how to handle such cases. The code in FB
responsible for conversion from UTF-8 to anything else would have to
implement and clearly document what it does in such cases. This often
means replacing characters that can't be mapped with some default
character (which differs from one charset to another).
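
To make the asymmetry concrete, here is a minimal sketch in Python (not
FB code). The function name from_utf8, the "?" substitute and the choice
of ISO 8859-2 as the Czech charset are assumptions made for
illustration only. Converting per character is fine for a stateless
charset like this one; a real converter would work on whole buffers.

```python
def from_utf8(data: bytes, target: str, substitute: str = "?"):
    """Transliterate UTF-8 bytes to `target`; return (bytes, unmapped count).

    Hypothetical helper: the name, signature and substitute are assumptions.
    """
    out = bytearray()
    unmapped = 0
    for ch in data.decode("utf-8"):
        try:
            out += ch.encode(target)
        except UnicodeEncodeError:
            # No mapping in the target charset: substitute and count,
            # rather than raising an error to the caller.
            out += substitute.encode(target)
            unmapped += 1
    return bytes(out), unmapped


# Czech (ISO 8859-2) to Unicode to UTF-8 always succeeds.
czech = "Čeština".encode("iso8859_2")
as_utf8 = czech.decode("iso8859_2").encode("utf-8")

# UTF-8 holding both Czech and Chinese, sent to an ISO 8859-2 client.
mixed = "Čeština 中文".encode("utf-8")
converted, unmapped = from_utf8(mixed, "iso8859_2")
print(converted.decode("iso8859_2"), unmapped)
```

The two Chinese characters come back as substitutes and the count lets
the caller know that something could not be mapped, without any error
being raised.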

Ability to store a string containing both Czech and Chinese is a must
for global applications.
If a local application is only interested in storing Czech and declares
the whole database charset or the column charsets accordingly, then the
engine must only return the Czech charset to the client application. If
this application or some other process happens to store Chinese in a
field (by some other means, but which one?), then the transliteration
from the internal representation (which understands Chinese) to Czech
must flag those characters which could not be mapped. Such a situation
must not trigger an error at the FB level. The application itself will
handle it if needed.

Regarding collation, I see no issue.
Without collation, all strings would collate according to their Unicode
character values (easy: sorting a UTF-8 buffer solely on its byte
values gives the same order as sorting on the Unicode values of the
characters in this buffer).
With a collation, let's say French or Czech, some of the Unicode values
will be altered so that they sort correctly for that language.
If I'm interested in collating a column in French, it is because I
expect it to contain only French. If it happens to also contain
Chinese, then that data will be sorted all together based on its
Unicode values.
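
The remark that a byte-wise sort of UTF-8 gives the same order as a sort
on Unicode values can be checked with a few lines of Python. The sample
words are arbitrary; Python compares str objects by code point, which is
what makes the comparison meaningful.

```python
words = ["zebra", "école", "中文", "Čeština", "apple", "\U0001F600"]

# Python compares str objects code point by code point.
by_code_point = sorted(words)

# Compare the raw UTF-8 bytes instead, as a plain byte-wise sort would.
by_utf8_bytes = sorted(words, key=lambda s: s.encode("utf-8"))

print(by_code_point == by_utf8_bytes)
print(by_code_point)
```

Both orders are identical, so no decoding is needed to sort in the
absence of a collation.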

So a simplified view of a collation table is a simple map of some
Unicode values to some other binary values. To sort according to the
collation, characters are compared on their Unicode values passed
through that map. Any character not considered by the collation is not
in the map and sorts according to its original Unicode value.
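
A rough Python sketch of that simplified view follows. The table
FRENCH_MAP and the function collation_key are hypothetical names, and
the table is a toy that only folds a few accented letters onto their
base letter. It is not a real French collation, which needs more than a
one-to-one map.

```python
# Hypothetical toy table: a few Unicode values mapped to other values.
FRENCH_MAP = {
    ord("é"): ord("e"),
    ord("è"): ord("e"),
    ord("ê"): ord("e"),
    ord("à"): ord("a"),
    ord("ç"): ord("c"),
}


def collation_key(text: str, table: dict) -> list:
    """Pass each Unicode value through the map; unmapped values stay as-is."""
    return [table.get(ord(ch), ord(ch)) for ch in text]


words = ["zèbre", "étude", "eau", "中文", "ete"]

plain = sorted(words)
french = sorted(words, key=lambda w: collation_key(w, FRENCH_MAP))

print(plain)   # "étude" lands after "zèbre"
print(french)  # "étude" sorts among the "e" words; "中文" is untouched
```

The Chinese string is not in the map, so it keeps its original Unicode
values and simply sorts after the Latin text in both cases.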

--
Olivier Mascia