Subject Re: [Firebird-Architect] UTF-8 and Compression
Author Pavel Cisar
Hi,

Jim Starkey wrote:
>
> Stunningly brilliant (i.e. better than good). We currently have an
> n-squared problem of character sets confounded by collating sequences.
> Switching to a universal internal representation reduces it to a problem
> linear with the number of collating sequences. It breaks the binding
> between character sets and collations. A character set becomes a
> bidirectional mapping between the character set and UTF-8. A collation
> becomes a simple object that compares two UTF-8 strings, generates a
> key, upcases, and downcases (what have I neglected here?). In the new
> API we can probably isolate character set conversions to the client,
> leaving the engine with a single internal representation and collation
> sequences. The legacy API will need to support per-SQLVAR character
> sets, but the "new API" can probably get away with pure UTF-8. New
> layered APIs, the formalization of IscDbc, can be defined with a single
> per-session locale, which fits the Java model and simple sanity nicely.

Pardon my ignorance, but doesn't decoupling the character set from the
collation cause problems? I mean that a collation would handle only the
subset of UTF-8 that corresponds to a particular charset. As long as the
collation is bound to a charset, it's clear that a field cannot contain
characters that the collation can't handle. Of course, if the decoupling
is only internal and the binding still exists on the logical level,
there is no problem. But if the new API broke that binding, wouldn't it
be possible to store, for example, Czech and Chinese characters in one
field and then ask for a Czech collation, which wouldn't handle the
Chinese characters and would give odd results? Well, the Czech and
Chinese example is a little bit stretched, but you see the point.
Actually, it's another incarnation of the problem with collations for
the UNICODE_FSS charset, so it might already be solved in the process of
delivering them.
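
To make the concern concrete, here is a minimal sketch in Python. It is
not Firebird code: the names czech_sort_key and czech_compare, the
simplified Czech alphabet (primary letters only, no accented vowels), and
the fallback rule for unknown characters are all assumptions made for
illustration. The collation knows only its own letters; anything else is
placed after them in raw code point order.

```python
# Hypothetical sketch of a collation that works on Unicode strings but
# only "understands" a Czech subset. All names here are illustrative.

# Simplified Czech primary alphabet; note the digraph "ch" sorts after "h".
CZECH_LETTERS = (
    "a b c č d e f g h ch i j k l m n o p q r ř s š t u v w x y z ž".split()
)
RANK = {letter: i for i, letter in enumerate(CZECH_LETTERS)}


def czech_sort_key(text: str) -> tuple:
    """Generate a sort key; characters outside the alphabet fall back
    to code point order and sort after every known letter."""
    s = text.lower()
    key = []
    i = 0
    while i < len(s):
        # Greedily treat "ch" as a single collation element.
        if s.startswith("ch", i):
            token, i = "ch", i + 2
        else:
            token, i = s[i], i + 1
        if token in RANK:
            key.append((0, RANK[token]))      # known Czech letter
        else:
            key.append((1, ord(token)))       # assumed fallback: code point
    return tuple(key)


def czech_compare(a: str, b: str) -> int:
    """Compare two strings: -1, 0, or 1."""
    ka, kb = czech_sort_key(a), czech_sort_key(b)
    return (ka > kb) - (ka < kb)


words = ["chata", "cibule", "hrad", "čaj", "中文", "北京"]
sorted_words = sorted(words, key=czech_sort_key)
print(sorted_words)
```

With this fallback the Czech words come out in Czech order (cibule, čaj,
hrad, chata), while the two Chinese strings land at the end in code point
order, which has no meaning for a Chinese reader. A field bound to a
Czech charset could never have contained them in the first place.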

Just my 0.02c

Best regards
--Pavel