firebird-architect - Re: [Firebird-Architect] UTF-8 and Compression

Subject	Re: [Firebird-Architect] UTF-8 and Compression
Author	Olivier Mascia
Post date	2005-03-02T08:46:14Z

On 2 March 2005, at 08:07, Claudio Valderrama C. wrote:

> Olivier Mascia wrote:
>
>> So a simplified view of what a collation table would be is a simple
>> map of some unicode values to some other binary value. To sort
>> according to the collation, the sort of the characters is done based
>> on their unicode value passed through that map. Any character not
>> considered by the collation is not in the map and sorts according to
>> its original unicode value.
>
> Nice simplified view for a discussion..

Yes, Claudio, and that post of mine was nothing more than a simplified
view. I don't want to assign a single weight to each character, as you
go on to assume. I'm not that extreme! ;-) See my other posts.
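
For reference, the simplified view quoted above could be sketched
roughly as follows. This is only an illustration in Python, not a
proposal for the engine: the names SPANISH_LIKE_MAP and simple_key are
hypothetical, and the single entry in the map is an assumption chosen
to show the idea.

```python
# Deliberately over-simplified "collation table": a map from some
# Unicode code points to some other sort value. Hypothetical example
# data: place 'ñ' between 'n' and 'o'.
SPANISH_LIKE_MAP = {
    ord("ñ"): ord("n") + 0.5,
}


def simple_key(text, collation_map=SPANISH_LIKE_MAP):
    """Return a sort key: mapped value if present, else the code point."""
    return [collation_map.get(ord(ch), ord(ch)) for ch in text]


words = ["oso", "ñandú", "nube"]

# Plain code point order puts 'ñ' (U+00F1) after every ASCII letter.
by_code_point = sorted(words)

# Passing each character through the map moves 'ñ' right after 'n'.
by_simple_map = sorted(words, key=simple_key)
```

A single value per character like this cannot express accents or case
as secondary differences, which is exactly why it is only a simplified
view.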

> I think that English's a+e (like Encyclopædia and Hændel) is
> deprecated, so people in the US and UK are happy with the first 127
> positions in the ASCII table. But I find surprising that a French
> speaker wants to assign a single weight to a character. Does your
> weight consider acute, grave and circumflex accent?

Unfortunately, you build a complete argument (a very interesting one,
by the way) solely on an over-simplified earlier statement of mine. To
keep the discussion short, and because I am flying out in 12 hours and
may not be able to follow up over the next 3 to 5 days: rest assured,
Claudio, that I recognize the many language-specific rules, including
those of French. The one important thing is the initial informal
proposal to use and store a single character set in the engine, namely
Unicode, encoded as UTF-8. I just wanted to stress that, from then on,
a collation becomes conceptually simpler because we have one common
character set representation as its basis.

> Are you saying that a single comparison satisfies ordering at all?

Absolutely not. But I think my point has been clarified by now.

> What will happen to an indexed string field that's stored always in
> unicode (as proposed) but whose charset is declared iso8859_1 and the
> collation is Spanish? Will the engine create two indices?

Why would you need two indices?
You declared your column to store iso8859_1 with the Spanish collation.
The database system will expect you to enter only valid iso8859_1
strings, will store them as UTF-8, and will never give you anything
other than iso8859_1 on output for that column, since this is what was
requested in the DDL.
As for sorting or indexing that column, the database system will index
and sort according to the Spanish collation, since this is what you
asked for. The only difference is that the string will not be stored
and moved around inside the database system as-is in iso8859_1, but
re-encoded in UTF-8. Inside the engine you only have _one_ charset to
deal with, and many collations, instead of _many_ charsets and _many_
collations. That simplification lets us concentrate on making the
collations themselves more correct.
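
To make the model concrete, here is a minimal sketch in Python of the
behaviour described above. It is not engine code: the Column class, its
method names, and spanish_like_key are all hypothetical, and the
two-level key is only a rough stand-in for a real Spanish collation.

```python
import unicodedata


def spanish_like_key(text):
    """Rough two-level key (an assumption, not a real Spanish collation).

    Level 1 ignores case and accents but keeps 'ñ' as its own letter
    after 'n'. Level 2 is the original string, used as a tie-breaker.
    """
    primary = []
    for ch in text.lower():
        if ch == "ñ":
            primary.append(ord("n") + 0.5)
        else:
            # NFD splits 'á' into 'a' + combining accent; keep the base.
            primary.append(ord(unicodedata.normalize("NFD", ch)[0]))
    return (primary, text)


class Column:
    """Hypothetical column: declared charset outside, UTF-8 inside."""

    def __init__(self, declared_charset, collation_key):
        self.declared_charset = declared_charset
        self.collation_key = collation_key
        self._rows = []  # always UTF-8 bytes, whatever was declared

    def insert(self, raw):
        # Input arrives in the declared charset and is re-encoded.
        text = raw.decode(self.declared_charset)
        self._rows.append(text.encode("utf-8"))

    def fetch_sorted(self):
        # One ordering, driven by the collation, over one internal
        # charset; output goes back to the declared charset.
        texts = sorted((r.decode("utf-8") for r in self._rows),
                       key=self.collation_key)
        return [t.encode(self.declared_charset) for t in texts]


col = Column("iso8859_1", spanish_like_key)
for word in ["oso", "ñandú", "nube", "árbol", "zorro"]:
    col.insert(word.encode("iso8859_1"))

result = [b.decode("iso8859_1") for b in col.fetch_sorted()]
```

The point of the sketch is that the declared charset only matters in
insert and fetch_sorted, at the boundary. The ordering logic sees a
single representation, so one index per column is enough.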

--
Olivier Mascia