Subject Re: UTF-8 (various)
Author johnson_dave2003
Since I opened the can of worms, I guess I have to eat my share. :o)



--- In Firebird-Architect@yahoogroups.com, "Ann W. Harrison"
<aharrison@i...> wrote:
> johnson_dave2003 wrote:
> >
> >
> > In German, the character 'ß' (Eszett) is absolutely equivalent
> > to 'ss'. So, 'Fußball' and 'Fussball' should be seen as equal. The
> > uppercase rule for 'ß' is that it becomes 'SS', so uppercase
> > ('fußball') = 'FUSSBALL'
> >
>
> Fascinating. So, this query would return both these results:
>
> select name from sports where name = 'Fußball'
>
> Fußball
> Fussball

Unless name was the primary key of sports and its collation was DE_DE, in
which case only one of Fußball or Fussball could exist at a time.


>
> What is the correct handling strlen and the character 'ß' - is it one or
> two?

It is one character, meaning that string length is not an indicator of
inequality.

> And the result of this query?
>
>
> select name from sports
> where name = 'Fußball'
> and strlen (name) = strlen ('Fußball')
> and substring (name from 3 for 1) =
> substring ('Fußball' from 3 for 1)
>
>

Since ß is a single character, this query would return only Fußball.
However,

select name from sports
where name like 'Fuß%'

would return anything starting with Fuß or Fuss.
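
To make the three queries concrete, here is a minimal Python sketch. It
assumes str.casefold() as a stand-in for a DE_DE collation key (it maps
'ß' to 'ss', but it also ignores case, which a real collation might
not). The table contents and the names key, equal, strict and prefix are
made up for illustration.

```python
# Hypothetical contents of sports.name.
names = ["Fußball", "Fussball", "Fussel", "Handball"]
target = "Fußball"


def key(s):
    """Stand-in collation key: casefold() expands 'ß' to 'ss'."""
    return s.casefold()


# where name = 'Fußball'  (equality under the collation)
equal = [n for n in names if key(n) == key(target)]

# ... and strlen(name) = strlen('Fußball')
# ... and substring(name from 3 for 1) = substring('Fußball' from 3 for 1)
# SQL positions are 1-based, so "from 3 for 1" is the slice [2:3].
strict = [
    n for n in equal
    if len(n) == len(target) and n[2:3] == target[2:3]
]

# where name like 'Fuß%'  (prefix match under the collation)
prefix = [n for n in names if key(n).startswith(key("Fuß"))]

print(equal)   # ['Fußball', 'Fussball']
print(strict)  # ['Fußball']
print(prefix)  # ['Fußball', 'Fussball', 'Fussel']
```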

> And then, how are double letters that collate as a single letter ('ll'
> in Spanish) handled in strlen and substring?

They are two letters to strlen, and each letter is individual to
substring.
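
A rough Python sketch of that distinction, assuming the traditional
Spanish ordering in which 'ch' and 'll' each collate as one element.
The names ALPHABET, elements and spanish_key are invented for the
example, and accented vowels are not handled.

```python
# Traditional Spanish alphabet: 'ch' follows 'c', 'll' follows 'l'.
ALPHABET = ["a", "b", "c", "ch", "d", "e", "f", "g", "h", "i", "j", "k",
            "l", "ll", "m", "n", "ñ", "o", "p", "q", "r", "s", "t", "u",
            "v", "w", "x", "y", "z"]
WEIGHT = {element: i for i, element in enumerate(ALPHABET)}


def elements(word):
    """Split a word into collation elements, pairing up 'ch' and 'll'."""
    w = word.lower()
    out = []
    i = 0
    while i < len(w):
        if w[i:i + 2] in ("ch", "ll"):
            out.append(w[i:i + 2])
            i += 2
        else:
            out.append(w[i])
            i += 1
    return out


def spanish_key(word):
    """Sort key: one weight per collation element; unknowns sort last."""
    return [WEIGHT.get(e, len(ALPHABET)) for e in elements(word)]


print(len("llama"))       # 5: strlen still counts two letters for 'll'
print("llama"[0:1])       # l: substring still sees individual letters
print(elements("llama"))  # ['ll', 'a', 'm', 'a']
print(sorted(["luz", "llama", "lima"], key=spanish_key))
# ['lima', 'luz', 'llama']
```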


> And, in collations that
> don't accent upper case letters, is it OK that lower(upper('à')) <> 'à'?

It has to be okay.
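
A small Python sketch of why the round trip cannot be guaranteed.
upper_without_accents is a hypothetical upper() for a collation that
does not accent capitals; it is not taken from any existing engine.

```python
import unicodedata


def upper_without_accents(s):
    """Hypothetical upper(): uppercase, then drop combining accents."""
    decomposed = unicodedata.normalize("NFD", s.upper())
    stripped = "".join(c for c in decomposed
                       if not unicodedata.combining(c))
    return unicodedata.normalize("NFC", stripped)


# The accent is lost on the way up and cannot be restored on the way down.
round_trip = upper_without_accents("à").lower()
print(round_trip)           # a
print(round_trip == "à")    # False

# The same thing happens with 'ß' under ordinary Unicode case mapping.
print("ß".upper().lower())  # ss
```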

Don't forget that what is an accented character in one language may be
its own character in another. The Å character, for example, is an
accented character in many languages, but is a character in its own
right in Norwegian and Swedish.

And let's not leave out flavors of Arabic, where many single letters
have differing forms depending on whether they are alone, or when they
are not alone what letters precede or follow them.

This is why I feel that following the Java model for isolating
Collations in classes descending from a single abstract root is
probably the best way to implement collations.
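
Roughly what I have in mind, sketched in Python rather than Java. The
class names Collation, CodePointCollation and GermanCollation are
invented for illustration, and the German rule shown covers only the
'ß' expansion.

```python
from abc import ABC, abstractmethod


class Collation(ABC):
    """Abstract root: subclasses only have to supply a sort key."""

    @abstractmethod
    def sort_key(self, text):
        """Return a value that compares in collation order."""

    def compare(self, a, b):
        """Return -1, 0 or 1, like a classic comparator."""
        ka, kb = self.sort_key(a), self.sort_key(b)
        return (ka > kb) - (ka < kb)

    def equal(self, a, b):
        return self.compare(a, b) == 0


class CodePointCollation(Collation):
    """The easy case: order by raw code point."""

    def sort_key(self, text):
        return text


class GermanCollation(Collation):
    """Assumed rule for the sketch: 'ß' collates as 'ss'."""

    def sort_key(self, text):
        return text.replace("ß", "ss")


print(CodePointCollation().equal("Fußball", "Fussball"))  # False
print(GermanCollation().equal("Fußball", "Fussball"))     # True
```

The core code would only ever talk to Collation, so adding a language
means adding a subclass.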

Easy collations (English and those based on ISO-8859 character sets)
are easy to encode in UTF-8 because the first 256 Unicode code points
correspond to the ISO-8859-1 alphabet. Existing code can be largely
lifted and re-encapsulated. More complex collations can be added as
needed and as time allows without touching the core code.
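
A quick Python check of the ISO-8859-1 correspondence, using only the
standard codecs. The helper name latin1_to_utf8 is made up. Note that
the match is on code point values: bytes above 0x7F still become two
bytes once encoded as UTF-8.

```python
# Every ISO-8859-1 byte value decodes to the code point of the same value.
text = bytes(range(256)).decode("latin-1")
same_values = [ord(c) for c in text] == list(range(256))
print(same_values)  # True


def latin1_to_utf8(data):
    """Re-encode ISO-8859-1 bytes as UTF-8 (hypothetical helper)."""
    return data.decode("latin-1").encode("utf-8")


# 'ß' is byte 0xDF in ISO-8859-1 and code point U+00DF in Unicode,
# but its UTF-8 encoding takes two bytes.
print(latin1_to_utf8(b"\xdf"))  # b'\xc3\x9f'
print(latin1_to_utf8(b"Fu\xdfball").decode("utf-8"))  # Fußball
```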