Subject | Re: UTF-8 (various) |
---|---|
Author | johnson_dave2003 |
Post date | 2005-03-05T03:11:51Z |
Since I opened the can of worms, I guess I have to eat my share. :o)
--- In Firebird-Architect@yahoogroups.com, "Ann W. Harrison"
<aharrison@i...> wrote:
> johnson_dave2003 wrote:
> >
> > In german, the character 'ß' (esstet) is absolutely equivalent
> > to 'ss'. So, 'Fußball' and 'Fussball' should be seen as equal. The
> > uppercase rule for 'ß' is that it becomes 'SS', so uppercase
> > ('fußball') = 'FUSSBALL'
> >
>
> Fascinating. So, this query would return both these results:
>
> select name from sports where name = 'Fußball'
>
> Fußball
> Fussball

Unless sports was the primary key, and sports collation was DE_DE, in
which case only one of Fußball or Fussball could exist at a time.

> What is the correct handling strlen and the character 'ß' - is it
> one or two?

It is one character, meaning that string length is not an indicator of
inequality.

> And the result of this query?
>
> select name from sports
> where name = 'Fußball'
> and strlen (name) = strlen ('Fußball')
> and substring (name from 3 for 1) =
> substring ('Fußball' from 3 for 1)

Since ß is a single character, this query would return only Fußball.

However,

select name from sports
where name like 'Fuß%'

would return anything starting with Fuß or Fuss.

> And then, how are double letters that collate as a single letter ('ll'
> in Spanish) handled in strlen and substring?

They are two letters to strlen, and each letter is individual to
substring.

> And, in collations that
> don't accent upper case letters is is OK that lower(upper('à') <> 'à'?

It has to be okay.

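
To make the strlen, substring and upper/lower answers concrete, here is a small sketch in Java. It only shows how Java's own String class treats ß; it is not Firebird code, and the class and helper names (EszettDemo, charAt1) are made up for the illustration.

```java
import java.util.Locale;

class EszettDemo {
    // "Fußball" written with a Unicode escape so the source file encoding does not matter.
    static final String SHARP = "Fu\u00DFball";
    static final String PLAIN = "Fussball";

    // Rough equivalent of SQL substring(s from pos for 1), counting from 1 (hypothetical helper).
    static String charAt1(String s, int pos) {
        return s.substring(pos - 1, pos);
    }

    public static void main(String[] args) {
        // The two spellings differ in length, so length says nothing about collation equality.
        System.out.println(SHARP.length());  // 7
        System.out.println(PLAIN.length());  // 8

        // Position 3 holds the whole ß in one string and a single s in the other.
        System.out.println(charAt1(SHARP, 3).equals(charAt1(PLAIN, 3))); // false

        // Uppercasing expands ß to SS, so lower(upper(x)) does not give x back.
        String upper = SHARP.toUpperCase(Locale.GERMAN);
        System.out.println(upper);                            // FUSSBALL
        System.out.println(upper.toLowerCase(Locale.GERMAN)); // fussball
    }
}
```

The last two lines are the same effect as lower(upper('à')) <> 'à': once the case mapping throws information away, the round trip cannot restore it.
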
Don't forget that what is an accented character in one language may be
its own character in another. The Å character, for example, is an
accented character in many languages, but is a character in its own
right in Norwegian and Swedish.
And let's not leave out flavors of Arabic, where many single letters
have differing forms depending on whether they stand alone or, when
they do not, on which letters precede or follow them.
This is why I feel that following the Java model for isolating
Collations in classes descending from a single abstract root is
probably the best way to implement collations.
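
A minimal sketch of what I mean, assuming each collation only has to supply a sort key. Every name here (Collation, sortKey, GermanCollation and so on) is hypothetical, and the rules are toys: the German one knows only ß = ss, the Swedish one only lowercase å, ä, ö. A real collation would need full tables.

```java
// Hypothetical abstract root; these names come neither from Firebird nor from java.text.
abstract class Collation {
    // Map a string to a key whose plain code-unit order is the collation order.
    abstract String sortKey(String s);

    int compare(String a, String b) {
        return sortKey(a).compareTo(sortKey(b));
    }

    boolean equal(String a, String b) {
        return compare(a, b) == 0;
    }

    // Rough equivalent of LIKE 'prefix%' evaluated under this collation.
    boolean startsWith(String value, String prefix) {
        return sortKey(value).startsWith(sortKey(prefix));
    }
}

// Binary collation: the key is the string itself.
class BinaryCollation extends Collation {
    @Override
    String sortKey(String s) { return s; }
}

// Toy German collation: ß is treated as exactly "ss"; nothing else is handled.
class GermanCollation extends Collation {
    @Override
    String sortKey(String s) { return s.replace("\u00DF", "ss"); }
}

// Toy Swedish collation (lowercase only): å, ä, ö are letters of their own after z.
class SwedishCollation extends Collation {
    @Override
    String sortKey(String s) {
        // '{', '|', '}' are the three ASCII characters right after 'z'.
        return s.replace('\u00E5', '{').replace('\u00E4', '|').replace('\u00F6', '}');
    }
}

// Toy collation for a language where å is just an accented a.
class AccentFoldingCollation extends Collation {
    @Override
    String sortKey(String s) { return s.replace('\u00E5', 'a'); }
}

class CollationDemo {
    public static void main(String[] args) {
        String sharp = "Fu\u00DFball"; // "Fußball"
        Collation de = new GermanCollation();
        System.out.println(de.equal(sharp, "Fussball"));                    // true
        System.out.println(new BinaryCollation().equal(sharp, "Fussball")); // false
        System.out.println(de.startsWith("Fussball", "Fu\u00DF"));          // true
        System.out.println(new SwedishCollation().compare("\u00E5l", "zon") > 0);       // true
        System.out.println(new AccentFoldingCollation().compare("\u00E5l", "zon") < 0); // true
    }
}
```

The core code would only ever see the abstract root. A unique index built on sortKey would also give the primary key behavior described above: under the German key, Fußball and Fussball collide.
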
Easy collations (English and those based on ISO-8859 character sets)
are easy to encode in UTF-8 because the first 256 Unicode code points
correspond to the ISO-8859-1 character set. Existing code can be largely
lifted and re-encapsulated. More complex collations can be added as
needed and as time allows without touching the core code.
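
One caveat, sketched in Java below: the correspondence holds for code point numbers, not for bytes. Characters above 0x7F take two bytes in UTF-8, so existing ISO-8859-1 tables would presumably be indexed by decoded code point rather than by raw byte. The class and method names are made up for the illustration.

```java
import java.nio.charset.StandardCharsets;

class Latin1Utf8Demo {
    // True if decoding every ISO-8859-1 byte yields the code point with the same number.
    static boolean latin1MatchesUnicode() {
        for (int b = 0; b < 256; b++) {
            String s = new String(new byte[] {(byte) b}, StandardCharsets.ISO_8859_1);
            if (s.length() != 1 || s.charAt(0) != b) {
                return false;
            }
        }
        return true;
    }

    // Number of bytes UTF-8 needs for a single code point.
    static int utf8Length(int codePoint) {
        return new String(Character.toChars(codePoint)).getBytes(StandardCharsets.UTF_8).length;
    }

    public static void main(String[] args) {
        System.out.println(latin1MatchesUnicode()); // true
        System.out.println(utf8Length('A'));        // 1
        System.out.println(utf8Length(0x00DF));     // 2 (ß)
    }
}
```
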