firebird-architect - Re: [Firebird-Architect] Re: UTF-8 (various)

Subject	Re: [Firebird-Architect] Re: UTF-8 (various)
Author	Brad Pepers
Post date	2005-03-02T21:56:02Z

Jim Starkey wrote:

> David Johnson wrote:
>
>
>>In engine implementation, a Locale object will exist for every collation
>>that is supported. Subclasses of Locale will implement localized rules
>>for case conversion and string comparison.
>
> The engine doesn't really have a locale. The server, which contains an
> engine, certainly doesn't have a locale. A locale, at best, is a
> property of a client or, more properly, a connection. The question is
> whether the simplicity and, perhaps, performance in the engine, server,
> and remote interface balances the increase in bytes transmitted over the
> wire.

But Locale ~= Collation. Locale is a country and language and for the
most part that defines the collation sequence users want to see. Many
of the collations in Firebird right now have names like EN_US which is
just a locale. Perhaps there are some legacy collation sequences that
are not based on the country/language though but I don't know. Here is
a summary layout of things to try and make sense of it all:

Character Sets:
---------------

This is a mapping of characters in the real world to binary data (1 or
more bytes). Right now you can specify a default character set when you
create a database, a character set the client is using for data it
sends, and you can over-ride the character set for a specific column in
a table. What I think it should be in future is:

1. Use UTF-8 internally in the engine for all data.

2. For all interactions with the outside world from the engine, either
enforce that the data to/from the engine is in UTF-8 (thus forcing
the clients to manage character set translations) or else allow for
setting the character set to use and make the engine convert the
data to UTF-8. Interactions with the outside world come from the
client API's, external tables, UDF's, and likely more I don't know
about.

3. The default character set for the database and the character set on
a column act as a kind of domain declaration either causing an error
or just a warning when data it stored that is outside the character
set specified.

Collations:
-----------

This is a way to sort strings. It is done internally so it only has to
deal with the UTF-8 character set but it needs to know the locale in
order to sort properly. The locale is the country and language as per
the various ISO standards. As far as I know this is enough to get the
collation sequence right (ie: no case exists where one country/language
has more than one collation sequence).

Right now the collation sequence can be specified on a column or as part
of the order by in a select statement. I'm not sure this maps well to
the real world. I'm not sure why there isn't a higher level setting
than these ones. What collation sequence is used if I don't specify one
on the table columns or in the order by? If the internal data is all
UTF-8 then there doesn't seem to be a good default. You could try to
choose a default based on the character set assigned to the database but
thats not a very good match (whats the right collation sequence when
using ISO-8859-1 as a character set???). So I think this area needs to
be re-thought. I can see either having a default collation sequence at
the database level to go along with the character set or else have the
client pick a collation sequence when it connects (or both?). As a
matter of fact there is no really sane collation sequence if one isn't
specified at the database, client, or column level so one of these
should be manditory or else the engine has to perhaps pick to use its
own system locale to decide?

So you could have a collation at the database level or from the client
as well as on a table column or in the order by of a select statement
and the engine would have to pick one of these to use (perhaps falling
back on using its own system locale if none of the above are set). The
collation order would be specified by a country and language so it
would be a one-to-one mapping to a locale.

Ideally it might be nice to have the strings in the database stored in
UTF-8 and then allow the client to specify their locale and have things
sort properly for them. This way you could have an application running
in Switzerland and in Germany using the same data (names for instance)
and each seeing the data sorting in the right order for their locale.
This would play havoc with indexes though but it would be nice!

Comparison:
-----------

Comparison is how to decide if two strings are equal. Perhaps its part
of the collation above but its a more specific thing and its effected by
the change to UTF-8. Basically, like in the above, you need to know the
locale in order to compare two strings to see if they are equal. So
this also depends on the locale and would be effected as above where you
can set the locale at the database, client, column, or order by level.
I'm calling it locale here instead of collation since I think this
would be a better term to use since its the more general thing that is
being dealt with. Your locale determins your collation sequence and how
comparisons are done and at the client level should also determin things
like how dates and times are formatted. It could also effect the engine
side of things down the road. For instance if an insert statement is
setting a date column and is using a string then how is that string to
be interpreted as a valid date? You could just say it has to be in the
ISO standard of YYYY-MM-DD but you could also decide to allow the
engine to interpret it based on the locale.

So the changes would be:

1. Use UTF-8 internally for everything

2. Classes to convert from a character to/from UTF-8 (ideally run-time
loaded modules so this could be extended with new character sets in the
future).

3. Specify the character set for all engine to external world API's

4. Specify the character set on the database or column level which acts
as a domain to verify strings stored in the database

5. Add ability to set locale when creating a database and perhaps having
the engine default to using its own locale if one isn't specified

6. Possible ability for client to also specify its own locale

7. Specify a different locale on a table column to enforce how
collations and comparisons are done on that column

Notes:

1. If you have the classes to convert between character sets, should
this be done at the client end? So should all data flowing from/to the
engine be in UTF-8 and its up to the client to convert it? It could be
done with the API so that the data coming from the engine over the
network would be UTF-8 but it would be converted before getting to the
application but this means each client computer would have to have the
libraries for character set conversions.

2. I have no idea the effect all of this would have on indexes. From
what I understand now the index is binary data created by converting the
string into weighted values based on the collation sequence so that the
binary representation sorts properly. So what happens if there is a
mis-match between the collation sequences (locales)? If the client can
specify their own collation sequence, then the ones stored in the index
may not be right. Are indexes also used for equality comparisons and if
so this too could be a problem.

--
Brad Pepers
brad@...