Subject: Re: UTF-8 (various)
Author johnson_dave2003
--- In Firebird-Architect@yahoogroups.com, Jim Starkey <jas@n...> wrote:
> The engine doesn't really have a locale. The server, which contains an
> engine, certainly doesn't have a locale. A locale, at best, is a
> property of a client or, more properly, a connection. The question is
> whether the simplicity and, perhaps, performance in the engine, server,
> and remote interface balances the increase in bytes transmitted over
> the wire.
>

I think Brad Pepers expressed my thoughts more clearly than I have.

Brad was correct in suggesting that the term Locale is not the right
one in this case. As he suggests, "Collation" is a much better
choice. My Java bias is showing: the class in Java that handles
these questions is called "Locale", and it also handles a number of
other localization issues (currency, preferred time format, etc.).
In the database engine we are only concerned with collation and
comparison issues. In the client we are concerned with localization.

I believe that character sets in the plural belong to the client side.
The engine should only have one character set (UTF-8), and the client
is responsible for all character set conversions outside of the UTF-8
paradigm.

If you specify a column to be ISO-8859-1 (for example), then the
metadata returned to the client should indicate that a conversion to
8859-1 is desired in the absence of other instructions. The client,
unless overridden by conversion explicitly defined by the application
or environment, should honor the request when the column data is
retrieved.
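To sketch that flow in Java (the class and method names here are my own
invention for illustration, not an actual client API): the client decodes
the UTF-8 wire data once, then re-encodes to the charset the column
metadata suggests, unless the application supplies an override.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical sketch of the proposed client behavior: decode UTF-8
// from the wire, then honor the column's declared charset unless the
// application has explicitly asked for a different one.
public class ClientConversion {
    // declaredCharset comes from column metadata; override may be null.
    public static byte[] toClientBytes(byte[] wireUtf8,
                                       String declaredCharset,
                                       String override) {
        String text = new String(wireUtf8, StandardCharsets.UTF_8);
        Charset target =
            Charset.forName(override != null ? override : declaredCharset);
        return text.getBytes(target);
    }
}
```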

A real-life example from my workplace: I have data that I believe is
most likely to be used as ISO-8859-1, so I declare it as such in the
table. Then someone whose system is operating with EBCDIC CP500 views
my data. Since both character sets can represent the same characters,
the conversion should be transparent: users should see the correct
characters displayed even though there is almost no intersection
between the two character sets in terms of binary values.
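The transparency claim can be demonstrated in a few lines of Java: the
same text encodes to entirely different bytes in ISO-8859-1 and EBCDIC
CP500, yet decoding each with its own charset recovers identical
characters. (This assumes the runtime ships the IBM500 charset, as
OpenJDK does via its extended-charsets module.)

```java
import java.nio.charset.Charset;
import java.util.Arrays;

// Sketch: same characters, different binary values. Decoding each
// encoding with its own charset yields the same text, even though the
// byte sequences share almost nothing.
public class RoundTrip {
    public static boolean sameTextDifferentBytes(String text) {
        Charset latin1 = Charset.forName("ISO-8859-1");
        Charset cp500  = Charset.forName("IBM500");
        byte[] latin1Bytes = text.getBytes(latin1);
        byte[] cp500Bytes  = text.getBytes(cp500);
        String backFromLatin1 = new String(latin1Bytes, latin1);
        String backFromCp500  = new String(cp500Bytes, cp500);
        // The bytes differ; the recovered characters do not.
        return backFromLatin1.equals(backFromCp500)
            && !Arrays.equals(latin1Bytes, cp500Bytes);
    }
}
```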

If the application needs a character set other than the one specified
in the DDL, then the client should be able to override the ISO-8859-1
conversion and use the appropriate conversion instead. In my
position, I often need to post data to or from an IBM mainframe in
the CP500 EBCDIC character set with EN_US collation. Having the
character set interpreted as a suggestion for the client to handle the
string data, rather than as a mandatory storage format, eliminates a
host of character conversions and effectively isolates the character
set from the collation.

Handled this way there is only ever one character set conversion - at
the client, in accordance with the client application's actual need of
the moment.

UTF-8 uses only as many bytes as necessary to represent each
character (between one and four). Under UTF-8, the bulk of European
languages can be represented on the wire with 90% or more of the
transmitted data in the form of single-byte characters.

But you can represent Chinese, Arabic, and mathematical symbols in
the same data stream with no special coding, or even awareness of what
they are. They are simply a byte stream. This means that
communication code is lightweight, simple, efficient, and isolated
from both engine and business logic.
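The byte economics are easy to check directly. A small Java sketch:
UTF-8 spends one byte on ASCII, two on most accented European letters,
and three on CJK characters, all in one undifferentiated byte stream
with no escape sequences or mode switches.

```java
import java.nio.charset.StandardCharsets;

// Illustrative sketch of UTF-8 byte costs per character class.
public class Utf8Lengths {
    public static int utf8Length(String s) {
        return s.getBytes(StandardCharsets.UTF_8).length;
    }
}
```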

I believe that the simplification in code that this model implies
would allow better tuning and maintenance, and would more than make up
for the small overhead of carrying multi-byte characters across the
wire. If nothing else, it would allow ample time to incorporate gzip
or lzh as an option in the communications channel.
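Gzip on the wire is cheap to sketch with the JDK's standard stream
classes (the class and method names below are illustrative, not part of
any Firebird wire protocol): compress a UTF-8 payload on one side and
recover it unchanged on the other.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

// Sketch of an optional gzip layer for the communications channel.
public class WireGzip {
    public static byte[] compress(String text) {
        try {
            ByteArrayOutputStream out = new ByteArrayOutputStream();
            try (GZIPOutputStream gz = new GZIPOutputStream(out)) {
                gz.write(text.getBytes(StandardCharsets.UTF_8));
            }
            return out.toByteArray();
        } catch (IOException e) {
            throw new RuntimeException(e); // in-memory streams won't throw
        }
    }

    public static String decompress(byte[] data) {
        try (GZIPInputStream gz =
                 new GZIPInputStream(new ByteArrayInputStream(data))) {
            return new String(gz.readAllBytes(), StandardCharsets.UTF_8);
        } catch (IOException e) {
            throw new RuntimeException(e);
        }
    }
}
```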

Since the engine needs to be able to respond to requests of the form
CREATE INDEX ... COLLATION XX_XX, and SELECT ... ORDER BY ...
COLLATION XX_XX, the engine needs to have the mechanism for
understanding collations, and collations are generally named in terms
of the locale that they apply to.
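Java's java.text.Collator illustrates what such a mechanism looks like
(as a stand-in for whatever the engine would implement, not a proposal
to embed it): a collation named by locale sorts "ä" near "a", where a
plain binary comparison would push it past "z".

```java
import java.text.Collator;
import java.util.Locale;

// Sketch of locale-named collation versus binary comparison.
public class CollationDemo {
    public static int compareWith(String localeTag, String a, String b) {
        Collator c = Collator.getInstance(Locale.forLanguageTag(localeTag));
        return c.compare(a, b);
    }
}
```

Under German collation "äpfel" sorts before "zebra"; by code-point
value it sorts after, since U+00E4 is greater than U+007A.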

I see collation rules being handled as follows, with the caveat that
this is very abstract: not every level may apply to Firebird, and
Firebird may have levels that this list does not properly represent:

1. Collation specified in query DML

2. Collation specified on index DDL

3. Collation specified on table DDL

4. Collation specified on schema

5. Collation specified on database

6. Collation specified in engine configuration

7. If no other collation is specified, default to binary
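The precedence above amounts to a simple fallback chain, sketched here
with illustrative level names (these are not actual Firebird
configuration keys): take the first level that specifies a collation,
and fall back to binary when none does.

```java
// Sketch of the seven-level precedence as a first-non-null resolution,
// ordered from most specific (query DML) to least (engine config).
public class CollationResolver {
    public static String resolve(String... levels) {
        for (String level : levels) {
            if (level != null) {
                return level;
            }
        }
        return "BINARY"; // rule 7: default when nothing is specified
    }
}
```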