Subject: Re: [Firebird-Architect] UTF-8 over UTF-16 WAS: Applications of Encoded Data Streams
Author: Jim Starkey
Brad Pepers wrote:

>If you use UTF-8 for everything internally but allow the client using
>the API to specify a different charset, then you need to convert any
>text you receive into UTF-8 using the character set the client
>specified, and convert any text you send into the client's specified
>character set.
>
Yes, on the client side. In the scheme under discussion, all character
data is transferred as UTF-8.
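
For concreteness, the client-side boundary conversion amounts to
something like the sketch below. It uses POSIX iconv(3) purely for
illustration; the remote interface needn't use iconv, and the function
toUtf8() is invented for this example. (Note that some platforms
declare iconv's input pointer as const char**.)

    #include <iconv.h>
    #include <cerrno>
    #include <string>
    #include <stdexcept>

    // Convert a buffer in the client's declared charset to UTF-8.
    std::string toUtf8(const std::string& clientText,
                       const char* clientCharset)
    {
        iconv_t cd = iconv_open("UTF-8", clientCharset);
        if (cd == (iconv_t) -1)
            throw std::runtime_error("no module for client charset");

        std::string out;
        char* in = const_cast<char*>(clientText.data());
        size_t inLeft = clientText.size();

        while (inLeft > 0)
        {
            char buffer[1024];
            char* outPtr = buffer;
            size_t outLeft = sizeof(buffer);

            // E2BIG just means the output buffer filled; anything
            // else is a byte sequence with no Unicode mapping.
            if (iconv(cd, &in, &inLeft, &outPtr, &outLeft) == (size_t) -1
                && errno != E2BIG)
            {
                iconv_close(cd);
                throw std::runtime_error("unmappable character");
            }
            out.append(buffer, sizeof(buffer) - outLeft);
        }

        iconv_close(cd);
        return out;
    }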

>In doing this you may have a problem with the output conversion,
>since the UTF-8 data stored in the database may not be mappable to
>the client's specified character set, in which case you need to drop
>those Unicode characters, raise an error, or something.
>
OK, this seems like an interesting rat hole. Let's go down it. In
honor of the movie, let's call the character set "vogon".

For anything to work, the client side remote interface must be able to
load an international module that handles "vogon" when the client
declares its locale. The module, by definition, must contain a
bidirectional mapping between vogon and Unicode. So the question comes
down to what happens when vogon has a character without a Unicode code
point. The correct long term solution is that the vogons visit the
Unicode Consortium and read them poetry until the Unicode Consortium
blesses the required code points. If, for reasons of desperation and/or
expediency, this isn't feasible, the implementers unofficially
expropriate some unrelated code point and just use it. The database
doesn't give a damn about what the character might mean. It just stores
it and gives it back. If somebody wants to implement a collation
(server side) for vogon, it uses the expropriated code points and vogon
rules.
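
To illustrate, a vogon module's table might look something like the
sketch below. The entries are invented; the Private Use Area
(U+E000..U+F8FF) is the conventional range to expropriate code points
from when no official ones exist:

    #include <cstdint>
    #include <map>

    // vogon byte -> Unicode code point (invented for illustration)
    static const std::map<std::uint8_t, char32_t> vogonToUnicode = {
        { 0x41, U'A' },    // overlap ASCII where the sets agree
        { 0x80, 0xE000 },  // "poetry mark": no official code point,
                           // so a Private Use Area code point is
                           // expropriated instead
        { 0x81, 0xE001 },  // "bypass permit glyph": likewise
    };

    char32_t vogonCodePoint(std::uint8_t b)
    {
        auto it = vogonToUnicode.find(b);
        return it != vogonToUnicode.end()
            ? it->second
            : char32_t(0xFFFD);  // U+FFFD REPLACEMENT CHARACTER
    }

The engine never interprets U+E000; it stores it and gives it back,
exactly as described above.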

UTF-8 is just a mapping from Unicode code points into what boring
people call octets to justify their existence. Unicode is what we're
really talking about.
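
Here is the entire trick written out as a minimal encoder, just to
show how little there is to it (illustrative only; it doesn't reject
surrogates or out-of-range input):

    #include <string>

    // Map one Unicode code point to one to four octets of UTF-8.
    std::string encodeUtf8(char32_t cp)
    {
        std::string s;
        if (cp < 0x80)
            s += char(cp);                         // 0xxxxxxx
        else if (cp < 0x800)
        {
            s += char(0xC0 | (cp >> 6));           // 110xxxxx
            s += char(0x80 | (cp & 0x3F));         // 10xxxxxx
        }
        else if (cp < 0x10000)
        {
            s += char(0xE0 | (cp >> 12));          // 1110xxxx
            s += char(0x80 | ((cp >> 6) & 0x3F));
            s += char(0x80 | (cp & 0x3F));
        }
        else
        {
            s += char(0xF0 | (cp >> 18));          // 11110xxx
            s += char(0x80 | ((cp >> 12) & 0x3F));
            s += char(0x80 | ((cp >> 6) & 0x3F));
            s += char(0x80 | (cp & 0x3F));
        }
        return s;
    }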

> You
>could also allow specification of the character set the database
>should be using, and ignore any data received from the client that
>isn't in this character set. I don't think this is strictly needed,
>but it would be good to do.
>
Nope, not good at all. But it isn't Firebird's problem. It's a problem
for the designer of the internationalization module.

>
>>Since the database on disk uses the operating system rules for data
>>representation (including endian-ness), perhaps a compromise on the
>>UTF-8/UTF-16 issue would be to use UTF-16 for storage and UTF-8 for
>>transport.
>>
>
>I think your life would be a lot easier if you pick one internal
>encoding for Unicode rather than two, and I think that UTF-8 is the
>better choice since it doesn't have endian issues or null bytes to
>worry about. One way or another, if you choose two different encodings
>internally, you will likely end up having to convert between them,
>which adds an extra encode or decode: that's extra overhead.
>
Yes, indeed. Depending on language, UTF-8 may be twice as efficient as
UTF-16 (for ASCII, specifically) or slightly worse. Averaged across
North America, South America, Europe, Australia, Antarctica, and
Greenland, UTF-8 is much better. In the final analysis, however,
density isn't the controlling issue; code simplicity is. UTF-8 wins
hands down.
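
The density argument in miniature: the byte counts below are exact
consequences of the encoding rules, not measurements (the program
itself is purely illustrative):

    #include <cstdio>

    // Bytes per code point under each encoding (BMP only; both
    // encodings take 4 bytes for supplementary planes).
    int utf8Bytes(char32_t cp)
    {
        return cp < 0x80 ? 1 : cp < 0x800 ? 2 : cp < 0x10000 ? 3 : 4;
    }

    int main()
    {
        struct { const char* what; char32_t cp; } samples[] = {
            { "A      (ASCII: most SQL and metadata)", U'A'   },
            { "U+00E9 (Latin-1 accents)",              0x00E9 },
            { "U+0416 (Cyrillic)",                     0x0416 },
            { "U+4E2D (CJK ideograph)",                0x4E2D },
        };
        for (const auto& s : samples)
            std::printf("%-40s utf-8: %d byte(s), utf-16: 2 bytes\n",
                        s.what, utf8Bytes(s.cp));
        return 0;
    }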

>My third-party opinion would be to use Unicode everywhere internally
>and to pick one encoding to use for it (UTF-8 or UTF-16 being the most
>obvious choices). Then you will also need the ability for the client
>to specify the character set of the text it's expecting to use (which
>could include setting it to UTF-8 or UTF-16; if that matched the
>encoding used internally, no encoding or decoding would be required).
>You should also allow specification of the character set for a
>database, which would only be used as a filter on text received from
>the client to strip out characters it shouldn't be passing in. And
>finally you need to be able to specify the locale in order to do
>sorting and upper/lower case conversions properly.
>
Interesting question. Ignoring history, should the collating sequence
be controlled by client locale, data declaration, or explicit
declaration? And if all of the above, in what order? Secondary
question: how does locale indicate collating sequence? Implied by
character set? Made explicit?
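
As a strawman rather than a proposal, one plausible precedence could
be sketched like this: explicit declaration wins, then the data
declaration, then a default derived from the client locale. All the
names here are invented:

    #include <optional>
    #include <string>

    std::string resolveCollation(
        const std::optional<std::string>& explicitCollate, // e.g. COLLATE vogon
        const std::optional<std::string>& columnCollation, // data declaration
        const std::string& localeDefault)                  // from client locale
    {
        if (explicitCollate)
            return *explicitCollate;
        if (columnCollation)
            return *columnCollation;
        return localeDefault;
    }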

I don't know about you, but after three parties, my opinions on
character sets and collations are worth very little.

>
> From this I can see a CharacterSet base class that defines conversion
>of a block of text from/to the internal Unicode encoding and returns
>the name of the character set. Make subclasses of this be loadable
>modules so you can add new character sets easily. The Locale portion
>is more complex. The first design approximation is to have the same
>sort of class hierarchy as just mentioned for CharacterSet, with
>methods to compare two Unicode strings and order them for a given
>locale, and methods to convert a Unicode string to uppercase or
>lowercase. I think that would all work, but there are some good
>Unicode docs on this whole issue, since it's more complicated than you
>might think. So a review of what the Unicode Consortium has to say on
>collation of Unicode strings would be a good starting point, and there
>may even be some code around to do a lot of this. I'm using the ICU
>code from IBM internally in my product to handle a lot of I18N and
>L10N issues.
>
Thoughts are nice, but a definitive solution would be greatly appreciated...
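
Still, as a concrete starting point, the shape Brad describes might
look roughly like this. The class and method names are invented for
illustration and are not Firebird interfaces; in practice the Locale
methods would presumably delegate to ICU (ucol_strcoll, u_strToUpper,
and friends):

    #include <string>

    // Loadable conversion module: client charset <-> internal UTF-8.
    class CharacterSet
    {
    public:
        virtual ~CharacterSet() {}
        virtual const char* name() const = 0;  // e.g. "vogon"
        virtual std::string toUtf8(const std::string& raw) const = 0;
        virtual std::string fromUtf8(const std::string& utf8) const = 0;
    };

    // Locale-sensitive operations on the internal encoding.
    class Locale
    {
    public:
        virtual ~Locale() {}
        virtual int compare(const std::string& a,
                            const std::string& b) const = 0;
        virtual std::string toUpper(const std::string& utf8) const = 0;
        virtual std::string toLower(const std::string& utf8) const = 0;
    };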
