Subject: Re: [Firebird-Architect] UTF-8 over UTF-16 WAS: Applications of Encoded Data Streams
Author: Brad Pepers
Ann W. Harrison wrote:
> Adriano dos Santos Fernandes wrote:
>
>>Why choose one side if all can be satisfied?
>>Some people are intelligent enough to set the character set of their
>>fields. ;-)
>>
>
> One of the assumptions in this discussion is that a future ODS will
> store exactly one character set, probably ODS-8. Fields and connections
> can specify other character sets and data will be converted on input and
> output to the specified representation. At the moment, we're facing
> an m * n problem of character sets and collations, which shows every
> sign of getting worse. Choosing one character set for storage reduces
> that to an m + n problem - m = supported character sets, n = collations.

I'm not sure what you mean by one character set for ODS-8. If you use
UTF-8 internally, it's not a character set; it's an encoding of Unicode,
which covers all characters. You still need n collations, since sorting
and equality are locale-dependent, but there is only one repertoire of
characters.

If you use UTF-8 for everything internally but allow the client using
the API to specify a different character set, then you need to convert
any text you receive into UTF-8 using the character set the client
specified, and you need to convert any text you send back into the
client's specified character set. The output conversion can fail: the
UTF-8 data stored in the database may not be mappable to the client's
character set, in which case you have to drop those Unicode characters,
substitute a replacement character, or raise an error. You could also
allow a character set to be specified for the database itself and reject
or filter out any data received from the client that isn't in that
character set. I don't think this is strictly needed, but it would be
good to have.
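A minimal sketch of the output-conversion problem described above, using Python's codec error handlers as a stand-in for the server's conversion layer. The function name and policy values are illustrative, not Firebird API:

```python
def to_client_charset(text: str, client_charset: str, policy: str = "strict") -> bytes:
    """Convert internally-stored Unicode text to the client's character set.

    policy: 'strict'  raises an error on unmappable characters,
            'ignore'  drops them,
            'replace' substitutes '?'.
    """
    return text.encode(client_charset, errors=policy)

# U+2013 EN DASH has no Latin-1 mapping, so the three policies differ:
stored = "naïve – café"
print(to_client_charset(stored, "latin-1", "replace"))  # b'na\xefve ? caf\xe9'
print(to_client_charset(stored, "latin-1", "ignore"))   # b'na\xefve  caf\xe9'
```

With `policy="strict"` the same call raises `UnicodeEncodeError`, which is the "raise an error" option above.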

> Since the database on disk uses the operating system rules for data
> representation (including endian-ness) perhaps a compromise on the UTF-8
> UTF-16 issue would be to use UTF-16 for storage and UTF-8 for transport.

I think your life would be a lot easier if you pick one internal Unicode
encoding rather than two, and I think UTF-8 is the better choice, since
it has no endianness issues or embedded null bytes to worry about. One
way or another, if you choose two different encodings internally you
will likely end up converting between them, and that extra encode or
decode is extra overhead.
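The two UTF-16 issues mentioned above are easy to demonstrate: the same string has two different UTF-16 byte sequences depending on byte order, and ASCII characters pick up null bytes, while UTF-8 has neither problem:

```python
s = "Abc"
print(s.encode("utf-8"))      # b'Abc' - identical to ASCII, no null bytes
print(s.encode("utf-16-le"))  # b'A\x00b\x00c\x00' - null byte after each char
print(s.encode("utf-16-be"))  # b'\x00A\x00b\x00c' - same text, bytes swapped
```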

My third-party opinion would be to use Unicode everywhere internally and
to pick one encoding for it (UTF-8 or UTF-16 being the most obvious
choices). You will then also need the ability for the client to specify
the character set of the text it expects to exchange; that could include
UTF-8 or UTF-16 themselves, and if the client's choice matches the
internal encoding, no encoding or decoding is required at all. You
should also allow a character set to be specified for a database, used
only as a filter that strips out characters the client shouldn't be
passing in. And finally, you need to be able to specify the locale in
order to do sorting and upper/lower case conversion properly.
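The database-level filter suggested above can be sketched as a round-trip through the database's declared character set, silently dropping anything outside it. The function name is invented for illustration:

```python
def filter_to_db_charset(text: str, db_charset: str) -> str:
    """Strip characters that are not representable in db_charset."""
    return text.encode(db_charset, errors="ignore").decode(db_charset)

# U+20AC EURO SIGN is not in Latin-1, so it is silently removed:
print(filter_to_db_charset("price: 10€", "latin-1"))  # 'price: 10'
```

Whether silent stripping or an error is the right behavior is a policy decision; the point is only that the check happens once, at the database boundary.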

From this I can see a CharacterSet base class that defines conversion of
a block of text to and from the internal Unicode encoding and returns
the name of the character set. Make its subclasses loadable modules so
new character sets can be added easily. The Locale portion is more
complex. A first design approximation is the same sort of class
hierarchy just described for CharacterSet, with methods to compare and
order two Unicode strings for a given locale and methods to convert a
Unicode string to uppercase or lowercase. I think that would all work,
but collation is more complicated than you might expect, so a review of
what the Unicode Consortium has to say on collation of Unicode strings
would be a good starting point, and there may even be existing code
around that does a lot of this. I'm using the ICU code from IBM
internally in my product to handle a lot of I18N and L10N issues.
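The CharacterSet hierarchy above might look something like the following sketch. All names here (CharacterSet, Latin1CharacterSet, the registry) are invented for illustration and are not Firebird or ICU API; in the loadable-module design, each module would call the register function at load time:

```python
from abc import ABC, abstractmethod

class CharacterSet(ABC):
    """Converts between a client character set and internal Unicode text."""

    @abstractmethod
    def name(self) -> str: ...

    @abstractmethod
    def to_internal(self, raw: bytes) -> str: ...

    @abstractmethod
    def from_internal(self, text: str) -> bytes: ...

class Latin1CharacterSet(CharacterSet):
    def name(self) -> str:
        return "ISO8859_1"

    def to_internal(self, raw: bytes) -> str:
        return raw.decode("latin-1")

    def from_internal(self, text: str) -> bytes:
        # Unmappable characters raise here; a real engine would let the
        # caller choose between raising, dropping, or substituting.
        return text.encode("latin-1")

# Loadable modules would register their character sets here at load time:
REGISTRY: dict[str, CharacterSet] = {}

def register(cs: CharacterSet) -> None:
    REGISTRY[cs.name()] = cs

register(Latin1CharacterSet())
print(REGISTRY["ISO8859_1"].to_internal(b"caf\xe9"))  # café
```

A Locale hierarchy would follow the same shape, with compare/order and upper/lower methods instead of byte conversions; in practice those would delegate to something like ICU's collators rather than be written from scratch.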

Just some thoughts...

--
Brad Pepers
brad@...