firebird-architect - Re: UTF-8 over UTF-16

Subject	Re: UTF-8 over UTF-16
Author	David Johnson
Post date	2005-05-05T22:44:10Z

On the favorite Vogon-UTF encoding question ...

Neither the client nor the DBMS needs to be aware that they are dealing
with Vogon characters. They are simply characters so long as they
remain in the internal UTF-8 representation. Screen presentation is a
function of the operating environment. Mapping of UTF-8 to screen
presentation font is an issue for the GUI/termcaps tools.
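To illustrate the point above, here is a minimal Python sketch, not Firebird code. The store() and fetch() functions are hypothetical stand-ins for the DBMS and its client stub. They pass UTF-8 bytes through without ever asking which script the text is in.

```python
# Hypothetical stand-ins for the DBMS side: neither function inspects
# which script the text belongs to, it only moves UTF-8 bytes around.
def store(text: str) -> bytes:
    return text.encode("utf-8")

def fetch(raw: bytes) -> str:
    return raw.decode("utf-8")

# English, Arabic, and Russian samples (Arabic written as escapes so the
# logical character order is unambiguous in this listing).
samples = ["hello", "\u0645\u0631\u062d\u0628\u0627", "привет"]

stored = [store(s) for s in samples]
round_tripped = [fetch(b) for b in stored]

for original, raw in zip(samples, stored):
    print(len(original), "characters stored as", len(raw), "bytes")
```

Whether the result is drawn correctly is then up to the terminal or GUI that prints it.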

From the DBMS perspective, the only issue is whether we have a Vogon
collation for the Vogon characters. If not, we will use binary
collation until someone writes a Vogon localization dll/so. Mapping
UTF-8 to a display font is a function of the environment (operating
system or JVM).
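As a rough sketch of what the binary collation fallback means in practice (plain Python, word list made up for illustration): sorting by raw UTF-8 bytes gives the same order as sorting by Unicode code point, which is a property of the UTF-8 encoding itself.

```python
# Made-up sample data mixing Latin, accented Latin, Cyrillic, and Arabic.
words = ["zebra", "Zebra", "éclair", "apple", "жук", "\u0639\u0631\u0628"]

# Binary collation: compare the encoded bytes, nothing else.
by_bytes = sorted(words, key=lambda w: w.encode("utf-8"))

# Python compares str values code point by code point.
by_code_points = sorted(words)

print(by_bytes == by_code_points)
```

The order is stable and well defined, but it is not a linguistic order: "Zebra" sorts ahead of "apple" because uppercase Latin letters have lower code points.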

It is permissible and expected to have multiple collations concurrently
under any Unicode specification, provided none of the recognized code
points in the concurrent collations overlap. For example, it is
permissible to require all of EN_US, AR_IR, and VG_VG collations at the
same time because none have overlapping Unicode code points. Anything
that is recognized by the EN_US collation (Latin and some accented Latin
characters) would be handled in accordance with English (US) rules, any
Arabic data would be handled in accordance with the Arabic (Iran) rules,
and any Vogon data would be handled in accordance with Vogon (Vogon)
rules. All others would follow binary sort order rules.
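The dispatch described above could be sketched roughly as follows. Everything here is an assumption made for illustration: the code point ranges, the toy key functions (real EN_US and AR_IR rules are far richer), choosing the collation from the first character only, and placing Vogon in the Private Use Area.

```python
def en_us_key(text):
    # Toy stand-in for English rules: ignore case first, then break ties.
    return (text.casefold(), text)

def placeholder_key(text):
    # Placeholder for a real locale-specific key; plain code point order.
    return text

# Non-overlapping code point ranges, one per concurrent collation.
COLLATIONS = [
    (range(0x0000, 0x0250), en_us_key),        # Latin and accented Latin
    (range(0x0600, 0x0700), placeholder_key),  # Arabic block
    (range(0xE000, 0xE100), placeholder_key),  # hypothetical Vogon (Private Use)
]

def sort_key(text):
    first = ord(text[0]) if text else 0
    for rank, (code_points, key) in enumerate(COLLATIONS):
        if first in code_points:
            return (rank, key(text))
    # No collation claims this text: fall back to binary order.
    return (len(COLLATIONS), text.encode("utf-8"))

bab = "\u0628\u0627\u0628"          # Arabic word, logical order
salam = "\u0633\u0644\u0627\u0645"  # Arabic word, logical order
nihon = "\u65e5\u672c"              # CJK, covered by no collation here

rows = ["banana", "Cherry", salam, "apple", nihon, bab]
ordered = sorted(rows, key=sort_key)
print(ordered)
```

Because the ranges do not overlap, each value is claimed by at most one collation, which is what lets the collations coexist.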

Collations only need to take into account language and locale, not the
display font. Once again, the display font is a function of
the operating environment's mapping from UTF-8. Under Linux for example,
Arabic, Hindi, and Cyrillic text stored in current versions of Firebird
is correctly inserted, retrieved, and displayed by the ISQL tool in
gnome-terminal. I was impressed to note that the display order of the
characters in the Arabic string was actually corrected to be
right-to-left on output from ISQL under gnome-terminal.
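A small sketch of why the DBMS does not have to care about display direction (Python standard library only, sample word chosen for illustration): the stored string stays in logical order, and each character carries a Unicode bidirectional class that the display layer can use to lay the text out right-to-left.

```python
import unicodedata

# Arabic sample in logical (typing) order, written as escapes.
stored = "\u0633\u0644\u0627\u0645"

# The first stored character is the first one typed, whatever the screen shows.
logical_order = [unicodedata.name(ch) for ch in stored]

# "AL" is the bidirectional class of Arabic letters; Latin letters are "L".
bidi_classes = {unicodedata.bidirectional(ch) for ch in stored}

print(logical_order)
print(bidi_classes, unicodedata.bidirectional("a"))
```

Reordering for the screen is done by whatever renders the text, so nothing is rearranged in storage.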

If Vogon support were correctly installed on my machine (unlikely), it
would be supported already to the same extent that Unicode is.

For reference, I defined the columns in my test database as having no
character set and no collation. I mostly use Java now, so DBMS
collations are a convenience that I can survive without pending true
Unicode support, provided everything else works correctly.

Only the server "needs" to be aware of collations. So long as the
client application uses the environment's UTF-8 to native font mapping
capability, which in my experience usually means doing nothing, most of
these issues simply go away from the perspective of the DBMS and its
client stub.