Subject Re: [firebird-support] Writing UTF16 to the database
Author Brad Pepers
Lester Caine wrote:
> OK from discussions elsewhere ...
> UTF-8 can take upto 6 bytes to store data, UTF-16 needs 2 or 4 bytes to
> store the SAME data. The difference being that the bytes used to flag
> the need for an EXTRA byte are rather wasteful in UTF-8.
> The RAW data is only 21/22 bits so CAN be stored in three bytes. UTF-32
> uses four bytes, but the fourth byte is always empty, so internal
> storage as three bytes does not cause any problems.
> The FSS comes from File System Safe and just means that '00' bytes that
> would form part of a UTF-8 sequence are removed so that '00' can be used
> as the final byte of the string.

UTF-8 should never need more than 4 bytes per character, since 4 bytes
of UTF-8 gives you 21 bits of Unicode, which covers every code point
that is ever supposed to be defined. UTF-8 also never has '00' bytes
inside a multi-byte sequence, contrary to what was mentioned above; it
is UTF-16 and UTF-32 that will have '00' bytes appearing in the
sequence. And UTF-8 is entirely byte oriented, so it doesn't suffer
from the endianness concerns that UTF-16 and UTF-32 do.
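
As a quick way to check these claims, here is a minimal Python sketch.
The helper names (utf8_length, has_zero_byte) and the sample characters
are my own choices for illustration, not anything from Firebird:

```python
# Sample code points spanning the 1, 2, 3 and 4 byte UTF-8 ranges.
samples = {
    "U+0041": "A",            # ASCII letter
    "U+00E9": "\u00e9",       # e with acute accent
    "U+20AC": "\u20ac",       # euro sign
    "U+10FFFF": "\U0010ffff", # highest code point Unicode allows
}

def utf8_length(ch: str) -> int:
    """Number of bytes UTF-8 needs for the given string."""
    return len(ch.encode("utf-8"))

def has_zero_byte(data: bytes) -> bool:
    """True if the encoded data contains a 0x00 byte."""
    return 0 in data

for name, ch in samples.items():
    print(name,
          "utf-8 bytes:", utf8_length(ch),
          "zero byte in utf-8:", has_zero_byte(ch.encode("utf-8")),
          "zero byte in utf-16:", has_zero_byte(ch.encode("utf-16-le")))

# Byte order matters for UTF-16 but not for UTF-8.
print("A".encode("utf-16-le"), "A".encode("utf-16-be"))
```

Even the highest code point fits in 4 bytes of UTF-8, and the only way
to get a '00' byte in UTF-8 is to encode U+0000 itself.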

> So what SHOULD happen with UNICODE_FSS is that UTF-8, UTF-16 and UTF-32
> strings can be written into the internal buffer with out any loss of
> content, and UNICODE_FSS characters can then be converted back to UTF-8,
> UTF-16 and UTF-32 as required. SO what is perhaps missing is the client
> interfaces handling the string conversions.

There *is* no separate definition of UNICODE_FSS, is there? From what I
understand it's the old name for what is now called UTF-8, and it would
likely confuse people less to use the current name rather than the old
one that you will rarely find mentioned anymore.

> SINCE all UNICODE has therefore been reduced to it's simplest format,
> sorting and collations should be easy, but as yet only the simplest
> sorting is available. THAT is the area that the INTL developers are
> working on.

I'm not sure about "simplest format", and there is nothing simple about
sorting and collation with Unicode. For example, there are two
alternative forms for accented characters. For some there is a specific
precomposed Unicode character that carries the accent, but there is
also a more general way: combining marks that appear *after* a
character to say the previous character should be accented. So before
you do any collation or comparison you need to convert the Unicode
string (regardless of UTF-8, UTF-16, or UTF-32 encoding) into a
canonical form. There are also special joined characters that have to
be taken into consideration, like "ch" in Slovak, which sorts as a
single letter. Also, Unicode code points are *not* in any usable sort
order, so you have to decide the order of the characters. That depends
on the locale you are working in for some things, but for others there
is no clear order that I know of (like what's the ordering of all the
little symbols in Unicode, such as the little dagger?). And finally, in
some countries certain characters are considered equal while in others
they are not, which confounds a comparison operation.
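
To make the first two points concrete, here is a small Python sketch
using the standard unicodedata module. The canonically_equal helper and
the sample words are made up for illustration, and NFC is just one of
the normalization forms you could pick:

```python
import unicodedata

precomposed = "\u00e9"   # e-acute as one precomposed code point
decomposed = "e\u0301"   # plain 'e' followed by COMBINING ACUTE ACCENT

def canonically_equal(a: str, b: str) -> bool:
    """Compare two strings after converting both to the NFC form."""
    return unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b)

# The raw strings differ even though they display the same.
print(precomposed == decomposed)                   # False
print(canonically_equal(precomposed, decomposed))  # True

# Sorting by raw code point puts the accented word after 'zebra',
# which is not what a dictionary in most locales would do.
words = ["zebra", "\u00e9clair", "apple"]
print(sorted(words))  # ['apple', 'zebra', 'éclair']
```

Normalization only deals with the equivalent-forms problem. Getting a
sensible order still needs locale-specific collation rules on top.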

For a technical discussion of collation and comparison with unicode,
please see:

http://www.unicode.org/reports/tr10/tr10-12.html

--
Brad Pepers
brad@...