firebird-architect - Re: [Firebird-Architect] Re: [firebird-support] Writing UTF16 to the database

Subject	Re: [Firebird-Architect] Re: [firebird-support] Writing UTF16 to the database
Author	Olivier Mascia
Post date	2005-02-28T09:41:33Z

On 24 Feb 2005, at 07:14, Dimitry Sibiryakov wrote:

> If I meant UTF-32, I'd say UTF-32. IMHO, the main problem with all
> versions of UTF encodings is their variable character length. No?
> If anyone can name a language that (s)he would like to store in FB
> but that is not represented in UCS-2, we can consider using UCS-4.

That would be a dumb choice (to opt for UCS-4). 32 bits per character
is more than needed. Unicode needs more than 16 bits, but fewer than
24 bits (code points run up to U+10FFFF, which fits in 21 bits).
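
As a quick check of that range, here is a minimal Python sketch (Python
is used purely for illustration; the names are made up for this example):

```python
import sys

# Highest code point defined by the Unicode standard.
MAX_CODE_POINT = 0x10FFFF

# CPython 3.3+ reports the same limit.
assert sys.maxunicode == MAX_CODE_POINT

# Number of bits needed to hold the largest code point.
bits_needed = MAX_CODE_POINT.bit_length()
print(bits_needed)  # 21
```

So a fixed-width encoding needs 21 bits, which in practice gets rounded
up to 32.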

16 bits per character is not enough unless you actually implement
UTF-16 correctly, which *is* a variable-length encoding. UTF-16 can use
two 16-bit words (a surrogate pair) to represent a single character.
This offsets the advantage of using a pure 16-bit encoding. What's
more, UTF-8 has other big advantages over all other solutions.
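
To make the variable-length point concrete, a small Python sketch (the
helper name utf16_units is hypothetical, chosen for this illustration)
that lists the 16-bit code units UTF-16 produces for one character:

```python
import struct

def utf16_units(ch):
    """Return the 16-bit code units of ch in UTF-16 (big-endian, no BOM)."""
    data = ch.encode("utf-16-be")
    return list(struct.unpack(">%dH" % (len(data) // 2), data))

bmp = utf16_units("\u00e9")       # U+00E9, inside the 16-bit range
supp = utf16_units("\U0001D11E")  # U+1D11E, outside the 16-bit range

print([hex(u) for u in bmp])   # ['0xe9']
print([hex(u) for u in supp])  # ['0xd834', '0xdd1e']
```

One character, two 16-bit words: any code that assumes "one word = one
character" breaks on the second case.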

1) There is NO zero byte inside a UTF-8 string. Unless the character
you want to store is actually NUL, no zero byte appears in the one-,
two-, three- or four-byte representation of any character.

2) Being a stream of bytes, UTF-8 has NO endianness issues.
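
Both points can be checked with a short Python sketch (the sample text
is arbitrary and only serves as an illustration):

```python
text = "Zo\u00eb \u65e5\u672c"  # Latin, accented Latin, and two ideographs

utf8 = text.encode("utf-8")
utf16_le = text.encode("utf-16-le")
utf16_be = text.encode("utf-16-be")

# 1) No zero byte in the UTF-8 form; UTF-16 pads 'Z' out to 5A 00.
print(0 in utf8)      # False
print(0 in utf16_le)  # True

# 2) UTF-16 has two byte orders that produce different byte streams;
#    UTF-8 has only one form, so there is nothing to get wrong.
print(utf16_le == utf16_be)  # False
```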

Even though some people's national characters require 2 to 3 bytes
and some ideographic characters extend to 4 bytes, UTF-8 is the one
encoding that is conceptually easy to handle, easy to store (despite
its variable-length nature), complete (it covers the whole Unicode
standard), and closest to the concept of a stream of bytes terminated
by a zero (C-string). The fact that a pure ASCII (7-bit) string encoded
in UTF-8 is byte-for-byte identical to the same string in ASCII is
also a nice bonus.
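
A last Python sketch (sample characters picked arbitrarily for
illustration) showing the 1 to 4 byte range and the ASCII compatibility:

```python
# One sample character for each UTF-8 length.
samples = ["A", "\u00e9", "\u20ac", "\U00020000"]
lengths = [len(s.encode("utf-8")) for s in samples]
print(lengths)  # [1, 2, 3, 4]

# A pure 7-bit ASCII string has the same bytes in ASCII and in UTF-8.
ascii_text = "Firebird"
same_bytes = ascii_text.encode("ascii") == ascii_text.encode("utf-8")
print(same_bytes)  # True
```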

--
Olivier Mascia