Subject: Re: [Firebird-Architect] UTF-8 Everywhere
Author: Jim Starkey
It actually isn't more efficient once punctuation, digits, and white space, all of which are single-byte codes in UTF-8, are taken into consideration.  As for coding, since multi-byte characters have to be handled in both cases, the code is close to identical.  But even if there were an advantage, it would be so slim as to be pointless.
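To make that concrete, here is a rough Java sketch (the sample sentence and class name are made up purely for illustration) comparing encoded sizes of a typical mixed Japanese string:

    import java.nio.charset.StandardCharsets;

    public class EncodingSizes {
        public static void main(String[] args) {
            // Hypothetical mixed Japanese sentence: CJK text plus the digits,
            // punctuation, and spaces that real-world text is full of.
            String text = "価格は 1,280 円です (税込)。";

            int utf8Bytes  = text.getBytes(StandardCharsets.UTF_8).length;
            // UTF_16LE avoids the 2-byte BOM that plain UTF_16 would prepend.
            int utf16Bytes = text.getBytes(StandardCharsets.UTF_16LE).length;

            // The CJK characters cost 3 bytes in UTF-8 against 2 in UTF-16,
            // but every ASCII digit, space, and parenthesis costs 1 against 2,
            // so the totals end up nearly identical for text like this.
            System.out.println("UTF-8:  " + utf8Bytes  + " bytes");
            System.out.println("UTF-16: " + utf16Bytes + " bytes");
        }
    }

For text like this, with embedded digits, spaces, and punctuation, the two totals come out nearly equal; UTF-16 only pulls meaningfully ahead for text that is almost entirely CJK.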

Sixteen-bit Unicode representations are a basic mistake from the days when it was assumed that Unicode could be represented in 16 bits.  It can't, and to salvage a bad decision, systems switched from "Unicode" to UTF-16.  This is exactly what happened in Java: a "char" was defined to hold a single 16 bit Unicode character.  At the time Java was launched, much was made of making characters 16 bits to avoid the hassle of having to code for multi-byte characters.  A noble aspiration, perhaps, but one dashed by reality.
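The consequence shows up directly in Java today: a character outside the Basic Multilingual Plane no longer fits in one char, so it occupies two.  A minimal sketch (the class name is made up):

    public class SixteenBitChar {
        public static void main(String[] args) {
            // U+10400 (DESERET CAPITAL LETTER LONG I) lies outside the Basic
            // Multilingual Plane, so it cannot fit in a single 16-bit char.
            String s = new String(Character.toChars(0x10400));

            System.out.println(s.length());                       // 2: two chars (a surrogate pair)
            System.out.println(s.codePointCount(0, s.length()));  // 1: one Unicode character
            System.out.println(Character.isSurrogatePair(s.charAt(0), s.charAt(1))); // true
        }
    }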

And it isn't just a storage issue -- it's a network issue as well.

A single internal encoding has so many benefits that it is hard to list them all, but the complete elimination of the need to track character set along with a string is surely at the top of the list.
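A rough sketch of what that elimination looks like (these class names are hypothetical, not Firebird's actual string types):

    // With several internal character sets, every string value has to carry a
    // charset tag, and every comparison or conversion has to consult it.
    final class TaggedString {
        final byte[] bytes;
        final String charsetName;

        TaggedString(byte[] bytes, String charsetName) {
            this.bytes = bytes;
            this.charsetName = charsetName;
        }
    }

    // With a single internal encoding, the tag and all the conversion paths
    // between charsets simply disappear.
    final class Utf8String {
        final byte[] bytes;   // always UTF-8, by definition

        Utf8String(byte[] bytes) {
            this.bytes = bytes;
        }
    }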


On 1/17/2014 10:18 PM, Paul Vinkenoog wrote:
 

Jim Starkey wrote:

> What is the case for UTF-16?

Much more efficient than UTF-8 for East-Asian languages.

> Or, more properly, what is the case for the two different UTF-16s?

There's only one UTF-16, but because UTF-16 is made up of 2-byte words,
endianness matters. The standard allows specifying the endianness like
this: UTF-16LE / UTF-16BE. But it's still one encoding, just like a
16-bit two's complement signed integer is one encoding. Endianness has
to do with storage, not encoding. (Still a pain in the ass though, but
one that we have learned to live with.)
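A minimal Java sketch of that point, showing the same code point serialized under both byte orders:

    import java.nio.charset.StandardCharsets;

    public class Utf16ByteOrder {
        public static void main(String[] args) {
            String s = "中";   // U+4E2D

            byte[] be = s.getBytes(StandardCharsets.UTF_16BE);
            byte[] le = s.getBytes(StandardCharsets.UTF_16LE);

            // Same encoding, same code point, opposite byte order on the wire.
            System.out.printf("UTF-16BE: %02X %02X%n", be[0] & 0xFF, be[1] & 0xFF); // 4E 2D
            System.out.printf("UTF-16LE: %02X %02X%n", le[0] & 0xFF, le[1] & 0xFF); // 2D 4E
        }
    }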

Paul Vinkenoog