Subject | Re: [Firebird-Architect] UTF-8 Everywhere |
---|---|
Author | Jim Starkey |
Post date | 2014-01-18T16:50:14Z |
Sixteen-bit Unicode representations are basic mistakes dating from when it was assumed that Unicode could be represented in 16 bits. It can't, and to salvage a bad decision, systems switched from "Unicode" to UTF-16. This is exactly what happened in Java -- a "char" was defined to hold a single 16-bit Unicode character. At the time Java was launched, much was made of making characters 16 bits to avoid the hassle of having to code for multi-byte characters. A noble aspiration, perhaps, but one dashed by reality.
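The Java point can be seen in a few lines. The following is only an illustrative sketch: the class name `SurrogateDemo`, the helper `fromCodePoint`, and the two sample code points are chosen for this example and are not from the original post. It shows that a code point outside the 16-bit range occupies two `char` units, so `length()` no longer counts characters.

```java
public class SurrogateDemo {
    // Hypothetical helper: builds a string holding exactly one code point.
    static String fromCodePoint(int codePoint) {
        return new String(Character.toChars(codePoint));
    }

    public static void main(String[] args) {
        String bmp = fromCodePoint(0x4E2D);     // U+4E2D fits in 16 bits
        String astral = fromCodePoint(0x1F600); // U+1F600 does not

        // length() counts 16-bit char units, not characters.
        System.out.println(bmp.length());                                // 1
        System.out.println(astral.length());                             // 2
        System.out.println(astral.codePointCount(0, astral.length()));   // 1
        System.out.println(Character.isHighSurrogate(astral.charAt(0))); // true
    }
}
```

Code that indexes by `char` therefore has to handle multi-unit characters anyway, which is the hassle the 16-bit `char` was meant to avoid.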
And it isn't just a storage issue -- it's a network issue as well.
A single internal encoding has so many benefits that it is hard to list them all, but the complete elimination of the need to track character set along with a string is surely at the top of the list.
On 1/17/2014 10:18 PM, Paul Vinkenoog wrote:
Jim Starkey wrote:
> What is the case for UTF-16?
Much more efficient than UTF-8 for East-Asian languages.
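As a rough illustration of the size argument, the sketch below compares encoded lengths for two sample strings. The class name `EncodedSize`, its helpers, and the sample strings are assumptions made for this example; real data mixing ASCII and CJK text will fall somewhere in between.

```java
import java.nio.charset.StandardCharsets;

public class EncodedSize {
    // Number of bytes the string occupies when encoded as UTF-8.
    static int utf8Bytes(String s) {
        return s.getBytes(StandardCharsets.UTF_8).length;
    }

    // Number of bytes as UTF-16; the LE variant is used so no byte order mark is added.
    static int utf16Bytes(String s) {
        return s.getBytes(StandardCharsets.UTF_16LE).length;
    }

    public static void main(String[] args) {
        String cjk = "\u65E5\u672C\u8A9E"; // three CJK characters
        String ascii = "SELECT";           // six ASCII characters

        System.out.println(utf8Bytes(cjk) + " vs " + utf16Bytes(cjk));     // 9 vs 6
        System.out.println(utf8Bytes(ascii) + " vs " + utf16Bytes(ascii)); // 6 vs 12
    }
}
```

For these samples UTF-16 is smaller for the CJK string and UTF-8 is smaller for the ASCII string.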
> Or, more properly, what is the case for the two different UTF-16s?
There's only one UTF-16, but because UTF-16 is made up of 2-byte words,
endianness matters. The standard allows specifying the endianness like
this: UTF-16LE / UTF-16BE. But it's still one encoding, just like a
16-bit two's complement signed integer is one encoding. Endianness has
to do with storage, not encoding. (Still a pain in the ass though, but
one that we have learned to live with.)
Paul Vinkenoog
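The endianness point in the quoted message can be made concrete with a small sketch. The class name `Utf16ByteOrder`, the `hex` helper, and the choice of the euro sign as sample character are assumptions for illustration only. The same 16-bit code unit serializes to two different byte sequences under UTF-16BE and UTF-16LE, while UTF-8 has a single byte order.

```java
import java.nio.charset.StandardCharsets;

public class Utf16ByteOrder {
    // Hypothetical helper: renders bytes as space-separated uppercase hex.
    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) sb.append(' ');
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String euro = "\u20AC"; // U+20AC, a single 16-bit code unit

        System.out.println(hex(euro.getBytes(StandardCharsets.UTF_16BE))); // 20 AC
        System.out.println(hex(euro.getBytes(StandardCharsets.UTF_16LE))); // AC 20
        System.out.println(hex(euro.getBytes(StandardCharsets.UTF_8)));    // E2 82 AC
    }
}
```

A reader of the UTF-16 bytes has to know or detect which order was used; a reader of the UTF-8 bytes does not.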