firebird-architect - RE: [Firebird-Architect] UTF-8 over UTF-16 WAS: Applications of Encoded Data Streams

Subject	RE: [Firebird-Architect] UTF-8 over UTF-16 WAS: Applications of Encoded Data Streams
Author	Svend Meyland Nicolaisen
Post date	2005-05-03T20:23:02Z

> -----Original Message-----
> From: Firebird-Architect@yahoogroups.com
> [mailto:Firebird-Architect@yahoogroups.com] On Behalf Of Jim Starkey
> Sent: 3. maj 2005 14:01
> To: Firebird-Architect@yahoogroups.com
> Subject: Re: [Firebird-Architect] UTF-8 over UTF-16 WAS:
> Applications of Encoded Data Streams
>
>
>
> Svend Meyland Nicolaisen wrote:
>
> >
> >
> >>Do you have any statistical data to show that UTF-16 consumes fewer
> >>bytes than UTF-8?
> >>
> >>
> >>
> >
> >No, I have no statistics ready. I suppose that if you mainly use
> >characters from the Latin 1 character set then UTF-8 will be more
> >compact than UTF-16. But texts containing Japanese or Thai
> >characters seem to be more compact in UTF-16.
> >
> >
> >
> You may have a point; there are many more Chinese than
> Europeans, so a global optimization suggests that a bias
> towards larger character sets is warranted. But the
> distribution of current users heavily favors the Latin
> character distribution, and single bytes are a lot easier to
> handle than shorts. In the face of hard questions, go for
> simplicity.

I would think that even text in Latin 2 or Cyrillic character sets would be
more compact in UTF-16 than in UTF-8, so it is not only East Asian character
sets that would favour UTF-16. I don't have any exact statistics to prove
it, though. If somebody could send me some test documents in various
languages/locales, I would be happy to produce those statistics. They would
need to be based on representative texts in character sets like Cyrillic,
Latin 2, Latin 1 and Thai, and the documents need to be encoded in UTF-8 or
UTF-16. :-)
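
A rough sketch of how such a comparison could be made, in Python. The
function name encoded_sizes and the sample strings are only illustrative
assumptions; the strings are short placeholders and not representative texts,
so the numbers say nothing final about real documents. UTF-16 is counted
without a byte order mark.

```python
def encoded_sizes(text):
    """Return (utf8_bytes, utf16_bytes) for a string; hypothetical helper."""
    utf8 = len(text.encode("utf-8"))
    # "utf-16-le" is used so that no 2-byte BOM is added to the count.
    utf16 = len(text.encode("utf-16-le"))
    return utf8, utf16


# Placeholder samples only; real statistics need representative documents.
samples = {
    "Latin 1": "Smørrebrød på rugbrød",
    "Latin 2": "Zażółć gęślą jaźń",
    "Cyrillic": "Привет, мир",
    "Thai": "สวัสดี",
}

for name, text in samples.items():
    utf8, utf16 = encoded_sizes(text)
    print(f"{name:9} chars={len(text):3} utf8={utf8:3} utf16={utf16:3}")
```

Spaces and punctuation are ASCII and take one byte in UTF-8 but two in
UTF-16, so the share of such characters in a text affects the result.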

/Svend