Subject Re: [Firebird-Architect] UTF-8 Everywhere
Author Ann Harrison
Jason, 
 
Having a strictly UTF8 core could radically simplify many things. I see it as simply a matter of where you want to push all of the complexity. I'd like to hear more about what all the trade-offs would be.

Most of the complexity could go into the client interface - pushing conversions between character sets entirely to the client side.  Everything on the database side could be just UTF8.
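
A rough sketch of what pushing the conversion to the client might look like, in Python. The function names to_server and from_server are made up for illustration and are not part of any Firebird interface; the only assumption is that the client knows its own character set and the server accepts and returns nothing but UTF8.

```python
def to_server(text_bytes, client_charset):
    """Transcode bytes from the client's character set to UTF-8 before sending.

    Hypothetical helper: decode with the client's charset, re-encode as UTF-8.
    """
    return text_bytes.decode(client_charset).encode("utf-8")


def from_server(utf8_bytes, client_charset):
    """Transcode UTF-8 received from the server into the client's character set.

    Characters the client charset cannot represent are replaced with '?',
    which is one possible policy; a real client might prefer to raise an error.
    """
    return utf8_bytes.decode("utf-8").encode(client_charset, errors="replace")
```

The second function shows where the complexity ends up: the client has to decide what to do with stored characters that its own character set cannot represent.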

The major tradeoff is storage.  The single byte part of UTF8 is approximately the unaccented Latin characters, plus common punctuation and digits, so the cost for English strings is low.  Many phonetic alphabets are represented in a single byte in national character sets, so UTF8 increases the storage size somewhat for many western European languages, because "decorated" Latin characters require two bytes.  For non-Latin alphabets it nearly doubles the storage size.

The leading bit(s) of each byte in UTF8 are indicators - on the first byte they give the length of the sequence, and on the following bytes they mark a continuation - so a single byte has room for only 7 bits of "data".  My recollection is that the two byte section of UTF8 includes accented Latin, Greek, Cyrillic, Hebrew, and Arabic, while the major ideographic scripts fall in the three byte section.  Generally, text in UTF8 requires less storage than UTF16 when one byte characters (digits, punctuation) are common and three byte characters are rare; for text that is mostly ideographic, UTF8 needs three bytes per character where UTF16 needs two.  Still, for languages that use the Greek or Cyrillic alphabet, there's a significant increase in storage required when going from a national character set to UTF8.
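
To put rough numbers on the storage cost, here is a small Python sketch that encodes the same text in a national character set, UTF8, and UTF16 and counts the bytes. The sample strings and the choice of national character sets (Latin-1, ISO 8859-7, Windows-1251) are my own illustrative assumptions, and the helper names are hypothetical; only the codecs themselves come from Python's standard library.

```python
import unicodedata

# Illustrative samples only: (text, a single-byte national character set).
SAMPLES = {
    "English": ("Firebird database", "latin_1"),
    "French": ("élève préféré", "latin_1"),
    "Greek": ("ΕΛΛΑΔΑ", "iso8859_7"),
    "Russian": ("база данных", "cp1251"),
}


def storage_sizes(text, national_charset):
    """Return the encoded size in bytes under three encodings."""
    # Single-byte charsets only know precomposed characters, so normalize first.
    text = unicodedata.normalize("NFC", text)
    return {
        "national": len(text.encode(national_charset)),
        "utf8": len(text.encode("utf-8")),
        "utf16": len(text.encode("utf-16-le")),  # -le so no BOM is counted
    }


def utf8_sequence_length(lead_byte):
    """Read the sequence length from the high bits of a UTF-8 lead byte.

    Simplified: it checks only the bit pattern and does not reject
    overlong or otherwise invalid lead bytes.
    """
    if lead_byte < 0x80:   # 0xxxxxxx: 7 bits of data, one byte
        return 1
    if lead_byte < 0xC0:   # 10xxxxxx: continuation byte, not a lead byte
        raise ValueError("continuation byte")
    if lead_byte < 0xE0:   # 110xxxxx
        return 2
    if lead_byte < 0xF0:   # 1110xxxx
        return 3
    if lead_byte < 0xF8:   # 11110xxx
        return 4
    raise ValueError("invalid lead byte")


if __name__ == "__main__":
    for language, (text, charset) in SAMPLES.items():
        print(language, storage_sizes(text, charset))
```

Running it prints one line per language, so the English, accented Latin, Greek, and Cyrillic cases can be compared side by side.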

As for this change going into Firebird 3, not if you expect to see Firebird 3 any time soon.

Best regards,

Ann