firebird-support - Re: [firebird-support] UTF8 in firebird ?

Subject	Re: [firebird-support] UTF8 in firebird ?
Author	Mark Rotteveel
Post date	2012-01-06T09:47:15Z

On Fri, 06 Jan 2012 01:07:04 +0400, Vander Clock Stephane
<svanderclock@...> wrote:

>> No, you cannot use a column defined as ISO-8859-1 to store UTF8, because
>> 1) ISO-8859-1 does not contain all UTF characters, and 2) some bytes in
>> ISO-8859-1 do not represent any characters. You could however use a
>> column defined as CHARACTER SET OCTETS to store the byte representation
>> of a UTF string, but then you would need to take care of decoding
>> yourself.
>>
>
> no, you can store in iso-8859-1 ALL the UTF8 char :)
> this is the purpose of utf8, to stay compatible with all the previous
> system.

No, it isn't possible. You could attempt to store Unicode codepoints in
ISO-8859-1 by inventing your own encoding, but you cannot store UTF-8
encoded characters in ISO-8859-1, because the multi-byte encodings do not
fit in a single-byte ISO-8859-1 character. And you cannot use multiple
ISO-8859-1 characters to store the encoding either, because some bytes
(7F - 9F) are not allowed in ISO-8859-1 (Windows-1252, which is based on
ISO-8859-1, does assign characters to most of 80 - 9F).
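
To make the byte-range point concrete, here is a minimal Python sketch. It
only illustrates the encodings; it says nothing about what Firebird does
internally. The helper name bytes_outside_iso_8859_1 is made up for this
example, and the unassigned ranges (00 - 1F and 7F - 9F) are those of the
ISO/IEC 8859-1 standard itself.

```python
def bytes_outside_iso_8859_1(text: str) -> list:
    """Return the UTF-8 bytes of `text` that ISO/IEC 8859-1 assigns no character to."""
    encoded = text.encode("utf-8")
    return [b for b in encoded if b <= 0x1F or 0x7F <= b <= 0x9F]


# U+0142 (Polish l with stroke) is encoded in UTF-8 as the two bytes C5 82.
polish_l = "\u0142"
print(polish_l.encode("utf-8").hex())

# 0x82 lies in the 7F - 9F gap, so it is not an ISO/IEC 8859-1 character.
print([hex(b) for b in bytes_outside_iso_8859_1(polish_l)])
```

Plain ASCII text yields an empty list, which is why the problem only shows
up once non-ASCII characters are stored.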

> UTF8 use only ascii > 127 to encode special char. but as i know
> you i m sure you already know it before ... i just speak here about
> storage, not decoding ....

If you talk about storage of UTF8 without using actual UTF8, you need to
use CHARACTER SET OCTETS.
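
As a rough sketch of what "taking care of it yourself" means on the client
side, assuming the column simply holds raw bytes with a fixed maximum
length: the function names to_octets and from_octets and the max_bytes
parameter are hypothetical and not part of any Firebird driver.

```python
def to_octets(text: str, max_bytes: int) -> bytes:
    """Encode text as UTF-8 for a byte-oriented column of at most max_bytes bytes."""
    data = text.encode("utf-8")
    if len(data) > max_bytes:
        raise ValueError(f"{len(data)} bytes do not fit in {max_bytes}")
    return data


def from_octets(data: bytes) -> str:
    """Decode bytes read back from the column; raises UnicodeDecodeError if invalid."""
    return data.decode("utf-8")


city = "\u0141\u00f3d\u017a"       # the Polish city name, 4 characters
stored = to_octets(city, 10)
print(len(city), len(stored))      # character count versus byte count
print(from_octets(stored) == city)
```

Note that the size limit is now in bytes, so the application has to decide
how many characters it is willing to guarantee.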

>> UTF-8 is a variable encoding that requires 1 to 4 bytes to encode
>> characters (theoretically 5 and 6 bytes is possible as well, but this is
>> unused for compatibility with the number of characters that can be
>> encoded in UTF-16). This means that Firebird will use up to 4 bytes per
>> character in the DB and - afaik - 4 bytes per character in memory because
>> of the way the memory buffer is allocated.
>>
>
> take this example: in html all special chars are handled like &eacute;
> &lt; etc... does that mean that i will need to x 5 the size of my varchar
> field that i use to store html encoded text ?? of course not, except if i
> store cyrillic or chinese chars ...

That is not comparable at all as they are escape sequences not character
encodings (and if you use UTF8 as your page encoding for HTML, you don't
need to use most escape sequences).
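
A small Python sketch of the difference, using the standard library's
html.unescape. The entity &eacute; is just a sample value; the point is
that the escape sequence is ordinary ASCII text, while the character it
stands for has its own size in each character encoding.

```python
import html

escaped = "&eacute;"            # escape sequence: 8 ASCII characters
char = html.unescape(escaped)   # the single character U+00E9

print(len(escaped.encode("utf-8")))    # bytes needed for the escape sequence
print(len(char.encode("utf-8")))       # bytes needed for the character in UTF-8
print(len(char.encode("iso-8859-1")))  # bytes needed in ISO-8859-1
```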

>>
>> If you target Portugal, Spain, France and Italy, then ISO-8859-1 should
>> be
>> enough for your needs.
>
> but this is exactly the purpose of UTF8 !
> it's why they use only char > 127 to encode the special char
> and let untouched all the ascii < 127 ! so why i will need to make some
> nightmare conversion between polish char, spanish char, french char etc..
> in ISO8859-1 when UTF8 is defined exactly for this !

Spanish and French work just fine with ISO-8859-1, if you also need
Polish, then yes you will definitely need UTF8.
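
If you want to check this for your own data, something like the following
Python sketch would do. The helper name fits_charset and the sample words
are only illustrations, and Python's "iso-8859-1" codec is used as a
stand-in for the column character set.

```python
def fits_charset(text: str, charset: str = "iso-8859-1") -> bool:
    """Return True if every character of `text` can be encoded in `charset`."""
    try:
        text.encode(charset)
    except UnicodeEncodeError:
        return False
    return True


samples = {
    "Spanish": "a\u00f1o",                 # n with tilde
    "French": "\u00e9t\u00e9",             # e with acute accent
    "Polish": "\u017c\u00f3\u0142\u0107",  # z-dot, o-acute, l-stroke, c-acute
}
for language, word in samples.items():
    print(language, fits_charset(word))
```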

>> That is not how it is supposed to work. You define the size in characters
>> as specified in the standards, not in bytes. If that is a problem for
>> you, you should look at using CHARACTER SET OCTETS and handle your own
>> decoding and encoding from and to UTF8.
>>
>> I think things should (and probably can) be improved, but not by breaking
>> the SQL standards.
>>
>
> i will simply say :
>
> "standard is just a word ! a system that is 2 times more slower is a fact
> !"

Do not confuse efficiency of implementation with breaking standards to
achieve that efficiency.

> with you prefer ? me i know what i will not prefer :)

I prefer standards compliance first, performance second.

> and even you say yourself, in the true of the true standard, utf8 must
> be encoded
> in up to 6 char even ! :)

That is not what I said. UTF-8 encoding was originally devised to allow
for encoding 2^31 codepoints using a variable length encoding of 1 to 6
bytes (which was afaik the entire range of codepoints planned at the time),
but because UTF16 only encodes 2^16 values in a single 16-bit unit and uses
surrogate pairs for higher order codepoints, which limits it to U+10FFFF,
the decision was made by the standards committee to only use UTF-8 encoding
up to 4 bytes, so that the same range of characters as UTF16 can be encoded,
which makes conversion between UTF16 and UTF8 easier.
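
For illustration, a short Python sketch that shows the sizes involved. The
helper name encoded_sizes and the sample characters are arbitrary choices
for this example.

```python
def encoded_sizes(ch: str) -> tuple:
    """Return (UTF-8 byte count, UTF-16 code unit count) for one character."""
    utf8_bytes = len(ch.encode("utf-8"))
    utf16_units = len(ch.encode("utf-16-le")) // 2
    return utf8_bytes, utf16_units


# ASCII letter, e-acute, euro sign, an emoji outside the 16-bit range,
# and the highest codepoint that UTF-16 can reach.
for ch in ("A", "\u00e9", "\u20ac", "\U0001F600", "\U0010FFFF"):
    print(f"U+{ord(ch):04X}", encoded_sizes(ch))
```

Codepoints above U+FFFF take 4 bytes in UTF-8 and one surrogate pair (two
16-bit units) in UTF-16, and nothing above U+10FFFF exists in either.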

Mark