Subject | Re: [firebird-support] UTF8 in firebird ?
---|---
Author | Mark Rotteveel
Post date | 2012-01-06T09:47:15Z
On Fri, 06 Jan 2012 01:07:04 +0400, Vander Clock Stephane
<svanderclock@...> wrote:
>> No, you cannot use a column defined as ISO-8859-1 to store UTF8, because
>> 1) ISO-8859-1 does not contain all UTF characters, and 2) some bytes in
>> ISO-8859-1 do not represent any characters. You could however use a
>> column defined as CHARACTER SET OCTETS to store the byte representation
>> of a string, but then you would need to take care of decoding yourself.
>>
>
> no, you can store in iso-8859-1 ALL the UTF8 char :)
> this is the purpose of utf8, to stay compatible with all the previous
> system.

No, it isn't possible. You could attempt to store Unicode codepoints in
ISO-8859-1 by inventing your own encoding, but you cannot store UTF-8
encoded characters in ISO-8859-1, because the multi-byte sequences do not
fit in a single ISO-8859-1 byte. Nor can you spread the encoding over
multiple ISO-8859-1 characters, because some bytes (80-9F) are not allowed
in ISO-8859-1 (they are used for printable characters in Windows-1252,
which is based on ISO-8859-1 but also assigns characters to 80-9F).
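To make the byte-level argument concrete, here is a small Python sketch (mine, not from the thread) showing why UTF-8 bytes are not ISO-8859-1 characters:

```python
# UTF-8 encodes non-ASCII characters as multi-byte sequences, so the bytes
# of one character do not map back to the same character in ISO-8859-1.
text = "café"                       # 4 characters
utf8_bytes = text.encode("utf-8")   # 5 bytes: 'é' became the pair C3 A9
assert len(utf8_bytes) == 5

# Reinterpreting those bytes as ISO-8859-1 yields mojibake, not 'café':
mojibake = utf8_bytes.decode("iso-8859-1")
assert mojibake == "cafÃ©"

# Some UTF-8 continuation bytes even fall in the 80-9F range that
# ISO-8859-1 reserves for control characters:
assert "œ".encode("utf-8") == b"\xc5\x93"   # 0x93 is inside 80-9F
```

Note that Python's "iso-8859-1" codec maps all 256 byte values (treating 80-9F as C1 control characters), so the decode succeeds but silently produces garbage rather than failing.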
> UTF8 use only ascii > 127 to encode special char. but as i know
> you i m sure you already know it before ... i just speak here about
> storage, not decoding ....

If you talk about storage of UTF8 without using actual UTF8, you need to
use CHARACTER SET OCTETS.
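A minimal sketch (my own, using plain Python bytes in place of an actual OCTETS column; no Firebird API involved) of what "take care of decoding yourself" means:

```python
# With CHARACTER SET OCTETS the database stores opaque bytes, so the
# application must encode before storing and decode after fetching.
def to_column(text: str) -> bytes:
    return text.encode("utf-8")

def from_column(raw: bytes) -> str:
    return raw.decode("utf-8")

# Round trip: Polish text survives because only the application
# interprets the bytes, never the database.
assert from_column(to_column("żółć")) == "żółć"
```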
>> UTF-8 is a variable length encoding that requires 1 to 4 bytes to encode
>> characters (theoretically 5 and 6 bytes is possible as well, but this is
>> unused for compatibility with the number of characters that can be
>> encoded in UTF-16). This means that Firebird will use up to 4 bytes per
>> character in the DB and - afaik - 4 bytes per character in memory
>> because of the way the memory buffer is allocated.
>>
>
> take this exemple: in html all special char are handle like &ecute; <
> etc... did that mean that i will need to x 5 the size of my varchar
> field that i use to store html encoded text ?? of course not except if
> i store cyrillic or chinese char ...

That is not comparable at all, as those are escape sequences, not
character encodings (and if you use UTF8 as your page encoding for HTML,
you don't need to use most escape sequences).
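The layering difference can be shown in Python (my illustration, not from the post): an HTML escape sequence is markup made of ASCII characters, while UTF-8 is a byte encoding of the character itself.

```python
s = "café"

# Escape-sequence form: 'é' becomes the 6-character ASCII markup '&#233;'
entity_form = s.encode("ascii", errors="xmlcharrefreplace")
assert entity_form == b"caf&#233;"          # 9 bytes of pure ASCII

# Encoding form: 'é' becomes the 2-byte UTF-8 sequence C3 A9
utf8_form = s.encode("utf-8")
assert utf8_form == b"caf\xc3\xa9"          # 5 bytes
```

On a UTF-8 page no entity is needed for 'é' at all; the escape layer is optional markup on top of whatever encoding the page uses.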
>> If you target Portugal, Spain, France and Italy, then ISO-8859-1 should
>> be enough for your needs.
>
> but this is exactly the purpose of UTF8 !
> it's why they use only char > 127 to encode the special char
> and let untouched all the ascii < 127 ! so why i will need to make some
> nightmare conversion between polish char, spanish char, french char
> etc.. in ISO8859-1 when UTF8 is defined exactly for this !

Spanish and French work just fine with ISO-8859-1; if you also need
Polish, then yes, you will definitely need UTF8.
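This is easy to check (a Python sketch of mine): ISO-8859-1 covers Spanish and French letters, but Polish letters fall outside it.

```python
# Spanish and French characters exist in ISO-8859-1 ...
assert "ñ".encode("iso-8859-1") == b"\xf1"
assert "é".encode("iso-8859-1") == b"\xe9"

# ... but Polish 'ł' (U+0142) does not, so the encode fails:
try:
    "ł".encode("iso-8859-1")
    raise AssertionError("should not have encoded")
except UnicodeEncodeError:
    pass

# UTF-8 handles all of them within a single character set:
assert "ł".encode("utf-8") == b"\xc5\x82"
```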
>> That is not how it is supposed to work. You define the size in
>> characters as specified in the standards, not in bytes. If that is a
>> problem for you, you should look at using CHARACTER SET OCTETS and
>> handle your own decoding and encoding from and to UTF8.
>>
>> I think things should (and probably can) be improved, but not by
>> breaking the SQL standards.
>
> i will simply say :
>
> "standard is just a word ! a system that is 2 times slower is a fact !"

Do not confuse efficiency of implementation with breaking standards to
achieve that efficiency.

> with you prefer ? me i know what i will not prefer :)

I prefer standards compliance first, performance second.
> and even you say yourself, in the true of the true standard, utf8 must
> be encoded in up to 6 char even ! :)

That is not what I said. UTF-8 encoding was originally devised to allow
for encoding 2^31 - 1 codepoints using a variable length encoding of 1 to
6 bytes. But because UTF-16 encodes the Basic Multilingual Plane in single
16-bit units and uses surrogate pairs for higher codepoints (reaching up
to U+10FFFF), the standards committee decided to restrict UTF-8 to at most
4 bytes, so that it encodes the same range of codepoints as UTF-16, which
makes conversion between UTF-16 and UTF-8 easier.
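The 4-byte ceiling and the UTF-16 pairing described above can be checked directly (a Python sketch of mine):

```python
# UTF-8 length per character: 1 byte (ASCII) up to 4 bytes (above U+FFFF).
for ch, nbytes in {"A": 1, "é": 2, "€": 3, "😀": 4}.items():
    assert len(ch.encode("utf-8")) == nbytes

# UTF-16 stores U+0000..U+FFFF in one 16-bit unit and everything above
# (up to U+10FFFF) in a surrogate pair, i.e. also at most 4 bytes:
assert len("é".encode("utf-16-le")) == 2    # one code unit
assert len("😀".encode("utf-16-le")) == 4   # one surrogate pair

# Hence restricting UTF-8 to 4 bytes gives both encodings the same range.
assert ord("😀") <= 0x10FFFF
```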
Mark