firebird-support - Re: [firebird-support] UTF8 in firebird ?

Subject	Re: [firebird-support] UTF8 in firebird ?
Author	Vander Clock Stephane
Post date	2012-01-05T21:07:04Z

> No, you cannot use a column defined as ISO-8859-1 to store UTF8, because
> 1) ISO-8859-1 does not contain all UTF characters, and 2) some bytes in
> ISO-8859-1 do not represent any characters. You could however use a column
> defined as CHARACTER SET OCTETS to store the byte representation of a UTF
> string, but then you would need to take care of decoding yourself.
>

No, you can store ALL the UTF8 chars in an ISO-8859-1 column :)
This is the purpose of UTF8: to stay compatible with all the previous
systems.
UTF8 uses only byte values > 127 to encode special chars. But as I know
you, I'm sure you already knew that ... I am only speaking here about
storage, not decoding ...
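
A minimal sketch of the storage-only argument, in Python rather than in
Firebird: the UTF8 bytes of a string are reinterpreted as Latin-1 chars and
then read back. The function name is hypothetical. Note the assumption:
Python's latin-1 codec maps all 256 byte values, so this says nothing about
whether a Firebird ISO8859_1 column accepts every byte value, which is the
objection raised in the quoted text.

```python
def utf8_through_latin1(text):
    """Round-trip text through a pretend single-byte (Latin-1) column."""
    raw = text.encode("utf-8")            # bytes the client would send
    stored = raw.decode("latin-1")        # what a Latin-1 column "sees"
    return stored.encode("latin-1").decode("utf-8")

for sample in ["café", "привет", "中文"]:
    assert utf8_through_latin1(sample) == sample

# The column counts bytes, not characters: one accented char occupies 2 slots.
print(len("\u00e9".encode("utf-8").decode("latin-1")))  # 2
```

The round trip keeps the bytes intact, but lengths, collation and upper/lower
casing would all operate on bytes instead of characters.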

>
> UTF-8 is a variable encoding that requires 1 to 4 bytes to encode
> characters (theoretically 5 and 6 bytes is possible as well, but this is
> unused for compatibility with the number characters that can be encoded in
> UTF-16). This means that Firebird will use up to 4 bytes per character in
> the DB and - afaik - 4 bytes per character in memory because of the way
> the memory buffer is allocated.
>

Take this example: in HTML all special chars are handled like &eacute;,
&lt;, etc. Does that mean that I need to multiply by 5 the size of the
varchar field that I use to store HTML-encoded text? Of course not,
unless I store Cyrillic or Chinese chars ...
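
The HTML analogy can be sketched in Python as follows. This is only an
illustration: the function name is hypothetical, and numeric character
references (such as &#233;) stand in for named entities like &eacute;.

```python
import html

def entity_encoded_length(text):
    """Length of text once every non-ASCII char is a numeric reference."""
    return len(text.encode("ascii", "xmlcharrefreplace"))

print(entity_encoded_length("ete"))            # 3, plain ASCII is untouched
print(entity_encoded_length("\u00e9t\u00e9"))  # 13, i.e. &#233;t&#233;
print(html.unescape("&eacute;") == "\u00e9")   # True
```

Only the special chars grow; the ASCII part of the text keeps its size, so
the needed field size depends on the content, not on the worst case.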

>
> If you target Portugal, Spain, France and Italy, then ISO-8859-1 should be
> enough for your needs.
>

But this is exactly the purpose of UTF8!
That's why it uses only bytes > 127 to encode the special chars
and leaves all the ASCII chars (0 to 127) untouched! So why would I need
to do some nightmare conversion between Polish chars, Spanish chars,
French chars, etc. in ISO8859-1 when UTF8 is defined exactly for this?

UTF8 is not ideal for Russian or Chinese chars, where 100% of the chars
need to be encoded in several bytes (for these it's mostly UTF16). UTF8
is perfect just for Latin chars, where only 20% of the chars need to be
encoded.
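
To put numbers on the comparison, here is a small Python sketch that
measures the encoded size of a few sample strings. The function name and
the samples are illustrative only, and UTF-16 is measured without a byte
order mark.

```python
def encoded_sizes(text):
    """Return (UTF-8 byte count, UTF-16 byte count) for text."""
    return len(text.encode("utf-8")), len(text.encode("utf-16-le"))

print(encoded_sizes("hello"))    # (5, 10)
print(encoded_sizes("привет"))   # (12, 12)
print(encoded_sizes("中文"))      # (6, 4)
```

In these samples the ASCII text is half the size in UTF8, the Cyrillic text
is the same size in both encodings, and the Chinese text is smaller in
UTF16.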

>
> Unfortunately, as Ann indicates, the RLE used by Firebird is per byte, and
> not per character. This means that the compression is less efficient than
> with single byte charactersets because of the way codepoints above 127 are
> encoded, and I believe that the remaining 0x00 bytes at the end of the
> string are also RLE encoded and stored (which I think is something which
> could and should be changed).
>

Yes, but in the current state it's 2x better to use a single-byte
character set (OCTETS or ISO8859-1, for example) to store UTF8 chars ...
SAD :(
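
A toy illustration of why trailing padding matters to a byte-wise RLE. This
is NOT Firebird's on-disk format: the (run length, byte) pair scheme, the
255 cap and the function names are all assumptions made for the sketch, as
is the idea (taken from the quoted text) that the buffer is zero-padded to
its declared size.

```python
def rle_encode(data):
    """Toy byte-wise RLE: a sequence of (run_length, byte) pairs, runs <= 255."""
    out = bytearray()
    i = 0
    while i < len(data):
        run = 1
        while i + run < len(data) and data[i + run] == data[i] and run < 255:
            run += 1
        out += bytes((run, data[i]))
        i += run
    return bytes(out)

def padded(value, buffer_size):
    """Zero-pad value to the full buffer size."""
    return value.ljust(buffer_size, b"\x00")

value = b"abcdefghijklmnopqrst"                 # 20 distinct ASCII bytes
single_byte = rle_encode(padded(value, 250))    # varchar(250), 1 byte/char
four_byte = rle_encode(padded(value, 1000))     # varchar(250), 4 bytes/char
print(len(single_byte), len(four_byte))         # 42 48
```

The same 20 bytes of data cost more to store in the larger buffer, because
the longer run of zeros needs more run tokens.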

> > and to finish I added (only) 64000 records in both tables (only with
> > varchar containing ASCII between a..z)
>
> What is the exact content and length and are there repeating characters in
> it?
>

Random length and random ASCII chars (from a to z), but the exact same
lengths in both databases.
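
For anyone who wants to reproduce this kind of test data, a possible Python
sketch follows. It is not the script used for the numbers in this thread:
the function name, the fixed seed and the maximum length of 250 are
assumptions. The fixed seed is there so both databases can be loaded with
identical rows.

```python
import random
import string

def make_rows(count, max_len, seed=0):
    """Return `count` random a..z strings of random length 1..max_len."""
    rng = random.Random(seed)  # fixed seed: same rows for both databases
    return [
        "".join(rng.choices(string.ascii_lowercase, k=rng.randint(1, max_len)))
        for _ in range(count)
    ]

rows = make_rows(count=64000, max_len=250)
print(len(rows))  # 64000
```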

> > so the utf8 database is around 35% bigger than the ISO8859_1 database!
>
> What happens when you backup and restore the databases?
>

It stays the same.

>
> > select count(*) from TEST_A
> > in iso8859_1: 212 ms
> > in utf8: 382 ms !!! Up to 80% slower !!!!
>
> I am not 100% sure, but this probably has to do with the fact that Firebird
> will need to allocate larger buffers in memory for UTF-8 characters.
>

Yes, probably, but it's far from being a good UTF8 implementation :(

> > when I declare varchar(250) in utf8 I want to reserve 250 bytes, not
> > 1000 bytes, and I know (like for example in HTML)
> > that some chars can be encoded in more than one byte! If I know that I
> > will handle Russian chars, I will set it up
> > as varchar(750), and if I know that I will handle only Latin languages I
> > will set it up as varchar(300) ...
> > this setup must be done only by the database administrator ...
>
> That is not how it is supposed to work. You define the size in characters
> as specified in the standards, not in bytes. If that is a problem for you,
> you should look at using CHARACTER SET OCTETS and handle your own decoding
> and encoding from and to UTF8.
>
> I think things should (and probably can) be improved, but not by breaking
> the SQL standards.
>

I will simply say:

"Standard is just a word! A system that is 2 times slower is a fact!"

Which do you prefer? Me, I know which one I will not prefer :)

And as you say yourself, by the original definition a UTF8 char could
even be encoded in up to 6 bytes! :)
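
The two sizing models discussed above can be written down as a short Python
sketch. Both function names are hypothetical; the per-char maximums of 1
byte (ISO8859_1) and 4 bytes (UTF8) are the figures quoted in this thread.

```python
def reserved_bytes(declared_chars, max_bytes_per_char):
    """Buffer size when the column length is declared in characters."""
    return declared_chars * max_bytes_per_char

def fits_byte_budget(text, budget_bytes):
    """Byte-oriented model: does the UTF8 form of text fit the budget?"""
    return len(text.encode("utf-8")) <= budget_bytes

print(reserved_bytes(250, 1))   # 250
print(reserved_bytes(250, 4))   # 1000
print(fits_byte_budget("a" * 250, 250))       # True
print(fits_byte_budget("\u00e9" * 250, 250))  # False, needs 500 bytes
```

With a byte budget, whether a value fits depends on its content, which is
why the standard defines the length in characters instead.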
