firebird-support - Re: [firebird-support] UTF8 in firebird ?

Subject	Re: [firebird-support] UTF8 in firebird ?
Author	Mark Rotteveel
Post date	2012-01-05T20:21:46Z

On Thu, 05 Jan 2012 21:10:15 +0400, Vander Clock Stephane
<svanderclock@...> wrote:

> I want to know if UTF8 is a good in Firebird so i do some
> tests. can you gave me your opinion ?
>
> I do some benchmark of UTF8 database vs ISO8859_1 database.
> I think the way that Firbird handle utf8 database is not optimal
>
> let speak first about the utf8
> UTF8 it's just a way to encode special character like è à etc ..
> for this utf8 will use combination of char upper than ascii #127
> In this way, and it's not the less, UTF8 stay compatible with all
> software that work with 8 bit string. thank to this we can use
> a firebird database defined as iso8859_1 to store UTF8

No, you cannot use a column defined as ISO-8859-1 to store UTF8, because
1) ISO-8859-1 does not contain all UTF characters, and 2) some bytes in
ISO-8859-1 do not represent any characters. You could however use a column
defined as CHARACTER SET OCTETS to store the byte representation of a UTF
string, but then you would need to take care of decoding yourself.

> Now The wrong choice about utf8 in firebird
> Firebird consider than it need 4 bytes (to confirm i was thinking 3)
> to store 1 utf8 character that true when we speak about cyrillic
> char for exemple but not when we speak about latin language (french,
> italian,
> spanish, etc.). In these language most of the char need only one byte
> in utf8 (a, b, c, etc.) and only around 20% (é, à, etc.) of the
> char need 2 bytes in UTF8

UTF-8 is a variable encoding that requires 1 to 4 bytes to encode
characters (theoretically 5 and 6 bytes is possible as well, but this is
unused for compatibility with the number characters that can be encoded in
UTF-16). This means that Firebird will use upto 4 bytes per character in
the DB and - afaik - 4 bytes per character in memory because of the way the
memory buffer is allocated.

> now let thing i target such country (portugal, spain, france, italian,
> etc..)
> what kind of charset will best fit my database ? of course UTF8 ! but is
> it neccessary that when i declare a varchar(255) for my varchar column
> firebird handle in background a varchar(1250) ?? absolutely not
> as in these language most of the char need only one byte to be encoded
> in UTF8

If you target Portugal, Spain, France and Italy, then ISO-8859-1 should be
enough for your needs.

Firebird in the background does not handle it as VARCHAR(1250) because SQL
standards define that the size of a CHAR/VARCHAR is defined in
codepoints/characters, not bytes. It can however use upto 1020 bytes with
UTF8 as opposed to 255 bytes with a single byte characterset.

> now you will say me: is their any penalty for this ? after all varchar
> column
> are compressed ?

Unfortunately, as Ann indicates, the RLE used by Firebird is per byte, and
not per character. This means that the compression is less efficient then
with single byte charactersets because of the way codepoints above 127 are
encoded, and I believe that there remaining 0x00 bytes at the end of the
string are also RLE encoded and stored (which I think is something which
could and should be changed).

> yep, their is and i do some bechnmark to prouve it
>
> 1rt i create 2 database, one in iso8859_1 and the other in UTF8

> ...

> and to finish i add (only) 64000 reccords in both table (only with
> varchar containing ascii between a..z)

What is the exact content and length and are their repeating characters in
it?

> you can use my bench tool software to do the test
> http://sourceforge.net/projects/alcinoe/files/alsqlbenchmark/1.01/
>
> at the end:
> the size of the test_iso88591.fdb is 264 MB
> the size of the test_utf8.fdb is 403 MB
>
> so the utf8 database is around 35% more bigger than the ISO8859_1

database!

What happens when you backup and restore the databases?

> to bench the speed, you must take a great care of the windows cache, the

> fragmentation of the file, etc.. not so easy to do ...
>
> but for exemple simply do
>
> select count(*) from TEST_A
> in iso8859_1: 212 ms
> in utf8: 382 ms !!! UP to 80% more slower !!!!

I am not 100%, but this probably has to do with the fact that Firebird
will need to allocate larger buffers in memory for UTF-8 characters.

> so you understand the big probleme is that firebird use 4 bytes (or 3 i
> m still not sure) to handle UTF8 char
> even when 1 bytes will be enalf for 80% of the char
>
> I even not speak about the index size that in utf8 are 4 time more
> smaller than in ISO8859_1 !
>
> how to correct this ? FIRST UTF8 is ASCII ! consider UTF8 like html
> where char like é are encoded in é it's still
> just ASCII at the base !

UTF8 is not ASCII, it is just that the first 127 codepoints have the exact
same encoding as ASCII.

> when i declare in utf8 varchar(250) i want to reserve 250 bytes not 1000

> bytes and i know (like for exemple in html)
> that some char can be encoded in more than one byte! if i know that i
> will handle russian char, i will set UP
> as varchar(750) and if i know that i will handle only latin language i
> will set up to varchar(300) ...
> this setup must be done only by the database administrator ...

That is not how it is supposed to work. You define the size in characters
as specified in the standards, not in byte. If that is a problem for you,
you should look at using CHARACTER SET OCTETS and handle your own decoding
and encoding from and to UTF8.

I think things should (and probably can) be improved, but not by breaking
the SQL standards.

Mark