Subject | Re: UTF8 database : how to speed up the database by up to 100% !
---|---
Author | ma_golyo
Post date | 2012-01-06T07:02:33Z
Hi!
You must have tested something wrong. Our test on a real database shows no such thing (and not only with ASCII, but with Hungarian characters too).
Size : WIN1250 : 2.246 M, UTF8 : 2.418 M
Speed : Nothing changed, same speed.
--- In firebird-support@yahoogroups.com, Vander Clock Stephane <svanderclock@...> wrote:
>
> Hello,
>
> Why a UTF8 database is not a good choice.
>
> Read this post carefully and you will see how to speed up your
> database by up to 100% (if you use UTF8, of course).
>
> I did some benchmarks of a UTF8 database vs an ISO8859_1 database.
> I think the way Firebird handles UTF8 databases is not optimal.
>
> Let's talk first about UTF8.
> UTF8 is just a way to encode special characters like è, à, etc.
> For that, UTF8 uses combinations of bytes above ASCII #127.
> This way (and it is not the least of its advantages) UTF8 stays
> compatible with all software that works with 8-bit strings; thanks to
> this we can even use a Firebird database defined as ISO8859_1 to store
> UTF8.
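>
> You can see the variable-length nature of the encoding from SQL itself; a minimal sketch (assuming the connection character set is UTF8; CHAR_LENGTH and OCTET_LENGTH exist since Firebird 2.1):
>
> SELECT CHAR_LENGTH(_UTF8 'été') AS CHARS,   -- 3 characters
>        OCTET_LENGTH(_UTF8 'été') AS BYTES   -- 5 bytes: é = 2 bytes, t = 1 byte
> FROM RDB$DATABASE;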
>
> Now, the wrong choice about UTF8 in Firebird:
> Firebird considers that it needs 4 bytes (to confirm; I was thinking 3)
> to store 1 UTF8 character. That is true when we speak about Cyrillic
> characters, for example, but not when we speak about Latin languages
> (French, Italian, Spanish, etc.). In these languages most characters
> (a, b, c, etc.) need only one byte in UTF8, and only around 20% of the
> characters (é, à, etc.) need 2 bytes.
>
> Now let's say I target such countries (Portugal, Spain, France, Italy,
> etc.). What kind of charset will best fit my database? Of course UTF8!
> But is it necessary that when I declare a VARCHAR(255) column,
> Firebird handles in the background a buffer of 1020 bytes (255 * 4)?
> Absolutely not, since in these languages most characters need only one
> byte to be encoded in UTF8.
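>
> That reserved byte length is visible in the system tables; a minimal sketch (the table and column names are made up for illustration):
>
> -- In a UTF8 database:
> CREATE TABLE CHARSET_DEMO (TXT VARCHAR(255));
>
> -- RDB$CHARACTER_LENGTH is the declared length in characters (255),
> -- RDB$FIELD_LENGTH is the reserved length in bytes (1020 in UTF8,
> -- 255 in an ISO8859_1 database).
> SELECT F.RDB$CHARACTER_LENGTH, F.RDB$FIELD_LENGTH
> FROM RDB$RELATION_FIELDS RF
> JOIN RDB$FIELDS F ON F.RDB$FIELD_NAME = RF.RDB$FIELD_SOURCE
> WHERE RF.RDB$RELATION_NAME = 'CHARSET_DEMO'
>   AND RF.RDB$FIELD_NAME = 'TXT';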
>
> Now you will say: is there any penalty for this? After all, VARCHAR
> columns are compressed?
>
> Yes, there is, and I did some benchmarks to prove it.
>
> First I created 2 databases, one in ISO8859_1 and the other in UTF8:
>
> CREATE DATABASE 'c:\test_utf8.fdb'
> USER 'sysdba'
> PASSWORD 'masterkey'
> PAGE_SIZE 4096
> DEFAULT CHARACTER SET UTF8;
>
> CREATE DATABASE 'c:\test_iso88591.fdb'
> USER 'sysdba'
> PASSWORD 'masterkey'
> PAGE_SIZE 4096
> DEFAULT CHARACTER SET ISO8859_1;
>
> Then I added 2 simple tables to each database:
>
> CREATE TABLE TEST_A(
> DATA1 VARCHAR(2000),
> DATA2 VARCHAR(2000),
> DATA3 VARCHAR(2000),
> DATA4 VARCHAR(2000),
> DATA5 VARCHAR(2000),
> DATA6 VARCHAR(5000)
> );
>
> CREATE TABLE TEST_B(
> DATA01 VARCHAR(100),
> DATA02 VARCHAR(100),
> DATA03 VARCHAR(100),
> DATA04 VARCHAR(100),
> DATA05 VARCHAR(100),
> DATA06 VARCHAR(100),
> DATA07 VARCHAR(100),
> DATA08 VARCHAR(100),
> DATA09 VARCHAR(100),
> DATA10 VARCHAR(100),
> DATA11 VARCHAR(100),
> DATA12 VARCHAR(100),
> DATA13 VARCHAR(100),
> DATA14 VARCHAR(100),
> DATA15 VARCHAR(100),
> DATA16 VARCHAR(100),
> DATA17 VARCHAR(100),
> DATA18 VARCHAR(100),
> DATA19 VARCHAR(100),
> DATA20 VARCHAR(100),
> DATA21 VARCHAR(100),
> DATA22 VARCHAR(100),
> DATA23 VARCHAR(100),
> DATA24 VARCHAR(100),
> DATA25 VARCHAR(100),
> DATA26 VARCHAR(100),
> DATA27 VARCHAR(100),
> DATA28 VARCHAR(100),
> DATA29 VARCHAR(100),
> DATA30 VARCHAR(100),
> DATA31 VARCHAR(100),
> DATA32 VARCHAR(100),
> DATA33 VARCHAR(100),
> DATA34 VARCHAR(100),
> DATA35 VARCHAR(100),
> DATA36 VARCHAR(100),
> DATA37 VARCHAR(100),
> DATA38 VARCHAR(100),
> DATA39 VARCHAR(100),
> DATA40 VARCHAR(100),
> DATA41 VARCHAR(100),
> DATA42 VARCHAR(100),
> DATA43 VARCHAR(100),
> DATA44 VARCHAR(100),
> DATA45 VARCHAR(100),
> DATA46 VARCHAR(100),
> DATA47 VARCHAR(100),
> DATA48 VARCHAR(100),
> DATA49 VARCHAR(100),
> DATA50 VARCHAR(100)
> );
>
> And to finish, I added (only) 64000 records to both tables (with the
> VARCHAR columns containing only ASCII characters between a..z); a
> sketch of such a load follows the link below.
>
> You can use my benchmark tool to do the test:
> http://sourceforge.net/projects/alcinoe/files/alsqlbenchmark/1.01/
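>
> If you prefer plain SQL, a minimal sketch of an equivalent load for TEST_A, run from isql (the a..z filler and the row count follow the description above; the rest is just one possible way to do it, and TEST_B would be filled the same way):
>
> SET TERM ^ ;
> EXECUTE BLOCK AS
>   DECLARE I INTEGER = 0;
>   DECLARE FILLER VARCHAR(2000);
> BEGIN
>   -- Repeat the lowercase alphabet so every column holds plain ASCII a..z.
>   FILLER = RPAD('', 2000, 'abcdefghijklmnopqrstuvwxyz');
>   WHILE (I < 64000) DO
>   BEGIN
>     INSERT INTO TEST_A (DATA1, DATA2, DATA3, DATA4, DATA5, DATA6)
>     VALUES (:FILLER, :FILLER, :FILLER, :FILLER, :FILLER, :FILLER);
>     I = I + 1;
>   END
> END^
> SET TERM ; ^
> COMMIT;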
>
> At the end:
> the size of test_iso88591.fdb is 264 MB
> the size of test_utf8.fdb is 403 MB
>
> So the UTF8 database is around 50% bigger than the ISO8859_1 database
> (put the other way, the ISO8859_1 file is about 35% smaller)!
>
> This difference seems to increase with the amount of data (number of
> VARCHAR columns) in the database.
>
> The fact is that the UTF8 database can be up to 100% bigger than the
> ISO8859_1 database for the SAME DATA (low ASCII characters)!
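>
> You can check the allocated size from SQL itself via the monitoring tables (Firebird 2.1 and later; a minimal sketch):
>
> -- Number of allocated pages times the page size = allocated size on disk.
> SELECT MON$PAGE_SIZE * MON$PAGES AS DB_BYTES
> FROM MON$DATABASE;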
>
> Now let's talk about the speed.
>
> In fact, no surprise at all: a bigger file means more I/O, means more
> cache, means more memory, means slower :(
>
> To benchmark the speed you must take great care of the Windows cache,
> the fragmentation of the file, etc. Not so easy to do ...
>
> But for example, simply do:
>
> select count(*) from TEST_A
> in ISO8859_1: 212 ms
> in UTF8: 382 ms !!! Up to 80% slower !!!!
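>
> To see where the time goes, isql can print per-statement statistics (elapsed time, reads, writes, fetches); a minimal sketch:
>
> SET STATS ON;
> SELECT COUNT(*) FROM TEST_A;
> -- Compare the reads/fetches between the two databases: the larger
> -- UTF8 file needs more data pages to scan the same rows.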
>
> So you understand, the big problem is that Firebird uses 4 bytes (or
> 3, I am still not sure) to handle a UTF8 character,
> even when 1 byte would be enough for 80% of the characters :(
>
> I have not even mentioned the maximum index key size, which in UTF8 is
> 4 times smaller than in ISO8859_1!
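>
> To illustrate (a sketch only; the table, column and index names are made up): with PAGE_SIZE 4096 the maximum index key is roughly a quarter of a page, and in UTF8 every character counts as 4 bytes against that limit:
>
> CREATE TABLE TEST_C (CODE VARCHAR(300));
>
> -- Accepted in the ISO8859_1 database (300 key bytes), but rejected in
> -- the UTF8 database ("key size exceeds implementation restriction")
> -- because the 300 characters are counted as 1200 bytes.
> CREATE INDEX IDX_TEST_C_CODE ON TEST_C (CODE);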
>
> How to correct this? FIRST, UTF8 is byte-compatible with ASCII! Think
> of UTF8 like HTML, where a character like é is encoded as &eacute;: at
> the base it is still just ASCII!
>
> When I declare VARCHAR(250) in UTF8, I want to reserve 250 bytes, not
> 1000 bytes, and I know (as, for example, in HTML) that some characters
> can be encoded in more than one byte! If I know that I will handle
> Russian characters, I will set it up as VARCHAR(750), and if I know
> that I will handle only Latin languages I will set it up as
> VARCHAR(300) ...
> This setup should be done only by the database administrator ...
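>
> The closest you can get to that kind of per-column control today is to override the character set per column in an otherwise UTF8 database; a minimal sketch (hypothetical table):
>
> CREATE TABLE TEST_D (
>   CODE VARCHAR(250) CHARACTER SET ISO8859_1,  -- 250 bytes reserved
>   NAME VARCHAR(250)                           -- default UTF8: 1000 bytes reserved
> );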
>
> stéphane
>