Subject Re: [firebird-support] Re: UTF8 database : how to speed up the database by up to 100% !
Author Vander Clock Stephane
My database is in front of me ! the size is well written ... No mistake
possible
for the same amount of bytes written in both database, the utf8 can be
from 20% to up to 100% more bigger than
the ISO88591 database ... after yes the speed test can be influenced by
lot of factor but size not !
and bigger size = bigger i/o = more slower that is a fact :)

On 1/6/2012 11:02 AM, ma_golyo wrote:
>
> Hi!
>
> You tested something wrong. Our real database test shows no such
> thing. (not only ascii, but hungarian characters too)
>
> Size : WIN1250 : 2.246 M, UTF8 : 2.418 M
> Speed : Nothing changed, same speed.
>
> --- In firebird-support@yahoogroups.com
> <mailto:firebird-support%40yahoogroups.com>, Vander Clock Stephane
> <svanderclock@...> wrote:
> >
> > Hello,
> >
> > Why UTF8 database is not a good choice
> >
> > read carrefully this post and you will see how to speed by up
> > to 100% you database (if you use UTF8 off course)
> >
> > I do some benchmark of UTF8 database vs ISO8859_1 database.
> > I think the way that Firbird handle utf8 database is not optimal
> >
> > let speak first about the utf8
> > UTF8 it's just a way to encode special character like è à etc ..
> > for this utf8 will use combination of char upper than ascii #127
> > In this way, and it's not the less, UTF8 stay compatible with all
> > software that work with 8 bit string. thank to this we can use
> > a firebird database defined as iso8859_1 to store UTF8
> >
> > Now The wrong choice about utf8 in firebird
> > Firebird consider than it need 4 bytes (to confirm i was thinking 3)
> > to store 1 utf8 character that true when we speak about cyrillic
> > char for exemple but not when we speak about latin language (french,
> > italian,
> > spanish, etc.). In these language most of the char need only one byte
> > in utf8 (a, b, c, etc.) and only around 20% (é, à , etc.) of the
> > char need 2 bytes in UTF8
> >
> > now let thing i target such country (portugal, spain, france, italian,
> > etc..)
> > what kind of charset will best fit my database ? of course UTF8 ! but is
> > it neccessary that when i declare a varchar(255) for my varchar column
> > firebird handle in background a varchar(1250) ?? absolutely not
> > as in these language most of the char need only one byte to be encoded
> > in UTF8
> >
> > now you will say me: is their any penalty for this ? after all varchar
> > column
> > are compressed ?
> >
> > yep, their is and i do some bechnmark to prouve it
> >
> > 1rt i create 2 database, one in iso8859_1 and the other in UTF8
> >
> > CREATE DATABASE 'c:\test_utf8.fdb'
> > USER 'sysdba'
> > PASSWORD 'masterkey'
> > PAGE_SIZE 4096
> > DEFAULT CHARACTER SET UTF8;
> >
> > CREATE DATABASE 'c:\test_iso88591.fdb'
> > USER 'sysdba'
> > PASSWORD 'masterkey'
> > PAGE_SIZE 4096
> > DEFAULT CHARACTER SET ISO8859_1;
> >
> > after i add in each database 2 simple table
> >
> > CREATE TABLE TEST_A(
> > DATA1 VARCHAR(2000),
> > DATA2 VARCHAR(2000),
> > DATA3 VARCHAR(2000),
> > DATA4 VARCHAR(2000),
> > DATA5 VARCHAR(2000),
> > DATA6 VARCHAR(5000)
> > );
> >
> > CREATE TABLE TEST_b(
> > DATA01 VARCHAR(100),
> > DATA02 VARCHAR(100),
> > DATA03 VARCHAR(100),
> > DATA04 VARCHAR(100),
> > DATA05 VARCHAR(100),
> > DATA06 VARCHAR(100),
> > DATA07 VARCHAR(100),
> > DATA08 VARCHAR(100),
> > DATA09 VARCHAR(100),
> > DATA10 VARCHAR(100),
> > DATA11 VARCHAR(100),
> > DATA12 VARCHAR(100),
> > DATA13 VARCHAR(100),
> > DATA14 VARCHAR(100),
> > DATA15 VARCHAR(100),
> > DATA16 VARCHAR(100),
> > DATA17 VARCHAR(100),
> > DATA18 VARCHAR(100),
> > DATA19 VARCHAR(100),
> > DATA20 VARCHAR(100),
> > DATA21 VARCHAR(100),
> > DATA22 VARCHAR(100),
> > DATA23 VARCHAR(100),
> > DATA24 VARCHAR(100),
> > DATA25 VARCHAR(100),
> > DATA26 VARCHAR(100),
> > DATA27 VARCHAR(100),
> > DATA28 VARCHAR(100),
> > DATA29 VARCHAR(100),
> > DATA30 VARCHAR(100),
> > DATA31 VARCHAR(100),
> > DATA32 VARCHAR(100),
> > DATA33 VARCHAR(100),
> > DATA34 VARCHAR(100),
> > DATA35 VARCHAR(100),
> > DATA36 VARCHAR(100),
> > DATA37 VARCHAR(100),
> > DATA38 VARCHAR(100),
> > DATA39 VARCHAR(100),
> > DATA40 VARCHAR(100),
> > DATA41 VARCHAR(100),
> > DATA42 VARCHAR(100),
> > DATA43 VARCHAR(100),
> > DATA44 VARCHAR(100),
> > DATA45 VARCHAR(100),
> > DATA46 VARCHAR(100),
> > DATA47 VARCHAR(100),
> > DATA48 VARCHAR(100),
> > DATA49 VARCHAR(100),
> > DATA50 VARCHAR(100)
> > );
> >
> > and to finish i add (only) 64000 reccords in both table (only with
> > varchar containing ascii between a..z)
> >
> > you can use my bench tool software to do the test
> > http://sourceforge.net/projects/alcinoe/files/alsqlbenchmark/1.01/
> >
> > at the end:
> > the size of the test_iso88591.fdb is 264 MB
> > the size of the test_utf8.fdb is 403 MB
> >
> > so the utf8 database is around 35% more bigger than the ISO8859_1
> database!
> >
> > this difference seam to increase with the amount of data (number of
> > varchar column) in the database
> >
> > the fact is that the UTF8 database can be up to 100% more bigger in
> size
> > than the ISO8859_1 database for the SAME DATA (low ascii char) !
> >
> > not let speak about the speed
> >
> > in fact no surprise at all, more bigger file size mean more I/O, mean
> > more cache, mean more memory, mean more slower :(
> >
> > to bench the speed, you must take a great care of the windows cache,
> the
> > fragmentation of the file, etc.. not so easy to do ...
> >
> > but for exemple simply do
> >
> > select count(*) from TEST_A
> > in iso8859_1: 212 ms
> > in utf8: 382 ms !!! UP to 80% more slower !!!!
> >
> > so you understand the big probleme is that firebird use 4 bytes (or 3 i
> > m still not sure) to handle UTF8 char
> > even when 1 bytes will be enalf for 80% of the char :(
> >
> > I even not speak about the index size that in utf8 are 4 time more
> > smaller than in ISO8859_1 !
> >
> > how to correct this ? FIRST UTF8 is ASCII ! consider UTF8 like html
> > where char like é are encoded in é it's still
> > just ASCII at the base !
> >
> > when i declare in utf8 varchar(250) i want to reserve 250 bytes not
> 1000
> > bytes and i know (like for exemple in html)
> > that some char can be encoded in more than one byte! if i know that i
> > will handle russian char, i will set UP
> > as varchar(750) and if i know that i will handle only latin language i
> > will set up to varchar(300) ...
> > this setup must be done only by the database administrator ...
> >
> > stéphane
> >
>
>


[Non-text portions of this message have been removed]