| Subject | UTF8 database: how to speed up the database by up to 100%! |
|---|---|
| Author | Vander Clock Stephane |
| Post date | 2012-01-05T12:55:44Z |
Hello,
Why a UTF8 database is not a good choice: read this post carefully and you will see how to speed up your database by up to 100% (if you use UTF8, of course).
I ran some benchmarks comparing a UTF8 database with an ISO8859_1 database, and I think the way Firebird handles UTF8 databases is not optimal.
Let's first talk about UTF8 itself. UTF8 is just a way to encode special characters like è, à, etc. To do this, UTF8 uses combinations of bytes above ASCII #127. This way, and it is no small thing, UTF8 stays compatible with all software that works with 8-bit strings. Thanks to this, we can even use a Firebird database defined as ISO8859_1 to store UTF8 data.
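To make the byte widths concrete, here is a small Python sketch (not part of the original post) printing how many bytes some common characters occupy once encoded as UTF-8:

```python
# Byte width of various characters when encoded as UTF-8:
# plain ASCII takes 1 byte, Latin accents 2, the euro sign 3,
# and only characters outside the Basic Multilingual Plane need 4.
for ch in ["a", "é", "д", "€", "😀"]:
    print(ch, "->", len(ch.encode("utf-8")), "byte(s)")
```

Running it shows 1, 2, 2, 3 and 4 bytes respectively — the 4-byte worst case is rare in practice.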
Now, the wrong choice about UTF8 in Firebird: Firebird considers that it needs 4 bytes (to be confirmed; I thought it was 3) to store one UTF8 character. That is the UTF8 worst case (even Cyrillic characters need only 2 bytes), and it is certainly not what Latin languages (French, Italian, Spanish, etc.) need. In these languages most characters (a, b, c, etc.) need only one byte in UTF8, and only around 20% of them (é, à, etc.) need 2 bytes.
Now suppose I target such countries (Portugal, Spain, France, Italy, etc.). What charset will best fit my database? UTF8, of course! But is it necessary that when I declare a VARCHAR(255) column, Firebird handles a VARCHAR of 1020 bytes in the background? Absolutely not, since in these languages most characters need only one byte to be encoded in UTF8.
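A quick Python sketch of the gap being described: what a UTF8 VARCHAR(255) reserves at 4 bytes per declared character versus what typical French text actually needs (the sample text is illustrative):

```python
# Firebird's UTF8 charset reserves 4 bytes per declared character,
# so a VARCHAR(255) column occupies up to 255 * 4 bytes.
reserved = 255 * 4
print("reserved for VARCHAR(255):", reserved, "bytes")  # 1020 bytes

# Typical Latin-language text needs far less: mostly 1 byte per char,
# with only the accented letters taking 2 bytes.
text = "déjà vu, café, résumé"
print(len(text), "chars ->", len(text.encode("utf-8")), "bytes")
```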
Now you will tell me: is there any penalty for this? After all, VARCHAR columns are compressed, aren't they? Yes, there is, and I ran some benchmarks to prove it.
First, I create 2 databases, one in ISO8859_1 and the other in UTF8:
CREATE DATABASE 'c:\test_utf8.fdb'
USER 'sysdba'
PASSWORD 'masterkey'
PAGE_SIZE 4096
DEFAULT CHARACTER SET UTF8;
CREATE DATABASE 'c:\test_iso88591.fdb'
USER 'sysdba'
PASSWORD 'masterkey'
PAGE_SIZE 4096
DEFAULT CHARACTER SET ISO8859_1;
Then I add the same 2 simple tables in each database:
CREATE TABLE TEST_A(
DATA1 VARCHAR(2000),
DATA2 VARCHAR(2000),
DATA3 VARCHAR(2000),
DATA4 VARCHAR(2000),
DATA5 VARCHAR(2000),
DATA6 VARCHAR(5000)
);
CREATE TABLE TEST_B(
DATA01 VARCHAR(100),
DATA02 VARCHAR(100),
DATA03 VARCHAR(100),
DATA04 VARCHAR(100),
DATA05 VARCHAR(100),
DATA06 VARCHAR(100),
DATA07 VARCHAR(100),
DATA08 VARCHAR(100),
DATA09 VARCHAR(100),
DATA10 VARCHAR(100),
DATA11 VARCHAR(100),
DATA12 VARCHAR(100),
DATA13 VARCHAR(100),
DATA14 VARCHAR(100),
DATA15 VARCHAR(100),
DATA16 VARCHAR(100),
DATA17 VARCHAR(100),
DATA18 VARCHAR(100),
DATA19 VARCHAR(100),
DATA20 VARCHAR(100),
DATA21 VARCHAR(100),
DATA22 VARCHAR(100),
DATA23 VARCHAR(100),
DATA24 VARCHAR(100),
DATA25 VARCHAR(100),
DATA26 VARCHAR(100),
DATA27 VARCHAR(100),
DATA28 VARCHAR(100),
DATA29 VARCHAR(100),
DATA30 VARCHAR(100),
DATA31 VARCHAR(100),
DATA32 VARCHAR(100),
DATA33 VARCHAR(100),
DATA34 VARCHAR(100),
DATA35 VARCHAR(100),
DATA36 VARCHAR(100),
DATA37 VARCHAR(100),
DATA38 VARCHAR(100),
DATA39 VARCHAR(100),
DATA40 VARCHAR(100),
DATA41 VARCHAR(100),
DATA42 VARCHAR(100),
DATA43 VARCHAR(100),
DATA44 VARCHAR(100),
DATA45 VARCHAR(100),
DATA46 VARCHAR(100),
DATA47 VARCHAR(100),
DATA48 VARCHAR(100),
DATA49 VARCHAR(100),
DATA50 VARCHAR(100)
);
And finally, I add (only) 64,000 records to both tables (with the VARCHAR columns containing only ASCII characters between a..z). You can use my benchmark tool to reproduce the test:
http://sourceforge.net/projects/alcinoe/files/alsqlbenchmark/1.01/
At the end:
the size of test_iso88591.fdb is 264 MB
the size of test_utf8.fdb is 403 MB
So the UTF8 database is around 50% bigger than the ISO8859_1 database (put the other way, the ISO8859_1 database is about 35% smaller)! This difference seems to grow with the amount of VARCHAR data (number of VARCHAR columns) in the database. In fact, the UTF8 database can be up to 100% bigger than the ISO8859_1 database for the SAME DATA (low-ASCII characters)!
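A quick sanity check on these file sizes (variable names are mine, Python):

```python
iso_mb, utf8_mb = 264, 403
# The UTF8 file is ~53% larger than the ISO8859_1 one...
print(round((utf8_mb / iso_mb - 1) * 100), "% larger")
# ...or, the other way around, the ISO8859_1 file is
# roughly a third smaller than the UTF8 one.
print(round((1 - iso_mb / utf8_mb) * 100), "% smaller")
```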
Now let's talk about speed. No surprise here: a bigger file means more I/O, which means more cache, more memory, and ultimately slower queries :(
To benchmark the speed you must take great care with the Windows cache, file fragmentation, etc., which is not so easy to do. But, for example, simply running

select count(*) from TEST_A

gives:
in ISO8859_1: 212 ms
in UTF8: 382 ms — up to 80% slower!
So you understand: the big problem is that Firebird uses 4 bytes (or 3, I'm still not sure) to handle a UTF8 character, even when 1 byte would be enough for 80% of the characters :( And I haven't even mentioned the maximum index key size, which in UTF8 is 4 times smaller than in ISO8859_1!
How to correct this? First of all, UTF8 is ASCII-compatible! Think of UTF8 like HTML, where a character such as é is encoded as &#233; — it is still just ASCII at the base! When I declare VARCHAR(250) in UTF8, I want to reserve 250 bytes, not 1000 bytes, and I know (as with HTML, for example) that some characters can be encoded in more than one byte. If I know that I will handle Russian characters, I will declare VARCHAR(750), and if I know that I will handle only Latin languages, I will declare VARCHAR(300)...
This sizing decision should be left to the database administrator...
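The proposal above amounts to sizing a UTF8 column by a byte budget instead of a worst-case character count. A minimal sketch of the idea (the helper name and the average-width figures are illustrative, not a Firebird API):

```python
import math

def varchar_bytes(chars: int, avg_bytes_per_char: float) -> int:
    """Bytes to reserve for `chars` characters at a chosen average width."""
    return math.ceil(chars * avg_bytes_per_char)

# Latin languages: ~80% of characters fit in 1 byte, ~20% need 2,
# so ~1.2 bytes/char on average -> the post's VARCHAR(300) for 250 chars.
print(varchar_bytes(250, 1.2))  # 300
# Russian text: the post's suggested 3x margin -> VARCHAR(750).
print(varchar_bytes(250, 3))    # 750
```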
stéphane