Subject Re: [firebird-support] UTF8 in firebird ?
Author Ann Harrison
Dear Geoff,
>
> I am far from convinced that your testing reveals real-world
> differences between the current UTF8 implementation vs any
> practical alternative (which neither ISO_8859 nor OCTETS
> represent).

Stephane's tests show that carrying a lot of extra space around in
strings slows Firebird somewhat. I think his example was unusually
severe because he overestimates the number of characters his fields
actually need.

One of his test tables contains 5 varchar[2000] columns and one
varchar[5000] column. The other has 3 varchar[100] columns. He said
"the utf8 database is around 35% more bigger than the ISO8859_1
database! this difference seems to increase with the amount of data
(number of varchar columns) in the database." That's quite plausible,
given the way Firebird's RLE works. Runs of the same byte value
compress well; stretches of mixed bytes do not. Because a UTF8
character can occupy up to four bytes, a varchar[100] reserves 400
bytes in the record image, so 100 characters of single-byte text are
stored as 100 mixed data bytes followed by 300 identical padding
bytes.
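
To see why, here is a rough sketch in C of a signed-control-byte RLE,
similar in spirit to Firebird's record compression but not a copy of
it - the run limits, thresholds, and padding value below are my own
choices for illustration. A positive control byte says how many
literal bytes follow; a negative one says how many times the next
byte repeats. It compresses a 400-byte UTF8 varchar[100] image and,
for comparison, the 100 bytes a single-byte charset would store:

    /* Sketch of a signed-control-byte RLE: positive n = n literal
     * bytes follow, negative -n = next byte repeated n times.
     * Illustrative only, not Firebird's exact implementation. */
    #include <stdio.h>
    #include <string.h>

    static size_t rle_compress(const unsigned char *in, size_t len,
                               unsigned char *out)
    {
        size_t ip = 0, op = 0;
        while (ip < len) {
            /* Measure the run of identical bytes starting here. */
            size_t run = 1;
            while (ip + run < len && in[ip + run] == in[ip] && run < 127)
                run++;
            if (run >= 3) {
                /* Worth a repeat: negative count plus one data byte. */
                out[op++] = (unsigned char)(-(signed char)run);
                out[op++] = in[ip];
                ip += run;
            } else {
                /* Collect literals until the next repeat-worthy run. */
                size_t start = ip, lits = 0;
                while (ip < len && lits < 126) {
                    size_t r = 1;
                    while (ip + r < len && in[ip + r] == in[ip] && r < 127)
                        r++;
                    if (r >= 3)
                        break;
                    ip += r;
                    lits += r;
                }
                out[op++] = (unsigned char)lits;
                memcpy(out + op, in + start, lits);
                op += lits;
            }
        }
        return op;
    }

    int main(void)
    {
        unsigned char rec[400], out[800];
        size_t i;

        /* UTF8 varchar[100] holding 100 single-byte characters:
         * 100 mixed data bytes, then 300 identical padding bytes
         * (zero here; the exact value doesn't matter). */
        for (i = 0; i < 100; i++)
            rec[i] = (unsigned char)('A' + (i % 26));
        memset(rec + 100, 0, 300);
        printf("utf8 varchar[100]: %zu -> %zu bytes\n",
               sizeof rec, rle_compress(rec, sizeof rec, out));

        /* The same data in a single-byte charset: 100 bytes, no
         * padding, nothing much for RLE to do. */
        printf("single-byte varchar[100]: %d -> %zu bytes\n",
               100, rle_compress(rec, 100, out));
        return 0;
    }

The padding compresses, but the mixed data bytes gain nothing, extra
control bytes creep in, and every padded field still has to be
squeezed and unsqueezed on every read and write.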

A modern database ought to be able to deal with UTF8 better than that.
Firebird's RLE implementation was written in 1984, before multi-byte
characters were common, before variable length encodings, and when
disk size limits dictated more compact data.

One alternative is to go to an actual variable length encoding which I
believe Jim described on the architecture list about five years ago.
Each field starts with an encoding byte, then some number of length
bytes (possibly zero) and some number of data bytes (also possibly
zero). Binary numbers between -10 and 30 (or so) are completely
described in the encoding byte. Because data is stored at its actual
length, there's no penalty for using UTF-8 for Western European
alphabets. However, variable length encoding changes all the data
handling in Firebird and would probably also require keeping a vector
of field offsets in record buffers to avoid having to decode the
entire record to find the value of a particular field.
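
As a concrete illustration, here is a minimal sketch in C of that
kind of self-describing field encoding. The particular encoding-byte
ranges (ENC_SMALL_INT, ENC_INT, ENC_STRING) and the little-endian
layout are invented for the example, not taken from Jim's proposal;
the point is only that each field costs one encoding byte plus
exactly as many length and data bytes as the value needs:

    #include <stdint.h>
    #include <stdio.h>
    #include <string.h>

    enum {
        ENC_SMALL_INT = 0x00, /* 0x00..0x28: value is (byte - 10), i.e. -10..30 */
        ENC_INT       = 0x40, /* 0x40 + n: n-byte little-endian integer follows */
        ENC_STRING    = 0x50  /* 0x50 + n: n length bytes, then that many UTF8 bytes */
    };

    /* Append an integer field; small values live entirely in the encoding byte. */
    static size_t put_int(uint8_t *out, int64_t v)
    {
        if (v >= -10 && v <= 30) {
            out[0] = (uint8_t)(ENC_SMALL_INT + (v + 10));
            return 1;
        }
        /* Find the smallest number of bytes that hold the value. */
        size_t n = 1;
        while (n < 8) {
            int64_t lo = -(1LL << (8 * n - 1));
            int64_t hi =  (1LL << (8 * n - 1)) - 1;
            if (v >= lo && v <= hi)
                break;
            n++;
        }
        out[0] = (uint8_t)(ENC_INT + n);
        for (size_t i = 0; i < n; i++)
            out[1 + i] = (uint8_t)(v >> (8 * i));
        return 1 + n;
    }

    /* Append a UTF8 string field stored at its actual byte length. */
    static size_t put_string(uint8_t *out, const char *s)
    {
        size_t len = strlen(s);
        size_t n = 0, tmp = len;
        while (tmp) { n++; tmp >>= 8; }   /* length bytes needed (0 for "") */
        out[0] = (uint8_t)(ENC_STRING + n);
        for (size_t i = 0; i < n; i++)
            out[1 + i] = (uint8_t)(len >> (8 * i));
        memcpy(out + 1 + n, s, len);
        return 1 + n + len;
    }

    int main(void)
    {
        uint8_t rec[256];
        size_t p = 0;

        p += put_int(rec + p, 7);        /* 1 byte: value in the encoding byte */
        p += put_int(rec + p, 100000);   /* 1 encoding byte + 3 data bytes */
        p += put_string(rec + p, "abc"); /* 1 encoding + 1 length + 3 data bytes */
        p += put_string(rec + p, "");    /* 1 encoding byte, nothing else */

        printf("encoded record: %zu bytes\n", p);
        return 0;
    }

With a layout like this, three characters in a varchar[100] cost five
bytes rather than a 400-byte slot, which is why storing data at its
actual length removes the UTF-8 penalty for Western European text.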

Another alternative is to use a modern compression algorithm, if one
can be found that performs well. Snappy
(http://code.google.com/p/snappy/) from Google is one possibility.
RLE was chosen originally because the cost of compression and
decompression was very low. Other schemes were available at the time,
but their overhead dwarfed the savings in I/O time from storing
compressed data. The balance of I/O and CPU cost is different now,
and a lot of smart people have worked on compression since 1984.
Dropping in a better compression package would be simple compared with
variable length encoding.
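
To give a feel for what "dropping in" might mean, here is a sketch
using Snappy's C binding (snappy-c.h). The snappy_* calls are the
library's real entry points; the compress_record/expand_record
wrappers and the record layout are hypothetical, just to show the
shape of the hook:

    #include <stdio.h>
    #include <stdlib.h>
    #include <snappy-c.h>

    /* Hypothetical replacement for an RLE "squeeze" routine: compress a
     * raw record image into caller-supplied storage, returning the
     * compressed size or 0 on failure. */
    static size_t compress_record(const char *record, size_t length,
                                  char *out, size_t out_capacity)
    {
        size_t out_len = out_capacity;
        if (snappy_compress(record, length, out, &out_len) != SNAPPY_OK)
            return 0;
        return out_len;
    }

    static size_t expand_record(const char *packed, size_t packed_len,
                                char *out, size_t out_capacity)
    {
        size_t out_len = out_capacity;
        if (snappy_uncompress(packed, packed_len, out, &out_len) != SNAPPY_OK)
            return 0;
        return out_len;
    }

    int main(void)
    {
        /* 400-byte image of a UTF8 varchar[100]: 100 bytes of text,
         * 300 bytes of padding. */
        char record[400] = {0};
        for (int i = 0; i < 100; i++)
            record[i] = 'A' + (i % 26);

        size_t cap = snappy_max_compressed_length(sizeof record);
        char *packed = malloc(cap);
        char unpacked[400];
        if (!packed)
            return 1;

        size_t packed_len = compress_record(record, sizeof record, packed, cap);
        size_t restored   = expand_record(packed, packed_len,
                                          unpacked, sizeof unpacked);
        printf("%zu -> %zu -> %zu bytes\n",
               sizeof record, packed_len, restored);

        free(packed);
        return 0;
    }

Whether Snappy, or anything else, actually beats RLE on Firebird's
short, heavily padded records is exactly the sort of thing that would
have to be measured.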

Good luck,

Ann