Subject Re: [firebird-support] UTF8 in firebird ?
Author Ann Harrison
On Thu, Jan 5, 2012 at 3:21 PM, Mark Rotteveel <mark@...> wrote:

>> Now you will ask me: is there any penalty for this? After all, varchar
>> columns are compressed?
>
> Unfortunately, as Ann indicates, the RLE used by Firebird is per byte, and
> not per character. This means that the compression is less efficient than
> with single-byte character sets because of the way codepoints above 127 are
> encoded, and I believe that the remaining 0x00 bytes at the end of the
> string are also RLE encoded and stored (which I think is something which
> could and should be changed).
>

Strings, both char and varchar, are blank filled to their maximum size
using whatever the representation of blank is for the character set.
If there were a character set that stores blanks as two or more bytes
that are not identical, Firebird's RLE would be worthless on it.
Storing strings at their actual length rather than their declared
length would be a very good thing, but hard. Data is kept in memory
in byte arrays that hold fields at their full declared length, with no
tags to indicate separation between fields. Much depends on the fact
that the expansion of the RLE produces a record that has each field at
full length. Lopping off the boring blanks would break that
translation and require some sort of indicator of the start of a new
field.
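
To make the point about blanks concrete, here is a toy sketch in Python.
It is not Firebird's actual compression code or on-disk format: rle_runs
and pad are hypothetical helpers, the run counting is a simplified
byte-wise RLE, and UTF-16LE is used only as a convenient example of an
encoding whose blank is two non-identical bytes (0x20 0x00).

```python
def rle_runs(data: bytes) -> list[tuple[int, int]]:
    """Toy byte-wise RLE: collapse data into (byte_value, run_length) pairs."""
    runs: list[tuple[int, int]] = []
    for b in data:
        if runs and runs[-1][0] == b:
            runs[-1] = (b, runs[-1][1] + 1)   # extend the current run
        else:
            runs.append((b, 1))               # start a new run
    return runs


def pad(text: str, declared_chars: int, encoding: str) -> bytes:
    """Blank-fill text to its declared length in characters, then encode it."""
    return text.ljust(declared_chars).encode(encoding)


# Single-byte blank (0x20): the 17 trailing blanks collapse into one run.
ascii_field = pad("abc", 20, "ascii")
ascii_runs = rle_runs(ascii_field)

# Two-byte blank with non-identical bytes: no two adjacent bytes are
# equal, so every run has length 1 and byte-wise RLE gains nothing.
utf16_field = pad("abc", 20, "utf-16-le")
utf16_runs = rle_runs(utf16_field)

# Fixed-offset record: two fields stored at full declared length with no
# separators. A field is found purely by its offset, which only works
# because each field always occupies its full declared size.
record = pad("abc", 20, "ascii") + pad("xy", 10, "ascii")
first_field, second_field = record[0:20], record[20:30]

print(len(ascii_field), len(ascii_runs))
print(len(utf16_field), len(utf16_runs))
print(second_field.rstrip())
```

If the trailing blanks were dropped from the expanded record, the slice
for the second field would no longer start at offset 20, which is why
some kind of per-field start indicator would be needed.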

Good luck,

Ann