Subject | Re: [firebird-support] UTF8 in firebird ? |
---|---|
Author | Ann Harrison |
Post date | 2012-01-05T19:31:24Z |
Stéphane,
> I want to know if UTF8 is a good in Firebird so i do some
> tests. can you gave me your opinion ?
Firebird's compression algorithm was designed before anyone had
thought about variable length data encoding, and the combination of
fixed length allocations and run-length compression is less than
optimal with UTF8.
I almost always get this wrong, but my recollection is that there's
one byte of length for every 127 characters. A run of 127 identical
bytes compresses to two bytes, so the three hundred unused bytes in
your 100 character strings (400 bytes allocated at four bytes per
character, 100 used) turn into 6 bytes when compressed.
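
The arithmetic above can be sketched in a few lines of Python. This is only a simplified model, not Firebird's actual compression code: the 127-byte maximum run and the two-bytes-per-run cost are taken from the recollection above, and the function name and parameters are made up for illustration.

```python
import math

# Assumptions taken from the recollection above, not from Firebird's source:
MAX_RUN = 127        # longest run a single length byte can describe
BYTES_PER_RUN = 2    # one length byte plus the one repeated byte

def compressed_padding_size(declared_chars, used_bytes, bytes_per_char=4):
    """Return (unused_bytes, compressed_size_of_unused_bytes).

    declared_chars: declared column length in characters
    used_bytes:     bytes actually occupied by the stored string
    bytes_per_char: bytes reserved per character (4 for UTF8)
    """
    allocated = declared_chars * bytes_per_char
    unused = max(allocated - used_bytes, 0)
    runs = math.ceil(unused / MAX_RUN)   # each run covers at most MAX_RUN bytes
    return unused, runs * BYTES_PER_RUN

# A 100 character UTF8 column holding 100 single-byte characters
unused, compressed = compressed_padding_size(100, 100)
print(unused, compressed)  # 300 6
```

So under these assumptions the padding costs a handful of bytes per string rather than hundreds, which is why the result is "less than optimal" rather than disastrous.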
If you're only handling Western European alphabets, you could probably
use Latin-1 (formally ISO 8859-1; the Firebird character set is ISO8859_1).
If you need Greek and Cyrillic alphabets, the choices are harder.
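
To see why the choice gets harder, here is a small Python sketch comparing encoded sizes. It uses Python's own codecs rather than anything in Firebird, and the helper name is hypothetical. Latin-1 stores each Western European character in one byte but cannot represent Greek or Cyrillic at all, while UTF8 handles everything at the cost of more bytes per character.

```python
def encoded_sizes(text):
    """Return (latin1_bytes, utf8_bytes); latin1_bytes is None when the
    text cannot be represented in Latin-1 (ISO 8859-1)."""
    utf8_len = len(text.encode("utf-8"))
    try:
        latin1_len = len(text.encode("latin-1"))
    except UnicodeEncodeError:
        latin1_len = None   # e.g. Greek or Cyrillic characters
    return latin1_len, utf8_len

for sample in ("Stéphane", "Ελλάδα", "Москва"):
    print(sample, encoded_sizes(sample))
```

A single-byte character set exists for Greek and another for Cyrillic, but each covers only its own alphabet, so mixing them in one column is where UTF8 becomes hard to avoid.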
Blobs are stored at their actual length, so they won't suffer from
UTF8 inflation, but there's the overhead of the blob pointer and blob
header, so that's an awkward solution.
The longer term solution is for the Firebird project to look at its
data representation and find something that works better with UTF8.
Good luck,
Ann