Subject: Re: [Firebird-Architect] Record Encoding
Author: Arno Brinkman
Hi,

> As an experiment, I tried encoding all records (excluding blobs) in one
> of my production databases. The current record structure is similar to
> Firebird, though run length encoding is based on two byte units rather
> than one.

Why two-byte units? I think this won't help much for most data.
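
For illustration, a minimal sketch of the classic one-byte-unit RLE (the names are mine, not actual Firebird code): a signed control byte n announces n literal bytes when positive, or one byte repeated -n times when negative. A two-byte-unit variant applies the same idea to 16-bit words, which halves the control overhead on long runs but wastes space on odd-length or non-repeating data:

    #include <cstdint>
    #include <vector>

    // One-byte-unit RLE: control byte n > 0 means "n literal bytes
    // follow"; n < 0 means "repeat the next byte -n times".
    std::vector<int8_t> rleEncode(const std::vector<uint8_t>& in)
    {
        std::vector<int8_t> out;
        size_t i = 0;
        while (i < in.size())
        {
            // Measure the run of identical bytes starting at i.
            size_t run = 1;
            while (i + run < in.size() && in[i + run] == in[i] && run < 128)
                ++run;
            if (run >= 3)
            {
                out.push_back(static_cast<int8_t>(-static_cast<int>(run)));
                out.push_back(static_cast<int8_t>(in[i]));
                i += run;
            }
            else
            {
                // Copy literals until the next run of 3+ identical bytes.
                size_t start = i;
                while (i < in.size() && (i - start) < 127)
                {
                    if (i + 2 < in.size() && in[i] == in[i + 1] && in[i] == in[i + 2])
                        break;
                    ++i;
                }
                out.push_back(static_cast<int8_t>(i - start));
                for (size_t j = start; j < i; ++j)
                    out.push_back(static_cast<int8_t>(in[j]));
            }
        }
        return out;
    }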

> The database may be more textual than some, but with a good
> sprinkling of numeric data. Virtually all primary and foreign keys
> are 32-bit integers generated by sequences. There are no wildly
> overspecified fields. Since Netfrastructure has fewer semantic
> differences between blobs and text, the scheme is probably slightly
> more blob intensive than a Firebird database. A significant
> difference in data demographics, however, is that fixed length
> strings are all but unheard of in Netfrastructure.
>
> Number of records: 676,643
> Current compressed size (on disk): 74,793,858
> Encoded size (on disk): 46,342,823
> Current decompressed size (in memory): 206,762,788
> Encoded size (in memory): 58,663,007

Sounds interesting; the encoded data already compresses better than the regular RLE (with two-byte
units). Just to be sure I understand your encoding proposal: a value with datatype "64-bit
integer" which holds the decimal value "100" takes up only two bytes (1 for the type and 1 for the data)?

> The difference between on-disk and in-memory encoded sizes is a vector
> of 16 bit words containing known offsets of physical fields within the
> record encoding.

And this depends on which fields are read from the record, because the offsets are filled in only as
they are needed.
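
To make that concrete, a sketch of such an on-demand offset vector (again the names are mine, and the value-length helper just assumes the type-byte scheme sketched above):

    #include <cstdint>
    #include <vector>

    // offsets[n] caches where field n starts inside the encoded record;
    // UNKNOWN until the field is first located. Finding field n walks
    // forward from the nearest known field and caches every offset
    // passed on the way, so the in-memory cost is paid only for fields
    // actually touched.
    struct EncodedRecord
    {
        static const uint16_t UNKNOWN = 0xFFFF;

        std::vector<uint8_t> data;        // the encoded record stream
        std::vector<uint16_t> offsets;    // one 16-bit word per physical field

        // Stand-in: assumes a type byte whose code implies the data length.
        size_t valueLength(size_t pos) const
        {
            switch (data[pos])
            {
                case 1:  return 1 + 1;   // tcInt8
                case 2:  return 1 + 2;   // tcInt16
                case 3:  return 1 + 4;   // tcInt32
                default: return 1 + 8;   // tcInt64
            }
        }

        uint16_t fieldOffset(size_t n)
        {
            if (offsets[n] != UNKNOWN)
                return offsets[n];
            // Back up to the nearest field already located.
            size_t f = n;
            while (f > 0 && offsets[f - 1] == UNKNOWN)
                --f;
            size_t pos = (f == 0) ? 0 : offsets[f - 1] + valueLength(offsets[f - 1]);
            for (; f <= n; ++f)           // walk forward, caching as we go
            {
                offsets[f] = static_cast<uint16_t>(pos);
                pos += valueLength(pos);
            }
            return offsets[n];
        }
    };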

> Run length encoded on top of data stream encoding looks like a waste of
> time. Other than trailing blanks in fixed length streams and
> significant runs of null, nothing is likely to repeat.

In fact the encoding itself already takes away the repeating values :)

> I do think that
> if an appropriate scheme can be found, additional on-disk compression
> would be a benefit, especially for character sets that map into
> multi-byte utf-8 characters. I'm taking another look at rfc 1951
> (DEFLATE) compression to see if it or a variant might do the trick.

The second advantage I see in your record encoding is that delta versions also stay small (the
same as with RLE). When compression is done over a whole record, the delta versions would probably
grow compared with the current sizes. Anyway, I think that compression for blobs is definitely needed.
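
For what it's worth, a minimal sketch of pushing a record image through zlib's raw deflate (zlib implements RFC 1951; windowBits = -15 selects the raw format without the zlib wrapper; the function name is mine):

    #include <cstdint>
    #include <vector>
    #include <zlib.h>

    // Compress a record image with raw DEFLATE (RFC 1951) using zlib.
    // Returns an empty vector on failure.
    std::vector<uint8_t> deflateRecord(const std::vector<uint8_t>& in)
    {
        z_stream zs = {};
        if (deflateInit2(&zs, Z_BEST_SPEED, Z_DEFLATED,
                         -15 /* raw deflate, no zlib header */,
                         8, Z_DEFAULT_STRATEGY) != Z_OK)
            return {};

        std::vector<uint8_t> out(deflateBound(&zs, in.size()));
        zs.next_in = const_cast<Bytef*>(in.data());
        zs.avail_in = static_cast<uInt>(in.size());
        zs.next_out = out.data();
        zs.avail_out = static_cast<uInt>(out.size());

        int rc = deflate(&zs, Z_FINISH);   // single-shot: all input at once
        size_t produced = zs.total_out;
        deflateEnd(&zs);

        if (rc != Z_STREAM_END)
            return {};
        out.resize(produced);
        return out;
    }

One caveat with per-record deflate is that every record starts with an empty dictionary, so it mostly pays off for larger values such as multi-byte utf-8 text and blobs.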

Regards,
Arno Brinkman
ABVisie
