Subject: Record Encoding
Author: Jim Starkey
As an experiment, I tried encoding all records (excluding blobs) in one
of my production databases. The current record structure is similar to
Firebird's, though run length encoding is based on two-byte units rather
than one. The database may be more textual than some, but with a good
sprinkling of database. Virtually all primary and foreign keys are 32-bit
integers generated by sequences. There are no wildly overspecified
fields. Since Netfrastructure has fewer semantic differences between
blobs and text, the scheme is probably slightly more blob intensive than
a Firebird database. A significant difference in data demographics,
however, is that fixed length strings are all but unheard of in
Netfrastructure.
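
For readers unfamiliar with run length encoding over two-byte units, here
is a minimal Python sketch of the idea. It is not the Netfrastructure or
Firebird on-disk format: the (count, chunk) pair representation and the
names rle_encode_units and rle_decode_units are assumptions made purely
for illustration.

```python
def rle_encode_units(data: bytes, unit: int = 2) -> list[tuple[int, bytes]]:
    """Collapse consecutive identical `unit`-byte chunks into (count, chunk) pairs.

    Hypothetical representation for illustration only; a real record
    format would pack the counts into control bytes.
    """
    if len(data) % unit:
        raise ValueError("data length must be a multiple of the unit size")
    runs: list[tuple[int, bytes]] = []
    for i in range(0, len(data), unit):
        chunk = data[i:i + unit]
        if runs and runs[-1][1] == chunk:
            # Same chunk as the previous one: extend the current run.
            runs[-1] = (runs[-1][0] + 1, chunk)
        else:
            # Different chunk: start a new run of length one.
            runs.append((1, chunk))
    return runs


def rle_decode_units(runs: list[tuple[int, bytes]]) -> bytes:
    """Expand (count, chunk) pairs back into the original byte string."""
    return b"".join(chunk * count for count, chunk in runs)
```

Trailing blanks in a padded field collapse into a single run, while
ordinary text produces mostly runs of length one.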

Number of records: 676,643
Current compressed size (on disk): 74,793,858
Encoded size (on disk): 46,342,823
Current decompressed size (in memory): 206,762,788
Encoded size (in memory): 58,663,007


The difference between the on-disk and in-memory encoded sizes is due to
a vector of 16-bit words containing known offsets of physical fields
within the record encoding.
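
The figures above can be reduced to a few ratios. The Python sketch below
uses only the numbers quoted in this post; the function name summarize
and the dictionary keys are my own labels, and the per-record values are
plain averages that assume the whole in-memory difference is the offset
vector. It works out to roughly 62% of the current size on disk, about
28% in memory, and about 18 bytes (around nine 16-bit offsets) of offset
vector per record.

```python
# Figures quoted in the post.
RECORDS = 676_643
CURRENT_DISK = 74_793_858
ENCODED_DISK = 46_342_823
CURRENT_MEMORY = 206_762_788
ENCODED_MEMORY = 58_663_007


def summarize() -> dict[str, float]:
    """Derive size ratios and the average offset-vector overhead per record."""
    # Assumption: the in-memory/on-disk gap is entirely the offset vector.
    offset_bytes = ENCODED_MEMORY - ENCODED_DISK
    return {
        "disk_ratio": ENCODED_DISK / CURRENT_DISK,
        "memory_ratio": ENCODED_MEMORY / CURRENT_MEMORY,
        "offset_vector_bytes": offset_bytes,
        "offset_bytes_per_record": offset_bytes / RECORDS,
        # Each offset is a 16-bit word, i.e. two bytes.
        "offset_words_per_record": offset_bytes / RECORDS / 2,
    }


if __name__ == "__main__":
    for name, value in summarize().items():
        print(f"{name}: {value:,.3f}")
```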

Run length encoding on top of the data stream encoding looks like a waste
of time. Other than trailing blanks in fixed length strings and
significant runs of nulls, nothing is likely to repeat. I do think that
if an appropriate scheme can be found, additional on-disk compression
would be a benefit, especially for character sets that map into
multi-byte UTF-8 characters. I'm taking another look at RFC 1951
(DEFLATE) compression to see if it or a variant might do the trick.
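
As a rough way to try this out, the Python sketch below runs raw RFC 1951
DEFLATE over a byte string using the standard zlib module (a negative
wbits value asks zlib for a raw stream with no zlib header or checksum).
The helper names deflate_raw and inflate_raw and the Cyrillic sample text
are assumptions for illustration; this is not a proposal for the actual
on-disk format, and real records will compress differently.

```python
import zlib


def deflate_raw(data: bytes, level: int = 9) -> bytes:
    """Compress with raw DEFLATE (RFC 1951), without zlib header or trailer."""
    compressor = zlib.compressobj(level, zlib.DEFLATED, -15)
    return compressor.compress(data) + compressor.flush()


def inflate_raw(data: bytes) -> bytes:
    """Decompress a raw DEFLATE stream produced by deflate_raw."""
    return zlib.decompress(data, -15)


if __name__ == "__main__":
    # Hypothetical sample: text whose characters each take two bytes in UTF-8.
    sample = ("Данные записи " * 20).encode("utf-8")
    packed = deflate_raw(sample)
    assert inflate_raw(packed) == sample
    print(len(sample), "bytes raw,", len(packed), "bytes deflated")
```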


--

Jim Starkey
Netfrastructure, Inc.
978 526-1376