Subject: Re: [Firebird-Architect] Record Encoding
Author: Jim Starkey
Arno Brinkman wrote:

>>Number records: 676,643
>>Current compressed size (on disk): 74,793,858
>>Encoded size (on disk): 46,342,823
>>Current decompressed size (in memory): 206,762,788
>>Encoded size (in memory): 58,663,007
>>
>>
>
>Sounds interesting, so the encoded data already compresses better than the regular RLE (two
>bytes). Just to be sure I understand your encoding proposal: a value of datatype "64-bit
>integer" holding the decimal value "100" takes up only two bytes (1 for type and 1 for data)?
>
>
Yes. Even better, the 64-bit value "25" takes a single byte. But the
really big gains come from not storing the unused tails of varchars.
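
For concreteness, here is a minimal sketch of how such a length-tagged
integer encoding might look. The opcode values are assumptions made up
for the example, not the actual ODS codes:

#include <cstdint>
#include <vector>

// Hypothetical opcodes: 0..63 encode the literal values 0..63 in a single
// byte; 0x40 + n means "n bytes of little-endian integer data follow".
constexpr uint8_t kIntrinsicLimit = 64;
constexpr uint8_t kIntBase = 0x40;

void encodeInt(std::vector<uint8_t>& out, int64_t value)
{
    if (value >= 0 && value < kIntrinsicLimit)
    {
        out.push_back(static_cast<uint8_t>(value));    // "25" -> one byte
        return;
    }

    // Count the minimum number of bytes needed to hold the value.
    uint64_t v = static_cast<uint64_t>(value);
    int count = 0;
    do
    {
        ++count;
        v >>= 8;
    } while (v);

    out.push_back(static_cast<uint8_t>(kIntBase + count));    // type byte
    v = static_cast<uint64_t>(value);
    for (int i = 0; i < count; ++i)                           // data bytes
    {
        out.push_back(static_cast<uint8_t>(v & 0xFF));
        v >>= 8;
    }
}

With this scheme the 64-bit value 100 encodes as two bytes (the type byte
0x41 followed by 0x64), and 25 fits in the single literal byte 0x19, which
matches the sizes discussed above.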

>
>
>>The difference between on-disk and in-memory encoded sizes is a vector
>>of 16 bit words containing known offsets of physical fields within the
>>record encoding.
>>
>>
>
>And this depends on which fields are read from the record, because the offsets are set as needed.
>
>
Exactly. It parses fields only as far as necessary and resumes where it
left off. I really didn't want to start from the beginning each time,
even if the cost of skipping an item is very low.
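
A rough sketch of what that in-memory offset vector and lazy parsing could
look like; the class and helper names are invented for illustration and
assume the hypothetical opcodes from the earlier sketch:

#include <cstddef>
#include <cstdint>
#include <vector>

// Skip one encoded field and return the offset of the next one, using the
// hypothetical opcodes above (literal byte, or 0x40 + n data bytes).
static size_t skipField(const std::vector<uint8_t>& data, size_t offset)
{
    uint8_t op = data[offset];
    return (op < 0x40) ? offset + 1 : offset + 1 + (op - 0x40);
}

class EncodedRecord
{
public:
    explicit EncodedRecord(std::vector<uint8_t> encoded)
        : data(std::move(encoded))
    {
        offsets.push_back(0);           // field 0 always starts at offset 0
    }

    // Return the offset of physical field n, parsing only as far as needed
    // and remembering every offset discovered along the way, so the next
    // lookup resumes where this one left off.
    size_t fieldOffset(size_t n)
    {
        while (offsets.size() <= n)
            offsets.push_back(static_cast<uint16_t>(
                skipField(data, offsets.back())));
        return offsets[n];
    }

private:
    std::vector<uint8_t> data;          // the encoded record, as on disk
    std::vector<uint16_t> offsets;      // 16-bit offsets of fields parsed so far
};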

>
>
>>Run length encoded on top of data stream encoding looks like a waste of
>>time. Other than trailing blanks in fixed length streams and
>>significant runs of null, nothing is likely to repeat.
>>
>>
>
>In fact, the encoding already takes away the repeating values :)
>
>
Not the trailing repeating values. I'm not sure of the best way to
handle them in Firebird. I'm inclined toward simple truncation during
encoding, but this requires that the full field be reconstituted at
reference time, which in turn requires memory to rebuild the field.
It's an ugly problem with no good solution. Maybe the best answer is
to store the trailing blanks along with clear documentation on the evils
of fixed-length strings.
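
For what it's worth, the truncate-on-encode approach would look something
like this (the helper names are illustrative only, not engine code):

#include <string>

// Strip trailing blanks before encoding; only the significant prefix is stored.
std::string truncateTrailingBlanks(const std::string& field)
{
    std::string::size_type end = field.find_last_not_of(' ');
    return (end == std::string::npos) ? std::string() : field.substr(0, end + 1);
}

// Reconstitute the declared width at reference time. This is the memory cost
// mentioned above: a full-width copy of the field has to be rebuilt for the caller.
std::string padToDeclaredLength(const std::string& stored,
                                std::string::size_type declaredLength)
{
    std::string full = stored;
    full.resize(declaredLength, ' ');
    return full;
}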

The density is nice, but the thing that excites me most about the
encoding as an ODS structure is that it utterly eliminates any
dependency on physical field length. I'm toying with the idea of
replacing "varchar (<size>)" with a just "string" in Netfrastructure.
Fixed field lengths made perfectly good sense for punch cards, but I
think we're past that now.

>
>
>The second advantage I see with your record encoding is that delta versions also stay small (the
>same as with RLE). If compression were done over a whole record, the delta versions would probably
>grow compared with the current sizes. Anyway, I think that compression for blobs is definitely needed.
>
>
>
Hmm, you bring up an interesting point. This encoding scheme rather
blows the existing difference records out of the water. I suspect an
encoding-aware difference scheme would be necessary to stay efficient.
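
Purely to illustrate what "encoding-aware" might mean here: a difference
record could carry only the fields whose encoded values changed, rather
than a byte-wise diff of the compressed image. The structure below is a
guess for discussion, not a design from this thread:

#include <cstddef>
#include <cstdint>
#include <vector>

struct FieldDelta
{
    uint16_t fieldNumber;               // physical field that changed
    std::vector<uint8_t> encodedValue;  // the field's older encoded value
};

// A back version is then the list of field-level changes needed to turn
// the newest record back into the older one.
using DeltaRecord = std::vector<FieldDelta>;

DeltaRecord diffRecords(const std::vector<std::vector<uint8_t>>& oldFields,
                        const std::vector<std::vector<uint8_t>>& newFields)
{
    DeltaRecord delta;
    for (size_t i = 0; i < oldFields.size(); ++i)
        if (i >= newFields.size() || oldFields[i] != newFields[i])
            delta.push_back({static_cast<uint16_t>(i), oldFields[i]});
    return delta;
}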

Straight zlib compression for blobs would probably pay off big time, even
more so if decompression were done client-side. It would also make
on-the-fly blob searching quite exciting.
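
A sketch of what that could look like; compress2() and uncompress() are
standard zlib calls, while the wrapper functions here are just for
illustration:

#include <zlib.h>
#include <cstdint>
#include <stdexcept>
#include <vector>

std::vector<uint8_t> compressBlob(const std::vector<uint8_t>& blob)
{
    uLongf destLen = compressBound(blob.size());
    std::vector<uint8_t> packed(destLen);
    if (compress2(packed.data(), &destLen, blob.data(), blob.size(),
                  Z_DEFAULT_COMPRESSION) != Z_OK)
        throw std::runtime_error("blob compression failed");
    packed.resize(destLen);             // keep only the bytes actually used
    return packed;
}

// Decompression could just as easily run on the client; the original
// (uncompressed) size has to be stored alongside the compressed blob.
std::vector<uint8_t> decompressBlob(const std::vector<uint8_t>& packed,
                                    uLong originalSize)
{
    std::vector<uint8_t> blob(originalSize);
    uLongf destLen = originalSize;
    if (uncompress(blob.data(), &destLen, packed.data(), packed.size()) != Z_OK)
        throw std::runtime_error("blob decompression failed");
    blob.resize(destLen);
    return blob;
}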

--

Jim Starkey
Netfrastructure, Inc.
978 526-1376