Subject: RE: [IB-Architect] Next ODS change (was: System table change)
Author: Ann Harrison
At 11:48 AM 11/7/2000 -0700, David Berg wrote:
>On the subject of compression algorithms, have you considered dictionary
>based schemes?

Charlie told me that measurements showed that computing a checksum
for each page immediately before writing and immediately after reading
took 10% of the server CPU time. Since partial writes are very rare,
they stopped checksumming pages. I think about that every time I
start thinking about elaborate compression or encryption schemes.

>First, it would be particularly valuable in dealing with large
>text objects, such as XML and HTML pages (which have lots of repetition).

As a blob filter, dictionary-based compression would be great and can
be added at any time.

>Next down, dictionary compression of records would allow for common
>repeating patterns to be compressed together. For example, if most men pay
>with a visa card, and the fields SEX, CHARGE TYPE, and CARD# are in
>sequence, then a dictionary might treat "MVISA 4" as a single dictionary
>token (4 is the first digit of all Visa cards). Likewise women who pay cash
>would compress "FCASH " as a single token. (No sexual
>stereotypes intended)

What InterBase currently uses is run-length compression, which is
cheap to compute. It uses a signed byte to indicate length. A positive
value <n> shows that the next <n> bytes are part of the record verbatim.
A negative value <n> shows that the next byte should be repeated
-<n> times.
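
Expressed as C, the expansion side of that scheme looks roughly like
this (an illustrative sketch, not the actual InterBase source; the
names are mine):

#include <stddef.h>

/* Expand the signed-byte run-length scheme described above.  A
 * positive control byte n means "copy the next n literal bytes"; a
 * negative control byte n means "repeat the following byte -n
 * times". */
size_t rle_expand(const signed char *in, size_t in_len,
                  unsigned char *out, size_t out_cap)
{
    size_t i = 0, o = 0;
    while (i < in_len) {
        int n = in[i++];
        if (n > 0) {                      /* literal bytes */
            while (n-- > 0 && i < in_len && o < out_cap)
                out[o++] = (unsigned char)in[i++];
        } else if (n < 0 && i < in_len) { /* repeated byte */
            unsigned char c = (unsigned char)in[i++];
            while (n++ < 0 && o < out_cap)
                out[o++] = c;
        }                                 /* n == 0: no-op */
    }
    return o;                             /* bytes written */
}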

If the length indicator were a word, an empty varchar(32000) would be
stored as three bytes, rather than the current 300 or so. The algorithm
might use the high order bit to indicate whether the indicator was one
byte or two and the next bit to indicate a run of some character. Slightly
less good for long empty strings, better for short empty strings.
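
One possible layout of that control word, again just a sketch (the
exact bit assignments here are my assumption, not a settled design):
the high bit selects a one- or two-byte indicator and the next bit
marks a run, leaving a 6-bit count in the short form and a 14-bit
count in the long form. Runs longer than 16383 bytes would have to
be split, which is why this is slightly less good for very long
empty strings than a full 16-bit count would be.

#include <stddef.h>

/* Emit one run under the assumed layout: 0b01cccccc for short runs,
 * 0b11cccccc cccccccc for long runs.  The caller must split counts
 * above 16383 into multiple runs. */
size_t emit_run(unsigned char *out, unsigned count, unsigned char byte)
{
    if (count < 64) {            /* short form: run flag + 6-bit count */
        out[0] = (unsigned char)(0x40 | count);
        out[1] = byte;
        return 2;
    }
    out[0] = (unsigned char)(0xC0 | (count >> 8));  /* long form */
    out[1] = (unsigned char)(count & 0xFF);
    out[2] = byte;
    return 3;
}

With this, an empty varchar(32000) costs a handful of bytes instead
of the current 300 or so.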

>For example a field declared as "Char(20) Dictionary(8)" would appear to the
>outside world as a 20 character Alpha, but the internal storage would be an
>8 bit reference to a dictionary (thus allowing up to 255 unique 20 character
>values in the field). (You can actually think of this as two tables with an
>implied updatable join, although I suspect the performance would be faster
>using special purpose dictionary code.)

Interesting.
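
To make the idea concrete, here is a sketch of what such a per-field
dictionary might look like (hypothetical code - this is not an
InterBase feature, and the names are mine):

#include <string.h>

#define DICT_SLOTS 255                /* Dictionary(8): one-byte token */
#define FIELD_LEN  20                 /* CHAR(20) */

struct dict20 {
    unsigned char count;              /* distinct values seen so far */
    char values[DICT_SLOTS][FIELD_LEN];
};

/* Return the token for a value, adding it if unseen; 0xFF = full. */
static unsigned char dict_encode(struct dict20 *d, const char *val)
{
    unsigned char i;
    for (i = 0; i < d->count; i++)
        if (memcmp(d->values[i], val, FIELD_LEN) == 0)
            return i;
    if (d->count == DICT_SLOTS)
        return 0xFF;                  /* full: fall back to raw storage */
    memcpy(d->values[d->count], val, FIELD_LEN);
    return d->count++;
}

static const char *dict_decode(const struct dict20 *d, unsigned char tok)
{
    return tok < d->count ? d->values[tok] : NULL;
}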

>One of the particularly neat things about field based dictionaries (and
>record based compression in general) is that they can compress keys just as
>effectively. And the more keys stored on a single memory page, the faster
>the lookup.

Ummm. Right. It is critically important that index keys (in their
key incarnation - not as they appear in the row) sort correctly byte-wise.
Prefix compression works exceedingly well for index keys ... Check the
stats on almost any index - the average key size will be much shorter than
the declared key length.
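
For the curious, prefix (front) compression on sorted keys amounts to
this (a sketch of the general technique, not the actual InterBase
index layout):

#include <stddef.h>

/* Store each key as the number of leading bytes it shares with its
 * predecessor, plus only the differing suffix. */
struct prefixed_key {
    unsigned short shared;            /* bytes shared with previous key */
    unsigned short suffix_len;
    const unsigned char *suffix;
};

struct prefixed_key prefix_compress(const unsigned char *prev, size_t prev_len,
                                    const unsigned char *key,  size_t key_len)
{
    size_t n = 0, max = prev_len < key_len ? prev_len : key_len;
    struct prefixed_key pk;

    while (n < max && prev[n] == key[n])
        n++;
    pk.shared = (unsigned short)n;
    pk.suffix_len = (unsigned short)(key_len - n);
    pk.suffix = key + n;
    return pk;
}

Since neighboring keys in a sorted index tend to share long prefixes,
most entries store only a short suffix, which is why the average
stored key is so much shorter than the declared length.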

My thoughts, anyway.

Regards,

Ann