Subject RE: [IB-Architect] Next ODS change (was: System table change)
Author David Berg
Good points. Thanks for the response.

* 10% CPU time for checksums! Wow. I'd suggest a simpler checksum
algorythm <grin>, but your idea of getting rid of them entirely is probably
better yet.

* Run length compression would be easier/faster to compute. Performance is
the one thing that worries me about using dictionary based compression for
rows, although I would point out that while compressing the data is slow,
decompression is very fast.

Another technique, similar to run length, I've used for optimal storage is a
token based schemes. One byte is reserved for the token, the token
determines what follows (if anything). For example to encode a number I use
one of the following tokens:

0: The number is a zero.
1: The number is a one.
2: The number is minus one.
3: The number is one byte, which follows.
4: The number is two bytes, which follow.
5: The number is four bytes, which follow.
6: The number is eight bytes, which follow.

Storing common values in the token is similar to a hard coded dictionary
technique, coding can be done with a case statement. For strings a similar
technique can be used:

10: The string is nulls.
11: The string is spaces.
12: The string is a CRLF.
13: The string is <=255 characters (followed by one byte length and data).
14: The string is <=65535 characters (followed by two bytes length and
15: The string is <=4GB characters (followed by four bytes length and

This was field based; however, if it was in the middle of text, you could
use a token to represent a repeated character, the character, and it's

You could also use tokens to represent compressed Unicode (where all
following bytes are Unicode characters with null high bytes which have been

I guess the way to look at it is that you can hard code dictionary schemes
to improve performance. You can also take advantage of data knowledge to
build better (and faster) dictionary schemes. This includes field based
dictionaries (as discussed) and known data formats (e.g. an XML or HTML Blob
Filter could take advantage of the tokenized format and use "<" and ">" to
sync the dictionary algorythm with).

* Good point on sorting index keys; however, a field based dictionary (where
the entire field value is in the dictionary) might not have the same
requirements for sorting (if the goal is to keep all the same values
together in the index, if it's to use it for order by clauses or processing
inequalities, then that's another issue). As hinted about earlier, field
based dictionaries don't have the same performance problems as standard
dictionary based systems since the entire field value is the dictionary
entry, so there's only one lookup required (which can be a binary search or
a hash).

* Also an interesting side thought is sending the data in compressed form
across the network and letting the client driver decompress it. This could
be a big speed win for large chunks of data.

Thanks for taking the time to share your thoughts.

-----Original Message-----
From: Ann Harrison [mailto:harrison@...]
Sent: Tuesday, November 07, 2000 2:56 PM
Subject: RE: [IB-Architect] Next ODS change (was: System table change)

At 11:48 AM 11/7/2000 -0700, David Berg wrote:
>On the subject of compression algorythms, have you considered dictionary
>based schemes?

Charlie told me that measurements showed that computing a checksum
for each page immediately before writing and immediately after reading
took 10% of the server CPU time. Since partial writes are very rare,
they stopped checksumming pages. I think about that every time I
starting thinking about elaborate compression or encryption schemes.

>First, it would be particularly particularly valuable in dealing with large
>text objects, such as XML and HTML pages (which have lots of repetition).

As a blob filter, dictionary-based compression would be great and can
be added at any time.

>Next down, dictionary compression of records would allow for common
>repeating patterns to be compressed together. For example, if most men pay
>with a visa card, and the fields SEX, CHARGE TYPE, and CARD# are in
>sequence, then a dictionary might treat "MVISA 4" as a single dictionary
>token (4 is the first digit of all Visa cards). Likewise women who pay
>would compress "FCASH " as a single token. (No sexual
>stereotypes intended)

What InterBase currently uses is a run-length compression which is
cheap to compute. It uses a signed byte to indicate length. A positive
value of <n> shows that the next <n> characters are part of the record.
A negative value of <n> shows that the next character (byte value) should
be repeated <n> times.

If the length indicator were a word, an empty varchar(32000) would be
stored as three bytes, rather than the current 300 or so. The algorithm
might use the high order bit to indicate whether the indicator was one
byte or two and the next bit to indicate a run of some character. Slightly
less good for long empty strings, better for short empty strings.

>For example a field declared as "Char(20) Dictionary(8)" would appear to
>outside world as a 20 character Alpha, but the internal storage would be an
>8 bit reference to a dictionary (thus allowing up to 255 unique 20
>values in the field). (You can actually think of this as two tables with
>implied updatable join, although I suspect the performance would be faster
>using special purpose dictionary code.)


>One of the particularly neat things about field based dictionaries (and
>record based compression in general) is that they can compress keys just as
>effectively. And the more keys stored on a single memory page, the faster
>the lookup.

Ummm. Right. It is significantly important that index keys (in their
key incarnation - not as they appear in the row) sort correctly byte-wise.
Prefix compression works exceedingly well for index keys ... Check the
stats on almost any index - the average key size will be much shorter than
the declared key length.

My thoughts, anyway.



To unsubscribe from this group, send an email to: