Subject RE: [IB-Architect] Next ODS change / RLE Encryption
Author David Berg
There's an easy change to RLE encoding that would increase compression
substantially for long string values.

Use 0 to 128 to mean the character value.
Use -2 to -128 to mean a repeat count followed by a character.
Use -1 to mean that the next two bytes are a repeat count followed by a

This would allow an empty varchar(32000) to be stored as 4 bytes:
-1,low byte(32000),high byte(32000), 0 or 32 (which ever's appropriate)

Since most large varchars are probably at least 50% empty, this would allow
the entire (blank) end of the string to be compressed as a single 4 byte
token. This would not only take less space, but it would probably be faster
to expand one 4 byte token than 250(+/-) 2 byte tokens.

In fact, in a typical table with a (mostly unused) var char (32000) you'd
see an average savings of over 400 bytes per record!

I picked -1 because I assume that the compression algorythm would never
compress a single byte. Assuming this means that -1 is never used, existing
data could be processed reliably by the new algorythm, so no data conversion
would be necessary (although the database would need to be flagged with a
new format code so that old software versions would know they couldn't read
the data).

If my assumptions are correct, then -2 is probably also available. I
thought about using this to lead into a 4 byte compression, but unless this
is applied to memos or blobs, then there's no value to it (even there, it's
hard to imagine a document with more than 65536 repeated characters).

Another alternative would be to look at the second byte and apply this rule:

if (byte1 < 0) and (byte2 < 0) then
result is byte3 repeated (abs(byte1) * 128 + abs(byte2)) times

This has the advantage of encoding in one less byte (using the fact that we
know characters are >= 0 and <= 127); however, it has problems where the
number of repeated characters is a multiple of 128. This can be handled in
various ways: (a) use -128 to mean 0, (b) make byte2 the logical not of the
low order word (instead of the negative), (c) if the repeat count is a
multiple of 128, then encode all but the last character (making the repeat
count no longer a multiple of 128).

Frankly, I don't think saving the extra byte is worth the pain of the second
approach. Further, the savings may fail to materialize. The second
approach can compress 16383 characters into 3 bytes; however, that leaves an
empty VAR CHAR(32000) as 6 bytes (instead of the 4 of the first approach).



-----Original Message-----
From: Ivan Prenosil [mailto:prenosil@...]
Sent: Tuesday, November 14, 2000 7:16 AM
Subject: Re: [IB-Architect] Next ODS change (was: System table change)

> From: Ann Harrison:
> What InterBase currently uses is a run-length compression which is
> cheap to compute. It uses a signed byte to indicate length. A positive
> value of <n> shows that the next <n> characters are part of the record.
> A negative value of <n> shows that the next character (byte value) should
> be repeated <n> times.
> If the length indicator were a word, an empty varchar(32000) would be
> stored as three bytes, rather than the current 300 or so.

RLE compression in IB is able to map maximally 128 repeating bytes into 2
(-128 and the character), so the best achievable compression ratio is 1:64.
It means empty 32000 CHAR will occupy exactly 500 bytes (nice number to

VARCHAR (besides 2 bytes for string length) can be even longer because
IB will not fill rest of string by zeroes (or spaces) when UPDATEing it.


To unsubscribe from this group, send an email to: