Subject RE: [IB-Architect] Next ODS change / RLE Encryption
Author David Berg
There's an easy change to RLE encoding that would increase compression
substantially for long string values.

Use 0 to 128 to mean the character value.
Use -2 to -128 to mean a repeat count followed by a character.
Use -1 to mean that the next two bytes are a repeat count followed by a
character.

This would allow an empty varchar(32000) to be stored as 4 bytes:
-1,low byte(32000),high byte(32000), 0 or 32 (which ever's appropriate)

Since most large varchars are probably at least 50% empty, this would allow
the entire (blank) end of the string to be compressed as a single 4 byte
token. This would not only take less space, but it would probably be faster
to expand one 4 byte token than 250(+/-) 2 byte tokens.

In fact, in a typical table with a (mostly unused) var char (32000) you'd
see an average savings of over 400 bytes per record!

I picked -1 because I assume that the compression algorythm would never
compress a single byte. Assuming this means that -1 is never used, existing
data could be processed reliably by the new algorythm, so no data conversion
would be necessary (although the database would need to be flagged with a
new format code so that old software versions would know they couldn't read
the data).

If my assumptions are correct, then -2 is probably also available. I
thought about using this to lead into a 4 byte compression, but unless this
is applied to memos or blobs, then there's no value to it (even there, it's
hard to imagine a document with more than 65536 repeated characters).


Another alternative would be to look at the second byte and apply this rule:

if (byte1 < 0) and (byte2 < 0) then
result is byte3 repeated (abs(byte1) * 128 + abs(byte2)) times

This has the advantage of encoding in one less byte (using the fact that we
know characters are >= 0 and <= 127); however, it has problems where the
number of repeated characters is a multiple of 128. This can be handled in
various ways: (a) use -128 to mean 0, (b) make byte2 the logical not of the
low order word (instead of the negative), (c) if the repeat count is a
multiple of 128, then encode all but the last character (making the repeat
count no longer a multiple of 128).

Frankly, I don't think saving the extra byte is worth the pain of the second
approach. Further, the savings may fail to materialize. The second
approach can compress 16383 characters into 3 bytes; however, that leaves an
empty VAR CHAR(32000) as 6 bytes (instead of the 4 of the first approach).

FWIW,

Dave

-----Original Message-----
From: Ivan Prenosil [mailto:prenosil@...]
Sent: Tuesday, November 14, 2000 7:16 AM
To: IB-Architect@egroups.com
Subject: Re: [IB-Architect] Next ODS change (was: System table change)


> From: Ann Harrison:
> What InterBase currently uses is a run-length compression which is
> cheap to compute. It uses a signed byte to indicate length. A positive
> value of <n> shows that the next <n> characters are part of the record.
> A negative value of <n> shows that the next character (byte value) should
> be repeated <n> times.
>
> If the length indicator were a word, an empty varchar(32000) would be
> stored as three bytes, rather than the current 300 or so.

RLE compression in IB is able to map maximally 128 repeating bytes into 2
bytes
(-128 and the character), so the best achievable compression ratio is 1:64.
It means empty 32000 CHAR will occupy exactly 500 bytes (nice number to
remember).

VARCHAR (besides 2 bytes for string length) can be even longer because
IB will not fill rest of string by zeroes (or spaces) when UPDATEing it.


Ivan
prenosil@...


To unsubscribe from this group, send an email to:
IB-Architect-unsubscribe@onelist.com