Subject Re: [IB-Architect] Next ODS change / RLE Encryption
Author Jason Wharton
How about a little bit more of a rethink on the row level compression
algorithm...

The VARCHAR data type already has within it the ability to declare its
length of significant characters. So, why are we worried about accurately
storing the "whitespace" bytes at all?

We have a two-byte length indicator telling how much of the declared
length is significant. Let's take advantage of that for storage purposes as
well as for assembling network packets to cross the wire.

My buffered datasets in IBO take advantage of this, and I reap great savings
on memory use. It is also super cheap computationally to populate a buffer at
its fully declared size for handling changes, etc. I am confident that the
whitespace removal algorithm would cost about the same as the current RLE
one, if not less.
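
To make the idea concrete, here is a minimal sketch in Python of storing only the significant bytes of a VARCHAR behind its two-byte length, and re-expanding to the declared size on read. The function names, the little-endian length, and the space pad byte are assumptions for illustration, not the actual ODS layout.

```python
import struct

def pack_varchar(value, declared_len):
    """Store a 2-byte length followed by only the significant bytes."""
    if len(value) > declared_len:
        raise ValueError("value longer than declared length")
    # "<H" = little-endian unsigned 16-bit; byte order is an assumption here.
    return struct.pack("<H", len(value)) + value

def unpack_varchar(stored, declared_len, pad=b" "):
    """Return (significant_length, buffer expanded to the declared size)."""
    (n,) = struct.unpack_from("<H", stored, 0)
    data = stored[2:2 + n]
    # Populating the full-size buffer is a single cheap fill.
    return n, data + pad * (declared_len - n)
```

A VARCHAR(100) holding "abc" would occupy 5 bytes in this form instead of 102.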

After the whitespace suppression is applied, we can simply proceed with
the blind RLE methods that were previously employed and that folks are
suggesting too. Just keep in mind that we have already taken a broad stroke
at the stuff that RLE normally takes care of efficiently.
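
As a rough illustration of that last point, here is a deliberately naive (count, byte) run-length coder in Python. It is a generic textbook scheme with hypothetical function names, not the RLE format the engine actually uses.

```python
def rle_encode(data):
    """Encode bytes as (run_length, byte) pairs, runs capped at 255."""
    out = bytearray()
    i = 0
    while i < len(data):
        run = 1
        while i + run < len(data) and run < 255 and data[i + run] == data[i]:
            run += 1
        out += bytes((run, data[i]))
        i += run
    return bytes(out)

def rle_decode(encoded):
    """Inverse of rle_encode."""
    out = bytearray()
    for i in range(0, len(encoded), 2):
        out += bytes([encoded[i + 1]]) * encoded[i]
    return bytes(out)
```

With this scheme, "abc" followed by 29 pad spaces encodes to 8 bytes, almost all of the saving coming from the padding. Once the padding is already gone, the bare "abc" encodes to 6 bytes, which is larger than the input.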

Which leads me to another thought: perhaps we could introduce another
algorithm better suited to a string of bytes that doesn't have lengthy
repetitions. One that, like the varchar whitespace removal, takes advantage
of what it knows about the datatype.

For example, the date type could be mapped to a small int by storing it as
an offset from a base date chosen so that dates around the year 2000 have a
delta that fits nicely. Likewise, large int values that happen to be small
could be mapped into smallints and ints, etc.
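
A minimal sketch of both mappings in Python follows. The base date of 2000-01-01, the one-byte width tag, the byte order, and all function names are assumptions made for the example; the real on-disk date format is not modelled here.

```python
import struct
from datetime import date, timedelta

EPOCH = date(2000, 1, 1)  # assumed base date, chosen for illustration

# (tag, struct format, number of value bits excluding the sign)
_WIDTHS = ((0, "<h", 15), (1, "<i", 31), (2, "<q", 63))

def pack_narrow(value):
    """Store an integer in the narrowest of 2, 4 or 8 bytes, plus a tag byte."""
    for tag, fmt, bits in _WIDTHS:
        if -(1 << bits) <= value < (1 << bits):
            return bytes([tag]) + struct.pack(fmt, value)
    raise OverflowError("value does not fit in 64 bits")

def unpack_narrow(buf):
    """Inverse of pack_narrow."""
    fmt = _WIDTHS[buf[0]][1]
    (value,) = struct.unpack_from(fmt, buf, 1)
    return value

def encode_date(d):
    """Store a date as a day offset from EPOCH, narrowed when possible."""
    return pack_narrow((d - EPOCH).days)

def decode_date(buf):
    """Inverse of encode_date."""
    return EPOCH + timedelta(days=unpack_narrow(buf))
```

A signed 16-bit day offset reaches roughly 89 years either side of the base date, so most business dates would take the short form. Whether the tag byte pays for itself depends on how the record format could signal the width, which this sketch does not address.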

So, in short, what I am proposing is to take advantage of whitespace
suppression in all of the column types, and then use compression algorithm X
to squeeze things further: an algorithm that expects the density of
random-looking bytes we would have once the whitespace is removed.

As for CHAR columns, I think they could (when ideal) be stored as a VARCHAR
whose length is the number of characters minus the PAD CHAR bytes
(depending on character set) trimmed off the end of the string, whenever
the padding amounts to more than 2 bytes or so. I presume there is a PAD
CHAR for these, right? If the CHAR column is full and has no pad bytes, it
is simply stored as a CHAR at its fully declared length.
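
The CHAR rule could be sketched like this in Python. The "V"/"C" tag byte, the 2-byte threshold constant, the little-endian length, and the function names are all assumptions for illustration; how the record would really flag the two forms is left open.

```python
import struct

PAD_THRESHOLD = 2  # bytes of padding that must be exceeded ("2 bytes or so")

def store_char(value, declared_len, pad=b" "):
    """Store a CHAR as a trimmed VARCHAR when the padding is worth removing."""
    if len(value) != declared_len:
        raise ValueError("CHAR value must be at its declared length")
    stripped = value
    # Strip whole pad characters; pad may be multi-byte in some character sets.
    while stripped and stripped.endswith(pad):
        stripped = stripped[:-len(pad)]
    if declared_len - len(stripped) > PAD_THRESHOLD:
        return b"V" + struct.pack("<H", len(stripped)) + stripped
    return b"C" + value  # full, or nearly full: keep the fixed-length form

def load_char(stored, declared_len, pad=b" "):
    """Rebuild the fully padded CHAR value from either stored form."""
    if stored[:1] == b"C":
        return stored[1:1 + declared_len]
    (n,) = struct.unpack_from("<H", stored, 1)
    data = stored[3:3 + n]
    return data + pad * ((declared_len - n) // len(pad))
```

The threshold exists because the trimmed form carries a two-byte length, so removing one or two pad bytes would save nothing.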

FWIW,
Jason Wharton
CPS - Mesa AZ
http://www.ibobjects.com