Subject: Re: Record Encoding
Author: Roman Rokytskyy
> You're a Java guy with builtin zip support. Why don't you try
> inflating and deflating your favorite 20,000 blobs and get a handle
> on the costs of compression and decompression and some idea of
> the efficiency.

Jim, as I wrote, in this case I do not care about the CPU cycles; I
care about the page fetches. The tricky thing is that I usually know
on the client which "page" is needed - I know the offset in the stream
where to start reading.
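
If I were to measure just the CPU side as you suggest, a minimal
sketch with java.util.zip would look something like this (the 2 MB
size and the random content are placeholders, not my real blobs):

import java.util.Random;
import java.util.zip.Deflater;
import java.util.zip.Inflater;

public class ZipCost {
    public static void main(String[] args) throws Exception {
        byte[] blob = new byte[2 * 1024 * 1024];  // stand-in for one blob
        new Random(42).nextBytes(blob);           // worst case: incompressible

        // Time one deflate pass.
        Deflater def = new Deflater(Deflater.BEST_SPEED);
        byte[] packed = new byte[blob.length + blob.length / 1000 + 64];
        long t0 = System.nanoTime();
        def.setInput(blob);
        def.finish();
        int packedLen = 0;
        while (!def.finished())
            packedLen += def.deflate(packed, packedLen,
                    packed.length - packedLen);
        long t1 = System.nanoTime();

        // Time one inflate pass.
        Inflater inf = new Inflater();
        inf.setInput(packed, 0, packedLen);
        byte[] unpacked = new byte[blob.length];
        int got = 0;
        while (!inf.finished())
            got += inf.inflate(unpacked, got, unpacked.length - got);
        long t2 = System.nanoTime();

        System.out.printf("deflate %.1f ms, inflate %.1f ms, ratio %.2f%n",
                (t1 - t0) / 1e6, (t2 - t1) / 1e6,
                (double) packedLen / blob.length);
    }
}

But that still measures only the CPU, not the page fetches I care about.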

> Historical note: I was working for DEC's disk engineering group
> while productizing the first JRD. They sold disks and didn't want
> to hear anything about compression. I had to sell them compression
> as a performance feature. It worked for them. Maybe it will work with
> Roman.

Sure, if it turns out that fetching my blob into memory, decompressing
it there, and finally seeking in it is faster than accessing the page
directly...

> I suspect you're wrong. They probably use a 32 bit object space,
> but the objects themselves are outside the object space.

Do you mean the "permanent generation" heap where classes are kept? I
do not care about that, since it only limits the number of classes
that can be loaded into the VM. I do care about the "normal" heap, and
it is not possible to pass -Xmx1500m to the Sun JVM on Linux. That was
on a 32-bit machine (a 2-CPU box with 4 GB RAM, and also a 32-bit
zLinux machine).
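
In my case the VM simply refuses to start with -Xmx1500m. A trivial
check of what the VM actually grants (run it as "java -Xmx1500m
HeapCheck"):

public class HeapCheck {
    public static void main(String[] args) {
        // Reports the heap ceiling the VM actually granted.
        System.out.printf("max heap: %d MB%n",
                Runtime.getRuntime().maxMemory() / (1024 * 1024));
    }
}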

> In any case, you can always spend an extra $49.95 and buy a 64 bit
> machine.

If that solves the problem - yes. I have no experience with 64-bit x86
machines, only with Solaris on an ES4000. And that is no longer $49.95...

> >Fetch 2 MB blob into memory in order to get 4k block from it? Not
> >very efficient, isn't it?
> >
> If one blob fetch out of 100,000 is for an interior segment, yes it
> is. Almost nobody does that sort of thing for a number of reasons,
> the most important of which is that it performs like a pig across a
> wire.

This structure is an index "segment" - one is full-text, another is
spatial. The code jumps from one "record" to another in what can be
considered random order. Funnily enough, when I tried a scheme with
~4k VARCHARs in each record, accessed by PK, it performed approx. 4-5
times worse than one big BLOB with seek (the page size was 4k).
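
For reference, the two schemes I compared look roughly like this over
JDBC (the table and column names are made up for the example):

import java.sql.Blob;
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.ResultSet;

public class SeekVsRows {
    static final int CHUNK = 4096;

    // Scheme 1: one big BLOB, a 4k piece read via Blob.getBytes().
    static byte[] readViaBlobSeek(Connection c, long id, long offset)
            throws Exception {
        try (PreparedStatement ps = c.prepareStatement(
                "SELECT data FROM big_blob WHERE id = ?")) {
            ps.setLong(1, id);
            try (ResultSet rs = ps.executeQuery()) {
                rs.next();
                Blob b = rs.getBlob(1);
                return b.getBytes(offset + 1, CHUNK); // JDBC positions are 1-based
            }
        }
    }

    // Scheme 2: one ~4k VARCHAR row per chunk, fetched by primary key.
    static byte[] readViaRow(Connection c, long pk) throws Exception {
        try (PreparedStatement ps = c.prepareStatement(
                "SELECT data FROM chunks WHERE pk = ?")) {
            ps.setLong(1, pk);
            try (ResultSet rs = ps.executeQuery()) {
                rs.next();
                return rs.getBytes(1);
            }
        }
    }
}

Whether getBytes() translates into a real server-side seek or drags
the whole blob over the wire is, of course, the question - it depends
on the driver and the engine.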

> If you can't tell the difference, why should you care?

Sure, if there is no difference, or the difference is less than 30%, I
do not care. Ok, so far the only conclusion we can draw is that a
performance test is needed. I will try to create a standalone test
case; then you can experiment with compression.
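
The core of it would be just a timed loop of random 4k reads, roughly
like this (the connection URL, credentials, and ids are placeholders):

import java.sql.Blob;
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.util.Random;

public class RandomReadBench {
    public static void main(String[] args) throws Exception {
        try (Connection c = DriverManager.getConnection(
                "jdbc:firebirdsql://localhost/test.fdb",
                "sysdba", "masterkey");
             PreparedStatement ps = c.prepareStatement(
                "SELECT data FROM big_blob WHERE id = ?")) {
            ps.setLong(1, 1);
            try (ResultSet rs = ps.executeQuery()) {
                rs.next();
                Blob b = rs.getBlob(1);
                long size = b.length();   // assumed to be well over 4k
                Random rnd = new Random(42);
                int n = 10000;
                long t0 = System.nanoTime();
                for (int i = 0; i < n; i++) {
                    long pos = 1 + (long) (rnd.nextDouble() * (size - 4096));
                    b.getBytes(pos, 4096);   // one random 4k read
                }
                System.out.printf("%.1f us per 4k read%n",
                        (System.nanoTime() - t0) / 1000.0 / n);
            }
        }
    }
}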

Roman