Subject: Re: [Firebird-Architect] Blob Compress -- Some Numbers
Author: Larry Baddock
> I'm losing my enthusiasm for compressed blobs. I'm not convinced the
> big win is there.
I agree with you here.

A quick survey of compression-speed comparisons for different algorithms
(found via Google) suggests that compression on reasonably modern machines
runs at around 1-3 MB/s. Tests I have run with a large disk-seek component on
a standard serial ATA disk show that the disk can sustain about 4 MB/s. (Of
course, linear reads off the disk are around the 55 MB/s mark, but that
doesn't tell you anything useful about real-life usage patterns.)
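
Anyone who wants to check the compression-speed figure on their own hardware
can do it in a few lines of zlib. The following is only a rough sketch: the
16 MB buffer size and the repetitive filler pattern are arbitrary choices of
mine, and the timing is coarse.

/* Rough zlib deflate throughput check.  Build with: cc -O2 bench.c -lz */
#include <stdio.h>
#include <stdlib.h>
#include <time.h>
#include <zlib.h>

int main(void)
{
    const size_t srcLen = 16 * 1024 * 1024;       /* 16 MB test buffer */
    unsigned char *src = malloc(srcLen);
    uLongf destLen = compressBound(srcLen);       /* worst-case output size */
    unsigned char *dest = malloc(destLen);
    size_t i;

    if (!src || !dest)
        return 1;
    for (i = 0; i < srcLen; i++)                  /* repetitive filler; use */
        src[i] = (unsigned char)(i % 251);        /* real blob data for a  */
                                                  /* realistic number      */
    clock_t t0 = clock();
    if (compress2(dest, &destLen, src, srcLen, Z_DEFAULT_COMPRESSION) != Z_OK)
        return 1;
    double secs = (double)(clock() - t0) / CLOCKS_PER_SEC;

    printf("%zu -> %lu bytes in %.2f s (%.1f MB/s)\n",
           srcLen, (unsigned long)destLen, secs,
           srcLen / (1024.0 * 1024.0) / secs);
    free(src);
    free(dest);
    return 0;
}

Whatever figure it prints on your box is the most one CPU can feed into
compressed blobs, before the engine does any other work at all.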

This means that you need roughly two CPUs to keep one disk busy (4 MB/s of
seek-heavy disk throughput against roughly 2 MB/s of compression per CPU).
Adding more CPUs is EXPENSIVE. Adding disks and striping them is unbelievably
cheap.
Check out the relative pricing of a 4-CPU box with 2 disks, compared with a
1-CPU box with 4 disks in it. These should give about the same performance,
assuming that you manage to compress all your data by 50%, which looks like a
pipe dream with a totally general blob workload to start off with. (Not to
mention that the 1-CPU box will still have spare processing capacity.)

Of course, this does not take into account that decompression is faster than
compression; your mix of reads and writes will largely determine what you can
cope with. I just don't fancy the idea that a simultaneous insert of 10 blobs
from clients can max out my CPUs. There are already enough latency problems
when the CPUs hit the wall.

On another track: if you really do want to implement blob compression anyway,
you can still do it so that there is blob seek functionality for super duper
large blobs. If you read RFC 1951, you will soon discover that the
compression trees within the DEFLATE algorithm are flushed every 32 KB of
input ANYWAY, and the trees start again. All you need to do is flush the
compressor every 32 KB of input (which it is doing internally ANYWAY) and
write the compressed segments, each with a header stating the byte range that
the compressed block contains. According to RFC 1951, there is also a mode of
DEFLATE that does NOT look at previous blocks for redundant data, which will
simplify the coding and speed up the compression (unfortunately decreasing
the compression ratio at the same time); a sketch along these lines follows
below. This does mean that if you want to offload compression to the clients,
there's an awful lot of work to do (think about it), so either you live with
the server compressing the data, or you take on the extra burden of
complicating the client/server interaction, and of doing it correctly. Every
time.
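
To make the segment idea concrete, here is a rough sketch with zlib.
Everything in it is made up for illustration: the SegHeader layout, the
names, and the choice to compress each segment as a completely independent
deflate stream (the "ignore previous blocks" trade-off taken to its extreme).
A real implementation would also serialize the header fields portably rather
than fwrite-ing the struct.

/* Sketch only: write a blob as independently compressed 32 KB segments,
 * each prefixed by a header giving its uncompressed byte range, so a
 * reader can seek by scanning headers without inflating anything. */
#include <stdint.h>
#include <stdio.h>
#include <string.h>
#include <zlib.h>

#define SEG_SIZE (32 * 1024)

typedef struct {
    uint64_t offset;     /* first uncompressed byte in this segment  */
    uint32_t raw_len;    /* SEG_SIZE except possibly the last block  */
    uint32_t comp_len;   /* compressed bytes following this header   */
} SegHeader;             /* hypothetical on-disk layout              */

/* Compress one segment as an independent deflate stream.
 * Returns the compressed length, or -1 on error. */
static long compress_segment(const unsigned char *in, uint32_t in_len,
                             unsigned char *out, uint32_t out_cap)
{
    z_stream zs;
    long out_len;
    int rc;

    memset(&zs, 0, sizeof zs);
    if (deflateInit(&zs, Z_DEFAULT_COMPRESSION) != Z_OK)
        return -1;
    zs.next_in  = (Bytef *)in;  zs.avail_in  = in_len;
    zs.next_out = out;          zs.avail_out = out_cap;
    rc = deflate(&zs, Z_FINISH);      /* whole segment in one call */
    out_len = (long)zs.total_out;
    deflateEnd(&zs);
    return rc == Z_STREAM_END ? out_len : -1;
}

static int write_blob(FILE *f, const unsigned char *data, uint64_t len)
{
    static unsigned char out[SEG_SIZE + 1024];  /* > compressBound(SEG_SIZE) */
    uint64_t off = 0;

    while (off < len) {
        uint32_t raw = (len - off < SEG_SIZE) ? (uint32_t)(len - off)
                                              : SEG_SIZE;
        long comp = compress_segment(data + off, raw, out, sizeof out);
        SegHeader h;

        if (comp < 0)
            return -1;
        h.offset = off;  h.raw_len = raw;  h.comp_len = (uint32_t)comp;
        if (fwrite(&h, sizeof h, 1, f) != 1 ||
            fwrite(out, 1, (size_t)comp, f) != (size_t)comp)
            return -1;
        off += raw;
    }
    return 0;
}

int main(void)
{
    static unsigned char blob[100000];
    size_t i;
    FILE *f;
    int rc;

    for (i = 0; i < sizeof blob; i++)
        blob[i] = (unsigned char)(i % 251);     /* dummy blob content */
    f = fopen("blob.seg", "wb");
    if (!f)
        return 1;
    rc = write_blob(f, blob, sizeof blob);
    fclose(f);
    return rc ? 1 : 0;
}

A reader that wants a given byte range just walks the headers, seeking past
comp_len bytes each time, and inflates only the segments whose ranges overlap
the request. Compressing every segment independently is exactly the trade
mentioned above: simpler code and seekability, at the cost of losing
cross-segment matches.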

Larry