Subject | Re: [Firebird-Architect] Blob Compress -- Some Numbers |
---|---|
Author | Larry Baddock |
Post date | 2005-05-18T06:49:51Z |
>> ...This means that you need round about 2 CPU's to keep 1 disk busy.
>> Adding
>> more CPU's is EXPENSIVE. Adding disks and striping them is unbelievably
>> cheap.
>>... I just don't fancy the idea that a simultaneous insert of 10
>> blobs from clients can max out my CPU. There are already enough latency
>> problems when the CPUs hit the wall.
>
> While not disagreeing with that argument in general, ordinarily the
> compression and decompression would happen on the client, which probably
> has under utilized CPU.

Yes, I agree that this is generally the best way to go. However, one must
not lose sight of the fact that there is a wide variety of compression
schemes available, and each has pros and cons with respect to differing
types of input. Leaving the actual compression of the data up to the
developer to embed in their application is, I believe, the way to go. Don't
bother to try and make it part of the client library OR the server. If the
developer wants to compress blobs, it can be done. (I have been doing this
for years). Remember that the application developer knows far more about what
the uncompressed stream contains than the developer of a general-purpose
database system does, and is therefore in a far better position to determine
what type of compression scheme to use.

If, for example, a new compression scheme called CRUNCH (or whatever) is
developed tomorrow, the developer can add it without having to wait
for client library updates, client library plugins OR server plumbing. It
can be incorporated right away, WITHOUT having to change any existing data
that is already stored in any database (see below).
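
(A rough sketch of that point, under the scheme described just below where the
first octet of the stored blob names the algorithm: if the decompression
routines live in a small table keyed by that octet, a new scheme is just one
more entry, and every octet value already written keeps decoding exactly as
before. The names and tag values here are made up for illustration, not
Firebird code or my actual code.)

```python
import zlib

# Hypothetical registry: tag octet -> decompression routine.
# Tags that have already been written to blobs never change meaning,
# so existing data keeps decoding after new schemes are registered.
DECODERS = {
    0x00: lambda body: body,      # 'RAW' - stored uncompressed
    0x01: zlib.decompress,        # deflate-compressed blobs
}

# Adding CRUNCH (or whatever) later is one more registration in the
# application itself; no client library or server change is involved:
# DECODERS[0x02] = crunch.decompress

def read_blob(stored: bytes) -> bytes:
    """Recover the original bytes using the leading tag octet."""
    return DECODERS[stored[0]](stored[1:])
```
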
What I tend to do (because all my compression is done on the client, of
course) is to have a stack of compression tools, and when the time comes to
compress a blob, use any subset or ALL of them simultaneously. Then pick the
algorithm (and its output) that produced the smallest result. Mark
the front of the blob (first octet) with the kind of compression used, and
write out the rest of the compressed output after this. When you read back
the blob, just use the first octet to index the correct decompression
algorithm, and you're done. (This is out of the question for the server, of
course). Naturally, there is a 'RAW' compression algorithm, which does
nothing, and is the selected algorithm (and output) if the input stream is
incompressible.
Works for me (And no, I don't have a need to read individual blob segments
in my applications - it's the whole blob or nothing).
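
(For what it's worth, a minimal sketch of that approach; zlib and bz2 stand in
for whatever stack of compressors the application actually carries, and the
tag values and function names are invented for the example.)

```python
import bz2
import zlib

RAW, ZLIB, BZ2 = 0x00, 0x01, 0x02        # arbitrary tag octets

# Candidate codecs: tag octet -> (compress, decompress)
CODECS = {
    ZLIB: (zlib.compress, zlib.decompress),
    BZ2:  (bz2.compress,  bz2.decompress),
}

def compress_blob(data: bytes) -> bytes:
    """Run every codec, keep the smallest result, prefix the tag octet.
    Falls back to RAW when nothing beats the uncompressed size."""
    best_tag, best_out = RAW, data
    for tag, (compress, _) in CODECS.items():
        out = compress(data)
        if len(out) < len(best_out):
            best_tag, best_out = tag, out
    return bytes([best_tag]) + best_out

def decompress_blob(stored: bytes) -> bytes:
    """Index the decompression routine by the first octet."""
    tag, body = stored[0], stored[1:]
    return body if tag == RAW else CODECS[tag][1](body)
```

Because the tag travels inside each blob, which algorithm won for any
particular row is invisible to everything else in the application.
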
At the end of the day, I can have 3 blobs in a single table, compressed with
3 different algorithms, and all is well.
And I can still use a db client library written 3 years ago to get the data.
On InterBase, Firebird or whatever.

This, of course, won't help you in the case of a legacy app, which it would be
nice to speed up (fewer disk seeks) simply by doing an 'ALTER TABLE X
ALTER COLUMN THE_BLOB COMPRESSION ON' or whatever, but you can't win them
all. If the Firebird project is going to do some form of application-transparent
compression for blobs, I believe that compression OFF should be the
the default. I'm sure LOTS of folks have already compressed their blob data
in some way or another.

In general, I just don't think that compression is something that should be
addressed by the project right now. Perhaps, one day, when the engine is
again disk-bound instead of CPU-bound, as well as bulletproof, developers
will have more time to apply their minds to the issue of compression in
general. I look forward to that day, and thank everyone who is involved with
the Firebird project for their contributions so far. I just wish that I had
some spare time to devote to the project, apart from an email to the mailing
lists from time to time. Hopefully, that day will come soon. :)

Larry