Subject Re: [Firebird-Architect] Blob Compress -- Some Numbers
Author Jim Starkey
Lester Caine wrote:

>The bottom line is we would like to be in control! How many of those
>msword and pdf files can you actually do a search on IN the blob? I
>would expect to have to build a text version of them that can can use (
>which is what *I* am doing now ;) ) and THAT blob will benefit from
>compression SOMEWHERE in the system? Of cause someone more clever than
>me would probably have a solution on searching the originals ?
>
>
Things in "documents" were predominantly of two types: Word
documents that were content source saved for future reference (i.e.
download) and attachments. The Word documents are translated to HTML on
entry and never referenced unless an author wants the original source
back. There are lots of other blobs that contain HTML, but they didn't
get measured. If they had, the results probably would have been quite
different, but still misleading -- HTML compresses quite nicely, thank
you, but is also relatively small compared with a Word document, so few,
if any, page reads for multi-page blobs are going to be avoided. I'm
inclined to think that small ASCII / UTF-8 blob compression buys little
but overhead.
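
To make the page-read argument concrete, here is a rough sketch, not a
measurement. It assumes a 4096-byte page, uses zlib as a stand-in for
whatever compressor would be chosen, and ignores any per-page or
per-blob overhead. The names PAGE_SIZE, pages_needed and
page_reads_saved are made up for illustration.

```python
import zlib

PAGE_SIZE = 4096  # assumed page size in bytes, for illustration only


def pages_needed(n_bytes, page_size=PAGE_SIZE):
    """Whole pages needed to hold n_bytes (at least one page)."""
    return max(1, -(-n_bytes // page_size))  # ceiling division


def page_reads_saved(blob, page_size=PAGE_SIZE):
    """Pages saved by storing the zlib-compressed form instead of the raw blob."""
    compressed = zlib.compress(blob)
    return pages_needed(len(blob), page_size) - pages_needed(len(compressed), page_size)


# A small HTML-like blob already fits on one page, so compression saves no reads.
small_html = b"<p>Hello, world</p>" * 50

# A large, highly repetitive blob spans many pages and shrinks to far fewer.
large_doc = b"<p>Hello, world</p>" * 5000

print(page_reads_saved(small_html))
print(page_reads_saved(large_doc))
```

A blob that already fits on a single page cannot save a page read no
matter how well it compresses, so the CPU spent compressing and
decompressing it is pure overhead under these assumptions.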

PDFs are next to unsearchable. The content is contained in PostScript
commands compressed with LZW. To make PDFs searchable, I put on my
laser printer emulation hat, did a scatter gather of text from the page,
turned the sucker into straight text, and indexed that. Searching them on
the fly is worse than trying to figure out what a politician said given
a script of his speech.
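
The "turn it into straight text and index that" step might look
something like the sketch below. It assumes the text has already been
extracted from each blob (the extraction itself is not shown), and the
names tokenize, build_index and search are hypothetical, not taken from
any actual system.

```python
import re
from collections import defaultdict


def tokenize(text):
    """Lower-case the text and split it into alphanumeric words."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_index(extracted):
    """Map each word to the set of blob ids whose extracted text contains it."""
    index = defaultdict(set)
    for blob_id, text in extracted.items():
        for word in tokenize(text):
            index[word].add(blob_id)
    return index


def search(index, query):
    """Return the ids of blobs whose text contains every word in the query."""
    words = tokenize(query)
    if not words:
        return set()
    result = set(index.get(words[0], set()))
    for word in words[1:]:
        result &= index.get(word, set())
    return result


# Hypothetical extracted text keyed by blob id.
extracted = {
    1: "Blob compression: some numbers",
    2: "Laser printer emulation and text extraction",
}
index = build_index(extracted)
print(search(index, "blob numbers"))
```

The search never touches the original blob, only the index built from
the extracted text, which is why the format of the original stops
mattering once the text is out.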