Subject Re: [Firebird-Architect] Re: Full Text Search
Author Jim Starkey
Lester Caine wrote:

>They are just filters - and I'll ask again - could they be handled as
>BLOB filters or do we need something else. The first step is obviously
>taking the data and identifying indexable 'character entities' - coming
>up with at least an outline of where we are heading will mean I can
>start switching the hard coded stuff in line with it ;)
>
>
It could be done with filters, but the overhead of a Word or PDF filter
makes this very undesirable. I've found that maintaining the original
binary format document (Word, PDF, etc.) separately from the
"searchable" textual representation is a good balance. It is almost
always necessary to retain the original for download and revision. It
also is very poor practice to mix textual and binary stuff in the same
blob column. For the actual indexing, Netfrastructure uses built-in
filters for text, *ml (first non-whitespace character is a '<'), and
differences between two versions during update operations.
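
To make the text / *ml distinction concrete, here is a rough sketch in
Python. It is not Netfrastructure's code: the function names and the
crude regex tag stripping are assumptions made up for illustration, and
the version-difference filter is left out.

```python
import re

# Hypothetical helpers for illustration only.
TAG = re.compile(r"<[^>]*>")   # crude tag matcher, not a real *ml parser
WORD = re.compile(r"\w+")

def choose_filter(text: str) -> str:
    """Pick 'markup' when the first non-whitespace character is '<'."""
    return "markup" if text.lstrip().startswith("<") else "text"

def indexable_words(text: str) -> list[str]:
    """Return lower-cased words, dropping tags for markup input."""
    if choose_filter(text) == "markup":
        text = TAG.sub(" ", text)
    return [w.lower() for w in WORD.findall(text)]
```

Plain text goes straight to the word splitter; anything that starts
with '<' has its tags removed first.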

You're unlikely to find a PDF filter, however. I've surveyed most of
the open source world looking for a decent one. There are a few, but
they don't work very well. I have one that is better than most, but it
still isn't very good. Even St. Google makes a hash out of most PDFs.
The problem is that the text doesn't exist in a PDF document, just
streams of PostScript commands. To turn the gook into linear text, you
need to handle all of PostScript down to the sizing of glyphs for
individual characters in every font. There is no alternative but to
emulate a laser printer, generate a page image, then reassemble the text
based on x and y coordinates. It's hard, very hard. I'd much rather
argue with Nickolay about memory pools.
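
For a sense of just the last step, reassembling text from x and y
coordinates, here is a minimal Python sketch. It assumes the hard part
is already done and a renderer has produced positioned glyphs. The
Glyph type, the tolerances, and the function name are all hypothetical,
and it ignores columns, rotation, and kerning, which is where real
documents go wrong.

```python
from typing import NamedTuple

class Glyph(NamedTuple):
    # Hypothetical renderer output: position, advance width, character.
    x: float
    y: float
    width: float
    char: str

def reassemble(glyphs, line_tol=2.0, space_gap=1.5):
    """Group glyphs into lines by y, order by x, insert spaces at gaps."""
    # Page coordinates are assumed to grow upward, so larger y is higher.
    ordered = sorted(glyphs, key=lambda g: (-g.y, g.x))
    lines = []
    for g in ordered:
        # Same line if y is within tolerance of the line's first glyph.
        if lines and abs(lines[-1][0].y - g.y) <= line_tol:
            lines[-1].append(g)
        else:
            lines.append([g])
    out = []
    for line in lines:
        line.sort(key=lambda g: g.x)
        text = line[0].char
        for prev, cur in zip(line, line[1:]):
            # A horizontal gap wider than space_gap is treated as a space.
            if cur.x - (prev.x + prev.width) > space_gap:
                text += " "
            text += cur.char
        out.append(text)
    return "\n".join(out)
```

Even this toy version needs two tuning knobs (line_tol and space_gap),
and the right values depend on font size.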