Subject | Text search ... |
---|---|
Author | Lester Caine |
Post date | 2019-03-12T22:44:32Z |
I've got a few sites with a growing number of PDF files whose content
it would be nice to actually index. The first problem is obviously the
varying quality of the PDFs, and I've had FineReader deployed in some
cases to provide OCRed copies of the originals, with the usual variable
success. The question is what the best base to work towards is. I'm
currently working on the basis that we store the original file and I
create thumbnails of the front page, so I'm now looking at stripping the
raw text. Has anybody been there already? Any suggestions for
Linux-based solutions ...
The current indexing process pulls a list of words from the document
and builds a manual index. It was first working pre-Firebird and has
not changed, so is there a better way with FB3?
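For reference, the word-list approach can be sketched roughly as
follows. This is a minimal Python illustration, not the actual code in
use: the function names, the three-character minimum word length and
the in-memory dictionary (standing in for a database table) are all
assumptions, and it starts from text that has already been extracted
from the PDF.

```python
import re
from collections import defaultdict

# Assumption: a "word" is a run of ASCII letters or digits.
WORD_RE = re.compile(r"[a-z0-9]+")


def extract_words(text, min_len=3):
    """Return the distinct lowercase words of at least min_len characters."""
    return {w for w in WORD_RE.findall(text.lower()) if len(w) >= min_len}


def build_index(documents):
    """Map each word to the set of document ids that contain it.

    `documents` is a dict of {doc_id: extracted_text}; in a real system
    the result would be rows in a word/document table, not a dict.
    """
    index = defaultdict(set)
    for doc_id, text in documents.items():
        for word in extract_words(text):
            index[word].add(doc_id)
    return index


def search(index, query):
    """Return the ids of documents that contain every word in the query."""
    words = extract_words(query)
    if not words:
        return set()
    return set.intersection(*(index.get(w, set()) for w in words))
```

A search then only has to look up each query word and intersect the
resulting sets of document ids, which is the same job a join against a
word table would do in the database.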
--
Lester Caine - G8HFL
-----------------------------
Contact - https://lsces.co.uk/wiki/?page=contact
L.S.Caine Electronic Services - https://lsces.co.uk
EnquirySolve - https://enquirysolve.com/
Model Engineers Digital Workshop - https://medw.co.uk
Rainbow Digital Media - https://rainbowdigitalmedia.co.uk