How can I scan large books into a searchable format that retains the
page image? PDF is OK, but it takes too long to search the generated
files.
I assume you mean "PDF with 'invisible' text-behind-image". Yeah, if
the document is large, it takes a long time to grep. I don't understand
why it's so slow, since all you really have to do is the equivalent of
running "zgrep $WORD" on each of the individual compressed text blocks.
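
If you want to roll that yourself, something along these lines ought to
do it. This is only a sketch, and the choice of CAM::PDF (and treating
its per-page text extraction as "good enough") is my own assumption, not
something any particular tool actually does:

#!/usr/bin/perl
# Sketch: pull the text layer out of each page of a text-behind-image
# PDF and report which pages contain a word.  Assumes the CAM::PDF
# module from CPAN; everything here is illustrative.
use strict;
use warnings;
use CAM::PDF;

my ($word, $file) = @ARGV;
die "usage: $0 WORD FILE.pdf\n" unless defined $word && defined $file;

my $pdf = CAM::PDF->new($file) or die "can't read $file: $CAM::PDF::errstr\n";

for my $page (1 .. $pdf->numPages()) {
    my $text = $pdf->getPageText($page);   # decompresses the text blocks for us
    next unless defined $text;
    print "'$word' found on page $page\n" if $text =~ /\Q$word\E/i;
}

You still pay for parsing the whole PDF on every search, though, which
is probably a big part of why grepping these files feels slow.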
You may have to invent your own format if PDF with text-behind-image
won't work and you need to locate the words fairly precisely on the
page. If all you need is "$WORD is on page 5", just save the OCRed text
for each page in its own file (page0001.txt, page0002.txt ...
pageNNNN.txt), run grep -l $WORD across the text files, and view the
page image corresponding to each match in your favorite image viewer;
there's a sketch of that below. That may not work for you; if so, you
may have to deal with PDF, since everyone can read it.
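
Here's a minimal sketch of that per-page setup. The file naming
(page*.txt next to page*.tif) and the viewer (xv) are just placeholders
for whatever you actually use:

#!/usr/bin/perl
# Sketch of the "one OCR text file per page" search.  Assumes the OCR
# output lives in page0001.txt ... pageNNNN.txt next to matching
# page0001.tif ... pageNNNN.tif; 'xv' is a stand-in for your viewer.
use strict;
use warnings;

my $word = shift or die "usage: $0 WORD\n";

for my $txt (sort glob('page*.txt')) {
    open my $fh, '<', $txt or next;
    my $text = do { local $/; <$fh> };   # slurp the whole page
    close $fh;
    next unless defined $text && $text =~ /\Q$word\E/i;

    (my $image = $txt) =~ s/\.txt$/.tif/;
    print "'$word' is on $image\n";
    system('xv', $image) == 0 or warn "couldn't launch viewer for $image\n";
}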
The company that I work for has done a lot of work in this area. We use
separate files: one TIFF per page plus one corresponding XML file, with
the text in ISO-8859-1 and coordinate information for each word embedded
in the XML.
Using that, it's pretty easy and fast to grep the XML for a word and
then draw colored rectangles over every instance of the word you're
interested in on a copy of the page image. (One Perl script using
Image::Magick; it took about 2 hours to write and debug.) Unfortunately,
the process we use is geared towards high volumes of stuff and not
really suitable for the casual user ATM. *shrug*. Holler at my e-mail
(mind the SPAM TRAP) if you want to explore this potential commercial
solution.
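
For anyone who wants to try the same trick at home, a toy version looks
roughly like this. The XML layout is made up for the example (it is not
our actual schema), and XML::Simple plus Image::Magick are just the
obvious CPAN picks:

#!/usr/bin/perl
# Toy version of the "draw boxes around every hit" trick.  The XML
# layout is invented for this example:
#   <page image="page0001.tif">
#     <word x1="120" y1="340" x2="210" y2="368">searchable</word>
#     ...
#   </page>
use strict;
use warnings;
use XML::Simple;
use Image::Magick;

my ($word, $xmlfile) = @ARGV;
die "usage: $0 WORD PAGE.xml\n" unless defined $word && defined $xmlfile;

my $page = XMLin($xmlfile, ForceArray => ['word']);
my @hits = grep { $_->{content} =~ /\Q$word\E/i } @{ $page->{word} || [] };
exit 0 unless @hits;

my $img = Image::Magick->new;
my $err = $img->Read($page->{image});
die $err if $err;

# One red rectangle over each instance of the word, written to a copy
# of the page image so the original TIFF stays untouched.
for my $w (@hits) {
    $img->Draw(primitive   => 'rectangle',
               points      => "$w->{x1},$w->{y1} $w->{x2},$w->{y2}",
               stroke      => 'red',
               fill        => 'none',
               strokewidth => 2);
}
$err = $img->Write('highlighted.png');
die $err if $err;

Run it as "perl highlight.pl searchable page0001.xml" and look at
highlighted.png; the script name and output file are, again, just made
up for the example.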