OCR of image PDFs from the command line - any ideas?

Ed

Hello wizards,

I need to accomplish the following task:

- Iterate through a large directory structure of files.
- For each file found that is an image-only PDF (no text),
  OCR the file and save it in the same folder where it was
  found as origfilename_OCRed (PDF text format).

Despite a lot of searching and trying several OCR programs,
I have not been able to find a solution for OCRing from the
command line and converting multiple image PDFs into
multiple text PDF documents. I'd be happy with a solution
on either Windows or Linux that doesn't cost huge $ and is
reasonably accurate. Neither OmniPage nor FineReader, for
instance, appears to have command-line options.

As a bonus, I'd love any ideas on how to recognize from the
command line whether a PDF file is image-only or text, since
I only want to OCR the image PDF files.

Thanks in advance!
--Ed Rozenberg
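
The directory walk and the origfilename_OCRed naming described above could be sketched roughly as follows. This is only an illustration in Python: the function names are made up for the example, and skipping files that already have an OCRed copy is an assumption, not part of the original request. The OCR step itself and the image-only check are left out.

```python
import os
from pathlib import Path


def ocr_output_path(pdf_path: Path) -> Path:
    """Return the sibling path 'origfilename_OCRed.pdf' for a given PDF."""
    return pdf_path.with_name(pdf_path.stem + "_OCRed" + pdf_path.suffix)


def find_pdfs_needing_ocr(root: Path):
    """Yield (source, target) pairs for PDFs under root with no OCRed copy yet."""
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in sorted(filenames):
            path = Path(dirpath) / name
            if path.suffix.lower() != ".pdf":
                continue
            if path.stem.endswith("_OCRed"):
                continue  # output of an earlier run, not a source file
            target = ocr_output_path(path)
            if not target.exists():  # assumption: never redo finished files
                yield path, target
```

Each yielded pair names a source file and the place where its OCRed version would be saved, in the same folder as the source.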
 
Ed said:
Despite a lot of searching and trying several OCR programs,
I have not been able to find a solution for OCRing from the
command line and converting multiple image PDFs into
multiple text PDF documents. I'd be happy with a solution
on either Windows or Linux that doesn't cost huge $ and is
reasonably accurate. Neither OmniPage nor FineReader, for
instance, appears to have command-line options.

As a bonus, I'd love any ideas on how to recognize from the
command line whether a PDF file is image-only or text, since
I only want to OCR the image PDF files.

The utility pdftotext converts a PDF file to text, and pdfimages
extracts the images from a PDF file. Both work under Linux and are open
source, so it should be possible to get them to work under Windows,
too. You can find them, for example, in the xpdf-utils package in Debian
Linux.
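
One way to use pdftotext for the image-only check might look like the sketch below. It assumes pdftotext is on the PATH and relies on the convention that "-" as the output name sends the text to stdout. The 20-character threshold and the function names are arbitrary choices for illustration, not anything the tool prescribes.

```python
import subprocess


def extract_text(pdf_path: str) -> str:
    """Run pdftotext and return the extracted text ('-' means stdout)."""
    result = subprocess.run(
        ["pdftotext", pdf_path, "-"],
        capture_output=True,
        encoding="utf-8",
        errors="replace",
        check=True,
    )
    return result.stdout


def looks_image_only(text: str, min_chars: int = 20) -> bool:
    """Guess that a PDF is image-only if almost no visible text came out.

    min_chars is an assumed threshold; page separators (form feeds) and
    other whitespace are not counted.
    """
    visible = sum(1 for ch in text if not ch.isspace())
    return visible < min_chars
```

A call such as looks_image_only(extract_text("scan.pdf")) would then decide whether the file should go to the OCR step; "scan.pdf" is a placeholder name.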

I guess it should also be possible to feed the extracted images
into some OCR program, and have the OCR program then create a text
pdf file for you. Someone else might know more about that than I do.

- Dirk
 
Ed,

Have you looked at "OmniPage Agent" within the ScanSoft Omnipage Pro
features? Seems like it might cover your needs.

Fred
===============
 
["Followup-To:" header set to comp.periphs.scanners.]

This is par for the course when dealing with Windows programs, sadly
enough. They sell SDKs with bindings for C, Visual Baysick, and
possibly Java bindings for these commercial OCR engines, but that's
probably more money than you want to spend.

It's possible to control the TypeReader commercial OCR application with
DDE, but that sort of requires writing C/C++ code. I've done this;
holler at my e-mail (mind the SPAM TRAP) for some more information.

Ed said:
As a bonus, I'd love any ideas on how to recognize from the command
line whether a PDF file is image-only or text

Dirk said:
The utilities pdftotext [and] pdfimages convert a pdf file to text
[and] extract the images of a pdf file.

This is one possibility. The thing is, these utilities take some time
to run, particularly on large PDFs. There should be a reasonably simple
way to look at the raw PDF and determine whether it's full of images or
full of text, but I don't have time to gin up a utility to do that just
now.
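
A rough sketch of such a raw-file check, in Python, could look like this. It is a heuristic under stated assumptions, not a PDF parser: it only searches the raw bytes for font and image dictionary entries, and files that keep their object dictionaries inside compressed object streams (possible from PDF 1.5 on) may hide both markers and come back as "unknown". The function and pattern names are invented for the example.

```python
import re

# /Font as a whole name (not /FontDescriptor or /FontFile).
FONT_RE = re.compile(rb"/Font\b")
# Image XObjects are declared with /Subtype /Image.
IMAGE_RE = re.compile(rb"/Subtype\s*/Image\b")


def classify_raw_pdf(data: bytes) -> str:
    """Classify raw PDF bytes as 'text', 'image-only' or 'unknown'."""
    if FONT_RE.search(data):
        return "text"  # fonts present, so some text is probably drawn
    if IMAGE_RE.search(data):
        return "image-only"  # images but no fonts at all
    return "unknown"  # markers hidden or absent; fall back to pdftotext
```

Reading the file with open(path, "rb").read() and passing the bytes in avoids running an external program per file; anything classified as "unknown" would still need a slower check.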

Dirk said:
I guess it should also be possible to feed the extracted images into
some OCR program, and have the OCR program then create a text pdf file

Yes. The main problem with it is that the Free OCR engines that I've
seen are not really very good. If you have a commercial OCR engine
produce text, you can then feed that text into enscript, then into
ps2pdf. This will pretty much kill the layout, but it'll produce a text
PDF, no problem.
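
The enscript-then-ps2pdf route could be scripted roughly as below. This is a sketch that assumes both programs are installed and on the PATH; the function names and file paths are placeholders. As far as I know, enscript can exit with a nonzero status for mere warnings such as wrapped lines, so the sketch checks for the PostScript file instead of the exit code.

```python
import os
import subprocess


def text_to_pdf_commands(txt_path: str, ps_path: str, pdf_path: str):
    """Build the two command lines: text -> PostScript -> PDF."""
    return [
        ["enscript", "-p", ps_path, txt_path],  # -p names the output file
        ["ps2pdf", ps_path, pdf_path],
    ]


def text_to_pdf(txt_path: str, ps_path: str, pdf_path: str) -> None:
    """Run enscript, then ps2pdf, on OCR text that already exists."""
    enscript_cmd, ps2pdf_cmd = text_to_pdf_commands(txt_path, ps_path, pdf_path)
    subprocess.run(enscript_cmd, check=False)  # warnings may set the status
    if not os.path.exists(ps_path):
        raise RuntimeError("enscript produced no PostScript output")
    subprocess.run(ps2pdf_cmd, check=True)
```

The resulting PDF contains only the plain text, so the original page layout and images are lost, as noted above.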
 
Thanks for your ideas, everyone - it looks like there are few if any
options for command line use other than purchasing and developing
against SDKs. So I gave the built-in GUI automation options a go again:

- I tried the Omniscan Batch Agent again with no luck - it "choked"
  when I tried to feed it as few as 3 documents to be automatically
  converted.

- I was successful with the ABBYY FineReader Automation Manager. I
  set up a workflow including the steps Read -> Process -> Save,
  gave it an input directory "OCR" and an output directory "OCR_OUT",
  put 150 image PDF files in the OCR directory and ran the automation
  agent on it. Several hours later it produced 150 text PDFs as a
  result, and they look good. One thing that I find funny is that it
  loaded all the pages for all the documents first (1000's of pages),
  then OCR'd them one at a time. It then saved them as PDF files with
  the original source file names, which is what I wanted. There were a
  number of warnings regarding some pages that were too rotated and
  some other problems with a few of the pages, but otherwise it looks
  great. I haven't found a way to easily jump to the few error pages
  out of the 1000's of pages, but I'm happy enough for now with the
  results.

Regards,
--Ed
 
Ed said:
Thanks for your ideas everyone - it looks like there are few if any
options for command line use other than purchasing and developing
against SDK's. So I gave the built-in GUI automation options a go again:

ABBYY FineReader Pro does this like a charm. You can feed it a whole book
in PDF form and it will spit out a new version at the end that is still a
PDF but has the page images over the text and is much smaller. Or you can
output whatever you want.
 
Hi Ed,

We'd be happy to build exactly such a program for you. In fact, we have
several such programs, and have deployed them in scripting
environments.

We're able to get you a high accuracy rate by using multiple engines if
necessary.

Alternatively, you could send us the PDF files, we would convert them
and send them back to you. This is a good option if you have a one-time
requirement.

Contact us at (e-mail address removed) with your volumes, etc., and we
can work something out.

Regards,
Milind Joshi

IDEA TECHNOSOFT INC.
http://www.ideatechnosoft.com
 