pdf files to text?

Thread starter: John Uebersax

Hi Group,

This isn't a scanner question per se, but if it's too far off-topic
maybe someone can point me to another group:

Suppose one has a multi-page pdf file produced by scanning a text
document. (That is, suppose the pdf is just a graphical
representation of the original document.)

Is there software available with which you can supply such a pdf file
as input, and get as output a text translation?

Thanks in advance.

John Uebersax
 
I know my old version of Adobe Acrobat 5 (full paid version, not just the
reader) has an OCR module to it. I believe I had to download this free add
on from the Adobe website. You might check the newer version of Acrobat to
see if they have it.

Doug
 
John,

A scanned PDF file of text is basically a JPEG image, that is, a
compressed image with the associated compression artifacts. So yes, it
can certainly be OCR'd, but the results will not be as good as they
would be if a non-compressed image file (without compression artifacts)
were used. I would think any OCR program could do that, some probably
better than others. I'd say try OmniPage and ABBYY FineReader.
 
After I wrote the above, I got curious as to what kind of job my old
copy of OmniPage Pro 12 would do, so I ran a test. I took a page of
text that I had scanned to a TIFF file, then converted to PDF, and
asked OmniPage to OCR the PDF file. It did a surprisingly good job. To
see the results, look here:
http://freepages.genealogy.rootsweb.com/~charlieh/temp/

I posted a copy of the original PDF and the Word DOC file produced by
the OCR. The OCR was done with no corrections whatever, and no edits to
the Word file either, so as to give you a direct comparison of the PDF
to the DOC file.

The pertinent file names are A test.pdf and A test.doc.
 
You've had several good bits of advice, but here's one more....

I just installed PaperPort 11 for my new DocuMate 152 scanner. (I've been
using PaperPort 9 for several years with my HP all-in-one.) Anyway, I have,
on many, many occasions, scanned a document in .pdf format and, in PaperPort,
"dragged" the .pdf to the Word (or WordPerfect) "link". If it was a really
clean scan, the OCR conversion was relatively flawless. However, if you're
scanning a not-so-clear document, no OCR software will be able to make a
good translation. I also have used OmniPage. Its advantage over PaperPort
is that you can proofread the OCR as you go and make your corrections
before saving the document.
 
The compression artifacts are irrelevant and will in no way hinder the
quality of OCR as long as you didn't "over compress" the JPEG image. If
you don't see artifacts, neither will the OCR program; if the PDF file
size is still a couple hundred K per page for monochrome, or a megabyte
or so per page for color (both assuming 300dpi), you won't have an issue
due to the JPEG compression.

However, while you could export the PDF file to JPEGs of the individual
pages, OmniPage, I think, can take the PDF file directly as its input.
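
The file-size rule of thumb above can be sketched as a quick check. This is only an illustration, not something any of the OCR programs mentioned here do: the function names are made up, and the thresholds are an assumed reading of "a couple hundred K" as 200 KB and "a megabyte or so" as 1 MB per page, both for 300dpi scans.

```python
# Rough per-page size floors from the rule of thumb above (300dpi assumed).
# The exact byte values are assumptions: "a couple hundred K" is read as
# 200 KB and "a megabyte or so" as 1 MB.
MIN_BYTES_PER_PAGE = {
    "monochrome": 200 * 1024,
    "color": 1024 * 1024,
}


def bytes_per_page(file_size_bytes, page_count):
    """Average size of one page of the scanned PDF, in bytes."""
    if page_count <= 0:
        raise ValueError("page_count must be positive")
    return file_size_bytes / page_count


def likely_overcompressed(file_size_bytes, page_count, mode="monochrome"):
    """Return True if the average page is smaller than the rule-of-thumb floor.

    A True result only suggests the JPEG compression may have been heavy
    enough to hurt OCR; it is a hint, not a measurement of image quality.
    """
    return bytes_per_page(file_size_bytes, page_count) < MIN_BYTES_PER_PAGE[mode]
```

For example, a 10-page scan of 3 MB averages about 300 KB per page, so likely_overcompressed(3 * 1024 * 1024, 10) is False for monochrome, while the same file checked with mode="color" gives True. The file size itself can be read with os.path.getsize, and the page count has to come from whatever PDF viewer or tool you have at hand.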
 
I got curious as to what kind of job my old
copy of OmniPage Pro 12 would do, so I ran a test. I took a page of
text that I had scanned to a TIFF file, then converted to PDF, and
asked OmniPage to OCR the PDF file. It did a surprisingly good job.

I just tried it with ABBYY FineReader Pro - excellent results.

Peter Finney
Liphook
Hampshire
England
 