Direct OCR of electronic document

Hi Ho Silver

I have some electronic documents in the form of un-editable pictures in PDF
files. I need to convert them to editable text, not necessarily in PDF
format. What I have been doing is printing these PDF files, scanning them
on my HP 3970 scanner as a document (using the option to "Scan for editable
text (OCR)") and saving as an MS Word file. I am using the HP Photo &
Imaging Software Version 2.1 that came with the scanner. What I would like
to do is convert the original electronic documents directly from their PDF
picture file to the editable text without the printing step. My questions:

1. Is it possible for me to do this direct conversion with the HP software
I have?
2. If not, what are other ways I could do this; e.g. with other software?
3. Is there an internet online service that can do this without me buying
new software?

Thanks
 
Hi Ho Silver staggered into the Black Sun and said:
I have some PDF files. I need to convert them to editable text,
not necessarily in PDF format.

Huh? PDF is essentially a write-once format. Even with expensive crap
like Acrobrat (full) and Enfocus Pitstop, trying to change text in a PDF
is an exercise in pain and futility.

printing these PDF files, scanning them on my HP 3970 scanner as a
document (using [an OCR engine]) and saving as a MS Word file.
I would like to convert the original electronic documents directly
from PDF to editable text without the printing step. Is it possible
for me to do this direct conversion with the HP software I have?

Bundled software is almost always broken and/or lacking useful features.
I doubt there's a way to do that.

If not, what are other ways I could do this with other software?

This script requires ImageMagick, bash, and Xpdf. All of these should
be already installed on Real OSes, but they're also available for 'Doze.

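# Render every page as a 300 dpi monochrome bitmap, one .pbm file per page.
pdftoppm -r 300 -mono file.pdf prefix
for i in prefix*.pbm ; do
    j="${i%.pbm}.tif"    # swap only the trailing .pbm for .tif
    # Write a Group4-compressed TIFF; -density records 300 pixels per inch.
    # The bitmap is deleted only if the conversion succeeded.
    convert "$i" -density 300 -units PixelsPerInch -compress Group4 "$j" \
        && rm -f "$i"
done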

(Run the TIFFs through an OCR engine. You may have to do that manually, since
too few commercial OCR engines are scriptable in any sane way.)
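
If a scriptable engine is available, the OCR step could be automated along these lines. This is only a sketch: it assumes the free Tesseract command-line tool, which nobody in this thread mentioned, where `tesseract page.tif page` writes its text to page.txt. The function name ocr_tiffs and the file names are made up for illustration.

```shell
#!/usr/bin/env bash
# Sketch: OCR every TIFF whose name starts with PREFIX and collect the
# text in OUT. Assumes the Tesseract CLI; ocr_tiffs is a hypothetical name.
ocr_tiffs() {
    local prefix=$1 out=$2 tif base
    command -v tesseract >/dev/null || { echo "tesseract not found" >&2; return 1; }
    : > "$out"                        # start with an empty output file
    for tif in "$prefix"*.tif; do
        [ -e "$tif" ] || continue     # the glob matched nothing
        base=${tif%.tif}
        # tesseract IMAGE OUTBASE writes the recognized text to OUTBASE.txt
        tesseract "$tif" "$base" >/dev/null 2>&1 || return 1
        cat "$base.txt" >> "$out"
    done
}

# Example call (commented out so that sourcing this file has no side effects):
# ocr_tiffs prefix all-pages.txt
```

The pages are appended in glob order, which matches page order as long as the page numbers in the file names are zero-padded.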

NOTE: Depending on how these PDFs are set up, you may want to use -gray
and -compress LZW instead of -mono and -compress Group4. With -gray,
pdftoppm writes .pgm files rather than .pbm, so change the loop's glob and
suffix to match. Try both on a short PDF and compare the image quality and
the OCRed text quality.
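
To make the mono-versus-gray comparison less fiddly, both variants could be wrapped in one function. This is a rough sketch, not a tested tool: pdf_to_tiffs is a hypothetical name, it assumes the same Xpdf and ImageMagick programs as the script, and it uses ImageMagick's -density option to record the resolution.

```shell
#!/usr/bin/env bash
# Sketch: convert a PDF to per-page TIFFs in "mono" or "gray" mode.
# Usage: pdf_to_tiffs mono|gray file.pdf prefix   (hypothetical helper)
pdf_to_tiffs() {
    local mode=$1 pdf=$2 prefix=$3 ext compress page
    case "$mode" in
        mono) ext=pbm; compress=Group4 ;;   # 1-bit pages, fax-style compression
        gray) ext=pgm; compress=LZW ;;      # grayscale pages, lossless LZW
        *) echo "usage: pdf_to_tiffs mono|gray file.pdf prefix" >&2; return 2 ;;
    esac
    # pdftoppm takes -mono or -gray and writes .pbm or .pgm files accordingly.
    pdftoppm -r 300 "-$mode" "$pdf" "$prefix" || return 1
    for page in "$prefix"*."$ext"; do
        [ -e "$page" ] || continue          # the glob matched nothing
        # Remove the intermediate bitmap only if the TIFF was written.
        convert "$page" -density 300 -units PixelsPerInch \
            -compress "$compress" "${page%."$ext"}.tif" && rm -f "$page"
    done
}

# Example calls (commented out so that sourcing this file has no side effects):
# pdf_to_tiffs mono short.pdf mono-test
# pdf_to_tiffs gray short.pdf gray-test
```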

Is there an internet service that can do this without me buying new
software?

Why would you need to buy software to do this? There are so many Free
tools out there that do so many things that there's very little need to
buy software in this modern age. (Unless you're not familiar with using
your computer to its full potential. Lots of people aren't, and they
pay for it with $, time, lost data, malware, and stupid problems.)
 
If you get the full version of Acrobat (not just the reader), the pages can
be directly exported as graphics pages (e.g. JPEG or TIFF).

Also, many OCR programs can directly accept PDF files as their input.

However, the software that comes with hardware (e.g. scanners) is
normally a low-end, stripped-down version. What you want is possible, but
you will probably have to buy some real software.
 
I have some electronic documents in the form of un-editable pictures
in PDF files. I need to convert them to editable text, not
necessarily in PDF format. ... What I would like to do is convert the
original electronic documents directly from their PDF picture file to
the editable text without the printing step. My questions:

1. Is it possible for me to do this direct conversion with the HP
software I have?

Dunno.

2. If not, what are other ways I could do this; e.g. with other
software?

The following worked for me in Windows XP with MS-Office 2003.

In the Acrobat viewer, print to "Microsoft Office Document Image Writer."
This is a virtual printer, like a PDF printer, but it uses a different file
format: MDI.

MDI files open in a "Microsoft Office Document Imaging" application. There,
I used:
Tools -> Send text to Word
to activate the OCR software that's in MS-Office.


3. Is there an internet online service that can do this
without me buying new software?

Dunno.
 
Dances With Crows said:
Huh? PDF is essentially a write-once format. Even with expensive crap
like Acrobrat (full) and Enfocus Pitstop, trying to change text in a PDF
is an exercise in pain and futility.

Actually, that depends a lot on the PDF; there are many kinds of PDF.
Some of them are really just completely stupid large clumps of pixels,
while others actually know about chapters, sections, headings, and
text. The latter are searchable, indexable, and more modifiable
than the former.
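
A quick way to tell the two kinds apart, assuming the pdftotext tool that ships with Xpdf is installed, is to extract the text and see whether anything comes out. The helper name has_text_layer is made up for this sketch, and it is only a rough heuristic: a scanned PDF carrying a junk hidden text layer would fool it.

```shell
#!/usr/bin/env bash
# Sketch: succeed if the PDF already contains extractable text.
# has_text_layer is a hypothetical helper; pdftotext is part of Xpdf.
has_text_layer() {
    local pdf=$1 chars
    # "-" sends the extracted text to stdout. Whitespace (including the
    # form feeds between pages) is stripped before counting characters.
    chars=$(( $(pdftotext "$pdf" - 2>/dev/null | tr -d '[:space:]' | wc -c) ))
    [ "$chars" -gt 0 ]
}

# Example call (commented out so that sourcing this file has no side effects):
# if has_text_layer file.pdf; then echo "already searchable"; else echo "needs OCR"; fi
```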

As a useful bit of information, Adobe Acrobat, as far as I know, has
the ability to OCR non-editable bitmap PDF files into searchable,
editable ones. Pretty nifty considering the original non-editable
probably came straight out of a scanner!
 
Wow, I didn't know that any of that stuff was even in Office. Very
interesting (although I'm not sure it's the best answer to the original
poster's question).
 
I don't know about the HP software, but in OmniPage one would use
"File-Import" rather than "File-Open". Try it.

Maris
 
MyVeryOwnSelf said:
The following worked for me in Windows XP with MS-Office 2003.

In the Acrobat viewer, print to "Microsoft Office Document Image Writer."
This is a virtual printer, like a PDF printer, but it uses a different file
format: MDI.

MDI files open in a "Microsoft Office Document Imaging" application. There,
I used:
Tools -> Send text to Word
to activate the OCR software that's in MS-Office.

Thanks very much for this post, I learned a lot; I had not even known about
"Microsoft Office Document Image Writer"! But I have now installed it from
my Office XP disk. Here is a way that works for me now:

1. Use the Acrobat PDF selection tool to copy to the clipboard.
2. Open "Microsoft Office Document Imaging" and choose Edit -> Paste Pages.
3. Do the OCR and/or convert to a Word document.
4. Looks pretty good.

Thanks again!
 