Dances said:
Don said:
Richard Evans wrote:
[I'd like to scan paper documents to] searchable PDF
The first few lines of results appear to [show] me that OCR is
[being used]. Does OCR with this new Xerox run on [its] own? Or is
it like every other OCR [which] requires manual corrections?
A perfect OCR engine doesn't exist. OCR is a difficult problem [...]
So if you need 100% accuracy, you've gotta proof the results. If you
don't proof them, you get things like M turning into IVI.
Also, if you have to proof OCRed documents to 100% accuracy, there's
really no reason to keep the image (unless there are graphics on the
page). Text is smaller and easy to convert to other formats. PDFs
are write-once.
I've likely OCR'd more [documents] [than] you'll ever consider.
O RLY? From 2000..2005, I was the principal tester and fixer on a very
large document conversion project, and had to do code-monkey things on
that project as well. Hundreds of thousands of pages from the NYT, WSJ,
Boston Globe, Washington Post, and tons of smaller academic journals
were processed through code I was responsible for. And I had to
spot-check far too many of those pages for various operator errors.
And I've reverse-engineered large chunks of the file format of a certain
OCR engine for company purposes. Mostly extracting info that the engine
stores but DDE doesn't make available, but whatever. So I'd say I have
a fair idea of the ways that OCR engines can fail, and a lot more
experience than you credit me with.[0]
It's a much simpler and less time-consuming task to OCR "properly"
into a text editor or word [processing] software[,] as compared to
adding text behind an image.
If the first option is easier than the second option, then the software
you're using to do the second option is poorly designed. I worked with
another guy to modify the company's conversion software so that it could
produce PDFs with text-behind-image. It worked reasonably well. Too
bad the clients decided they didn't want that feature.
Most beginners [in] scanning are under the impression that OCR and
most scanning is [a] uniform task[,] and nothing could be [further]
from the truth. Far too many [strange] circumstances exist on each
project. Each new and unrelated document may require a new
[configuration] or [a] revision of [your] [configuration].
If you can get a decent scan at 300 DPI with good contrast, the scan
isn't skewed, the fonts used are sane, there are no graphics or weird
layouts, and there's no page curl or broken type, OCR just might get 98
or 99% accuracy without much effort on your part. YDocumentsMV.
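To put rough numbers on that, here's a minimal sketch of how character accuracy could be measured against a proofed copy, using plain edit distance. The function names, the sample strings, and the 2000-characters-per-page figure are my own assumptions for illustration, not anything a particular OCR engine reports.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance: fewest single-character inserts,
    deletes, and substitutions needed to turn a into b."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete from a
                            curr[j - 1] + 1,      # insert into a
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def char_accuracy(truth: str, ocr: str) -> float:
    """Fraction of the proofed text the OCR output got right."""
    if not truth:
        return 1.0 if not ocr else 0.0
    return max(0.0, 1.0 - edit_distance(truth, ocr) / len(truth))


def errors_per_page(accuracy: float, chars_per_page: int) -> int:
    """Expected number of wrong characters on one page."""
    return round((1.0 - accuracy) * chars_per_page)


# Hypothetical sample: one "M" misread as "IVI".
truth = "Monday Morning Mail"
ocr = "IVIonday Morning Mail"

print(edit_distance(truth, ocr))            # one misread M costs 3 edits
print(f"{char_accuracy(truth, ocr):.3f}")
print(errors_per_page(0.98, 2000))          # assumed 2000 chars per page
print(errors_per_page(0.99, 2000))
```

Under that 2000-character assumption, 98% still leaves about 40 wrong characters on every page and 99% about 20, which is why "pretty accurate" and "doesn't need proofing" are different things.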
Simply piling a load of paper onto a sheet feeder and going about
other [tasks], while your scanner [proceeds] both un-monitored and
operating itself, will generally result in a pile of crap.
True dat. OCR engines have improved a bit, but you still need human
intervention to get really good data. People are still much better at
grokking malformed text than computers are (as shown by the "captcha"
thing some webforums use).
[0] Does "Proquest" ring a bell?