Scanning to searchable PDF

Richard Evans

I need to scan 8.5 x 11 loose sheets into searchable PDFs in large
batches: 400 - 600 at a time.

I want a sheet feed scanner with a decent input tray (50+ pages) that
will scan to searchable PDFs quickly, accurately, and with a minimum
of fuss. Resolution is not terribly important. 600 dpi would do. Color
is not important.

I'm looking online at a Xerox 152 that seems to do what I want, but I
can't tell from the picture how big the feed tray is.

Any thoughts on the Xerox? Any other models that might suit me?

The Xerox is $600. I might go as high as $800. Of course, cheaper is
better.
 
By "how big the feeder is", do you mean its capacity? If so, the Xerox
site (http://www.xeroxscanners.com/default.asp?pageid=140) states that
it holds 50 sheets of 20 lb paper and scans at 30 images per minute in
duplex mode. It is rated for a duty cycle of 1,000 sheets per day.

Hope that helps,
Dave

PS - Google is your friend
 
Richard Evans said:
I'm looking online at a Xerox 152 that seems to do what I want, but I
can't tell from the picture how big the feed tray is.

Any thoughts on the Xerox? Any other models that might suit me?

The Xerox is $600. I might go as high as $800. Of course, cheaper is
better.

OK, I ordered the Xerox. Got it at Amazon for $435. I did read one
scary review from a guy who said scanning speed for searchable PDFs
was only 1.6 (that's one point six) pages per minute. Anyone have a
similar experience?
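
For a rough sense of what those two speeds would mean for a 400 - 600
sheet batch, here is a back-of-the-envelope sketch in Python. It assumes
single-sided sheets (one image per sheet), treats the 30 images per
minute from the Xerox site as the raw scan rate and the 1.6 pages per
minute from the review as an end-to-end rate, and ignores the time spent
reloading the tray. The function name batch_plan is made up for
illustration.

```python
import math


def batch_plan(sheets, tray_capacity, pages_per_minute):
    """Return (tray loads needed, minutes of scanning) for one batch.

    Assumes one image per sheet and no pause between tray loads.
    """
    loads = math.ceil(sheets / tray_capacity)
    minutes = sheets / pages_per_minute
    return loads, minutes


# 50-sheet tray from the Xerox site; the two speeds quoted in the thread.
for label, rate in (("rated speed", 30), ("speed from the review", 1.6)):
    for sheets in (400, 600):
        loads, minutes = batch_plan(sheets, 50, rate)
        print(f"{sheets} sheets at {label}: {loads} loads, {minutes:.0f} min")
```

Under those assumptions a 600-sheet batch is 12 tray loads either way.
At 30 per minute that is 20 minutes of scanning; at 1.6 per minute it is
about 375 minutes, or more than six hours.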
 
Richard Evans said:
[I'd like to scan paper documents to] searchable PDF
http://www.google.com/search?hl=en&q="+searchable+PDF
The first few lines of results appear to [show] me that OCR is [being
used].

Well, duh. In general, "searchable PDF"s have each page as an image,
with invisible text behind that image. This approach sorta works. If
the type is clean, not skewed, and in a reasonable font, you can get
semi-accurate OCR out of it. It isn't 100% perfect, more like 98%.
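
To make the "image with invisible text" idea concrete, here is a minimal
Python sketch of what one page's content stream might look like. It
relies on the PDF text rendering mode 3 ("3 Tr"), which draws text
without filling or stroking it, so the words can be searched and
selected but not seen. Everything else is an assumption for
illustration: the Word class, the function names, and the resource
names /Im0 and /F1 are hypothetical, and a real PDF also needs the
image and font objects, a page tree, and a cross-reference table, none
of which are shown.

```python
from dataclasses import dataclass


@dataclass
class Word:
    text: str     # what the OCR engine thinks the word is
    x: float      # left edge, in PDF points from the page's left side
    y: float      # baseline, in PDF points from the page's bottom
    size: float   # font size in points


def escape_pdf_string(s: str) -> str:
    """Escape the characters that are special inside a PDF literal string."""
    return s.replace("\\", "\\\\").replace("(", "\\(").replace(")", "\\)")


def page_content_stream(page_w: float, page_h: float, words) -> str:
    """Build a content stream: the page scan, then invisible OCR text."""
    lines = [
        "q",                                  # save graphics state
        f"{page_w:g} 0 0 {page_h:g} 0 0 cm",  # scale the image to the page
        "/Im0 Do",                            # paint the scanned image
        "Q",                                  # restore graphics state
        "BT",                                 # begin text object
        "3 Tr",                               # rendering mode 3: invisible
    ]
    for w in words:
        lines.append(f"/F1 {w.size:g} Tf")              # select font and size
        lines.append(f"1 0 0 1 {w.x:g} {w.y:g} Tm")     # absolute position
        lines.append(f"({escape_pdf_string(w.text)}) Tj")
    lines.append("ET")                                  # end text object
    return "\n".join(lines)


# An 8.5 x 11 inch page is 612 x 792 points.
print(page_content_stream(612, 792, [Word("Invoice", 72, 700, 12)]))
```

Because the text is invisible, it makes no visual difference whether it
is drawn before or after the image. What matters for searching is that
each word sits at the same position as its picture on the scan.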
Does OCR with this new Xerox run on [its] own? Or is it like every
other OCR [which] requires manual corrections?

A perfect OCR engine doesn't exist. OCR is a difficult problem--it's a
special case of the vision problem, actually, and anyone who's had a bit
of CS knows how difficult vision is. So if you need 100% accuracy,
you've gotta proof the results. If you don't proof them, you get things
like M turning into IVI, and c turning into o or vice versa.
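
One way to soften the effect of those confusions on searching, if you
don't proof the text, is to fold the known look-alikes together before
comparing. The Python sketch below is only an illustration: the
confusion list contains just the two examples mentioned above, the
function names are made up, and folding c and o together will produce
some false matches.

```python
# Look-alike pairs: (what the OCR engine may emit, canonical form).
# Only the two examples from this thread; a real list would be longer.
CONFUSIONS = [("IVI", "M"), ("o", "c")]


def canonical(s: str) -> str:
    """Replace each known look-alike with its canonical form."""
    for variant, canon in CONFUSIONS:
        s = s.replace(variant, canon)
    return s


def fuzzy_contains(ocr_text: str, query: str) -> bool:
    """True if query occurs in ocr_text once look-alikes are folded."""
    return canonical(query) in canonical(ocr_text)


print(fuzzy_contains("See you IVIonday morning", "Monday"))
print(fuzzy_contains("the dog", "cat"))
```

The first call matches even though the OCR text says "IVIonday"; the
second does not. The cost is precision: "oat" and "cat" become the same
word, so this trades missed hits for extra ones.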

Also, if you have to proof OCRed documents to 100% accuracy, there's
really no reason to keep the image (unless there are graphics on the
page). Text is smaller and easy to convert to other formats. PDFs are
write-once.
 
Don said:
http://www.google.com/search?hl=en&q="+searchable+PDF"&btnG=Google+Search

The first few lines of results appear to show me that OCR is being used.

Does OCR with this new Xerox run on its own?

Or, like every other OCR, does it require manual corrections as part of
the OCR process?

I won't know until it arrives, probably a week away. The sales blurb
says:

"Convert documents into searchable PDFs with Visioneer OneTouch
technology."
 
Richard Evans said:
[I'd like to scan paper documents to] searchable PDF
http://www.google.com/search?hl=en&q="+searchable+PDF
The first few lines of results appear to [show] me that OCR is [being
used].

Well, duh. In general, "searchable PDF"s have each page as an image,
with invisible text behind that image. This approach sorta works. If
the type is clean, not skewed, and in a reasonable font, you can get
semi-accurate OCR out of it. It isn't 100% perfect, more like 98%.
Does OCR with this new Xerox run on [its] own? Or is it like every
other OCR [which] requires manual corrections?

A perfect OCR engine doesn't exist. OCR is a difficult problem--it's a
special case of the vision problem, actually, and anyone who's had a bit
of CS knows how difficult vision is. So if you need 100% accuracy,
you've gotta proof the results. If you don't proof them, you get things
like M turning into IVI, and c turning into o or vice versa.

Also, if you have to proof OCRed documents to 100% accuracy, there's
really no reason to keep the image (unless there are graphics on the
page). Text is smaller and easy to convert to other formats. PDFs are
write-once.

I've likely OCR'd more documents than you'll ever consider.

It's a much simpler and less time-consuming task to OCR "properly" into
a text editor or word processing software than to add text behind an
image.

I've had a few instances (out of tens of thousands) that OCR'd with 100%
accuracy; however, that result depends on a handful of factors, all
related to scanner and OCR software efficiency.

BTW, I would not have bothered replying to this thread, except that most
beginners with scanning are under the impression that OCR, and scanning
in general, is a uniform task, and nothing could be further from the
truth. Far too many complicating circumstances exist on each project.
Each new and unrelated document may require a new configuration or a
revision of your existing one.

Simply piling a load of paper onto a sheet feeder and going about other
tasks, while your scanner runs unmonitored, will generally result in a
pile of crap rather than the output you initially wanted.
 
Dances said:
Don said:
Richard Evans wrote
[I'd like to scan paper documents to] searchable PDF
The first few lines of results appear to [show] me that OCR is
[being used]. Does OCR with this new Xerox run on [its] own? Or is
it like every other OCR [which] requires manual corrections?
A perfect OCR engine doesn't exist. OCR is a difficult problem [...]
So if you need 100% accuracy, you've gotta proof the results. If you
don't proof them, you get things like M turning into IVI,

Also, if you have to proof OCRed documents to 100% accuracy, there's
really no reason to keep the image (unless there are graphics on the
page). Text is smaller and easy to convert to other formats. PDFs
are write-once.

Don said:
I've likely OCR'd more [documents] [than] you'll ever consider.

O RLY? From 2000..2005, I was the principal tester and fixer on a very
large document conversion project, and had to do code-monkey things on
that project as well. Hundreds of thousands of pages from the NYT, WSJ,
Boston Globe, Washington Post, and tons of smaller academic journals
were processed through code I was responsible for. And I had to
spot-check far too many of those pages for various operator errors.
And I've reverse-engineered large chunks of the file format of a certain
OCR engine for company purposes. Mostly extracting info that the engine
stores but DDE doesn't make available, but whatever. So I'd say I have
a fair idea of the ways that OCR engines can fail, and a lot more
experience than you credit me with.[0]

Don said:
It's a much simpler and a less time consuming task to OCR "properly"
into a text editor or word [processing] software[,] as compared to
adding text behind an image.

If the first option is easier than the second option, then the software
you're using to do the second option is poorly designed. I worked with
another guy to modify the company's conversion software so that it could
produce PDFs with text-behind-image. It worked reasonably well. Too
bad the clients decided they didn't want that feature.

Don said:
most beginners [in] scanning are under the impression that OCR and
most scanning is [a] uniform task[,] and nothing could be [further]
from the truth. Far too many [strange] circumstances exist on each
project. Each new and unrelated document may require a new
[configuration] or [a] revision of [your] [configuration].

If you can get a decent scan at 300 DPI with good contrast, the scan
isn't skewed, the fonts used are sane, there are no graphics or weird
layouts, and there's no page curl or broken type, OCR just might get 98
or 99% accuracy without much effort on your part. YDocumentsMV.
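
To put 98 or 99% in perspective, here is a small Python sketch of the
arithmetic. The 2,000 characters per page is an assumed figure for a
typical typed sheet, not something measured, and treating every
character error as independent is a simplification. The function names
are made up for illustration.

```python
def expected_errors(chars_per_page: int, accuracy: float) -> float:
    """Average number of wrong characters on one page."""
    return chars_per_page * (1.0 - accuracy)


def word_hit_probability(accuracy: float, word_length: int) -> float:
    """Chance that every character of a word was recognized correctly,
    assuming character errors are independent of each other."""
    return accuracy ** word_length


CHARS_PER_PAGE = 2000  # assumption: a full page of typed text

for accuracy in (0.98, 0.99):
    errors = expected_errors(CHARS_PER_PAGE, accuracy)
    hit = word_hit_probability(accuracy, 8)
    print(f"{accuracy:.0%} accurate: about {errors:.0f} bad characters "
          f"per page, 8-letter word intact {hit:.0%} of the time")
```

Under those assumptions, 98% accuracy still leaves about 40 wrong
characters per page, and a search for an 8-letter word finds a given
occurrence only about 85% of the time. At 99% the figures are about 20
characters and about 92%.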

Don said:
Simply piling a load of paper onto a sheet feeder and going about
other [tasks], while your scanner [proceeds] both un-monitored and
operating itself, will generally result in a pile of crap

True dat. OCR engines have improved a bit, but you still need human
intervention to get really good data. People are still much better at
grokking malformed text than computers are (as shown by the "captcha"
thing some webforums use.)

[0] Does "Proquest" ring a bell?
 