Scanning to searchable PDF

Richard Evans · Feb 10, 2007

I need to scan 8.5 x 11 loose sheets into searchable PDFs in large
batches: 400 - 600 at a time.

I want a sheet feed scanner with a decent input tray (50+ pages) that
will scan to searchable PDFs quickly, accurately, and with a minimum
of fuss. Resolution is not terribly important. 600 dpi would do. Color
is not important.

I'm looking online at a Xerox 152 that seems to do what I want, but I
can't tell from the picture how big the feed tray is.

Any thoughts on the Xerox? Any other models that might suit me?

The Xerox is $600. I might go as high as $800. Of course, cheaper is
better.

Dave · Feb 10, 2007

Richard said:
I need to scan 8.5 x 11 loose sheets into searchable PDFs in large
batches: 400 - 600 at a time.

I want a sheet feed scanner with a decent input tray (50+ pages) that
will scan to searchable PDFs quickly, accurately, and with a minimum
of fuss. Resolution is not terribly important. 600 dpi would do. Color
is not important.

I'm looking online at a Xerox 152 that seems to do what I want, but I
can't tell from the picture how big the feed tray is.

Any thoughts on the Xerox? Any other models that might suit me?

The Xerox is $600. I might go as high as $800. Of course, cheaper is
better

By "how big the feeder is," do you mean its capacity? If so, the Xerox
site "http://www.xeroxscanners.com/default.asp?pageid=140" (without the
quotes) states that it holds 50 sheets of 20 lb paper and scans at 30
images per minute in duplex mode. The rated duty cycle is 1,000 sheets
per day.

Hope that helps,
Dave

PS - Google is your friend

Richard Evans · Feb 11, 2007

Richard Evans said:
I'm looking online at a Xerox 152 that seems to do what I want, but I
can't tell from the picture how big the feed tray is.

Any thoughts on the Xerox? Any other models that might suit me?

The Xerox is $600. I might go as high as $800. Of course, cheaper is
better.

OK, I ordered the Xerox. Got it at Amazon for $435. I did read one
scary review from a guy who said scanning speed for searchable PDFs
was only 1.6 (that's one point six) pages per minute. Anyone have a
similar experience?
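
For a rough sense of what that review figure would mean for a 600-sheet batch, here is a minimal sketch of the arithmetic. The 50-sheet tray and the 30 images per minute duplex rate are the figures quoted from the Xerox site; treating that rate as 15 single-sided sheets per minute is an assumption, and `batch_estimate` is just a hypothetical helper name.

```python
import math

def batch_estimate(sheets, tray_capacity, pages_per_minute):
    """Return (tray_loads, minutes) for one batch of single-sided sheets."""
    tray_loads = math.ceil(sheets / tray_capacity)
    minutes = sheets / pages_per_minute
    return tray_loads, minutes

# 1.6 ppm is the reviewer's reported searchable-PDF speed.
# 15 ppm assumes the 30 images/minute duplex rate means 15 sheets/minute.
for label, ppm in (("review figure", 1.6), ("assumed rated speed", 15.0)):
    loads, minutes = batch_estimate(600, 50, ppm)
    print(f"{label}: {loads} tray loads, {minutes:.0f} minutes")
```

Either way the tray needs a dozen refills per batch; the difference is whether the batch takes well under an hour or most of a working day.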

Don · Feb 11, 2007

searchable PDF

http://www.google.com/search?hl=en&q="+searchable+PDF"&btnG=Google+Search

The first few lines of results suggest to me that OCR is being used.

Does OCR with this new Xerox run on its own?

Or, like every other OCR, does it require manual corrections as part of
the OCR process?

Dances With Crows · Feb 11, 2007

Richard Evans said:

[I'd like to scan paper documents to] searchable PDF

http://www.google.com/search?hl=en&q="+searchable+PDF
The first few lines of results appear to [show] me that OCR is [being
used].

Well, duh. In general, "searchable PDF"s have each page as an image,
with invisible text behind that image. This approach sorta works. If
the type is clean, not skewed, and in a reasonable font, you can get
semi-accurate OCR out of it. It isn't 100% perfect, more like 98%.
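
To make the image-plus-invisible-text idea concrete, here is a minimal sketch of the page content stream such a PDF might carry. It builds only the content stream, not a complete PDF file, and is not what any particular scanner software emits. The names `Im0` and `F1` are hypothetical resource names, the word positions are made up, and the text is assumed to be plain ASCII. The operator `3 Tr` selects PDF text rendering mode 3, which neither fills nor strokes the glyphs, so the text can be searched and selected but paints nothing.

```python
def escape_pdf_text(text):
    """Escape backslashes and parentheses for a PDF literal string."""
    return text.replace("\\", "\\\\").replace("(", "\\(").replace(")", "\\)")

def page_content_stream(width_pt, height_pt, words,
                        image_name="Im0", font_name="F1"):
    """Build a content stream: one full-page image plus invisible OCR words.

    words is a list of (x_pt, y_pt, font_size_pt, text) tuples, with
    positions in PDF points measured from the bottom-left page corner.
    """
    ops = [
        "q",                                    # save graphics state
        f"{width_pt} 0 0 {height_pt} 0 0 cm",   # scale unit image to the page
        f"/{image_name} Do",                    # paint the scanned image
        "Q",                                    # restore graphics state
        "BT",                                   # begin text object
        "3 Tr",                                 # rendering mode 3: invisible
    ]
    for x, y, size, text in words:
        ops.append(f"/{font_name} {size} Tf")   # select font and size
        ops.append(f"1 0 0 1 {x} {y} Tm")       # absolute text position
        ops.append(f"({escape_pdf_text(text)}) Tj")
    ops.append("ET")                            # end text object
    return "\n".join(ops)

# A letter-size page is 612 x 792 points (8.5 x 11 inches at 72 per inch).
print(page_content_stream(612, 792, [(72, 720, 12, "Scanning to searchable PDF")]))
```

Because mode 3 text paints nothing, it does not matter much whether the text is emitted before or after the image; this sketch happens to emit it after.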

Does OCR with this new Xerox run on [its] own? Or is it like every
other OCR [which] requires manual corrections?

A perfect OCR engine doesn't exist. OCR is a difficult problem--it's a
special case of the vision problem, actually, and anyone who's had a bit
of CS knows how difficult vision is. So if you need 100% accuracy,
you've gotta proof the results. If you don't proof them, you get things
like M turning into IVI, and c turning into o or vice versa.
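
If the text layer is left unproofed, a search can still be made somewhat tolerant of those specific confusions. The following is a minimal sketch under loud assumptions: the confusion table holds only the two examples above, matching is case-sensitive, and `fold` and `tolerant_find` are hypothetical helper names, not part of any OCR package. Folding "o" into "c" also makes unrelated words collide, so this trades precision for recall.

```python
# Known confusions, written as (what OCR may emit, canonical form).
CONFUSIONS = (("IVI", "M"), ("o", "c"))

def fold(text):
    """Collapse easily confused character sequences into one canonical form."""
    for seen, canonical in CONFUSIONS:
        text = text.replace(seen, canonical)
    return text

def tolerant_find(query, ocr_text):
    """True if query occurs in ocr_text once both sides are folded."""
    return fold(query) in fold(ocr_text)

print("Mexico" in "Trip to IVIexioo")              # plain substring search
print(tolerant_find("Mexico", "Trip to IVIexioo")) # search after folding
```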

Also, if you have to proof OCRed documents to 100% accuracy, there's
really no reason to keep the image (unless there are graphics on the
page). Text is smaller and easy to convert to other formats. PDFs are
write-once.

Richard Evans · Feb 11, 2007

Don said:
http://www.google.com/search?hl=en&q="+searchable+PDF"&btnG=Google+Search

The first few lines of results suggest to me that OCR is being used.

Does OCR with this new Xerox run on its own?

Or, like every other OCR, does it require manual corrections as part of
the OCR process?

I won't know until it arrives, probably a week away. The sales blurb
says:

"Convert documents into searchable PDFs with Visioneer OneTouch
technology."

Don · Feb 11, 2007

I won't know until it arrives, probably a week away. The sales blurb
says:

"Convert documents into searchable PDFs with Visioneer OneTouch
technology."

http://www.google.com/search?hl=en&q="Visioneer+OneTouch"&btnG=Google+Search

http://www.visioneer.com/company/news/releases/pr_120203.html

It uses ScanSoft's "PaperPort".

Is Xerox selling Visioneer scanners under its own brand name?

Don · Feb 11, 2007

Richard Evans said:

[I'd like to scan paper documents to] searchable PDF

http://www.google.com/search?hl=en&q="+searchable+PDF
The first few lines of results appear to [show] me that OCR is [being
used].

Well, duh. In general, "searchable PDF"s have each page as an image,
with invisible text behind that image. This approach sorta works. If
the type is clean, not skewed, and in a reasonable font, you can get
semi-accurate OCR out of it. It isn't 100% perfect, more like 98%.

Does OCR with this new Xerox run on [its] own? Or is it like every
other OCR [which] requires manual corrections?

A perfect OCR engine doesn't exist. OCR is a difficult problem--it's a
special case of the vision problem, actually, and anyone who's had a bit
of CS knows how difficult vision is. So if you need 100% accuracy,
you've gotta proof the results. If you don't proof them, you get things
like M turning into IVI, and c turning into o or vice versa.

Also, if you have to proof OCRed documents to 100% accuracy, there's
really no reason to keep the image (unless there are graphics on the
page). Text is smaller and easy to convert to other formats. PDFs are
write-once.

I've likely OCR'd more documents than you'll ever consider.

It's a much simpler and less time-consuming task to OCR "properly" into
a text editor or word processing software than to add text behind an
image.

I've had a few instances (out of tens of thousands) that OCR'd with 100%
accuracy; however, that result depends on a handful of factors, all
related to scanner and OCR software efficiency.

BTW, I would not have bothered replying to this thread, but most
beginners with scanning are under the impression that OCR, and scanning
in general, is a uniform task, and nothing could be further from the
truth. Far too many mitigating circumstances exist on each project. Each
new and unrelated document may require a new configuration or a revision
of your existing one.

Simply piling a load of paper onto a sheet feeder and going about other
agendas while your scanner proceeds unmonitored, operating itself, will
generally result in a pile of crap as output rather than what you
initially desired.

Dances With Crows · Feb 12, 2007

Dances said:

Don said:

Richard Evans wrote
[I'd like to scan paper documents to] searchable PDF
The first few lines of results appear to [show] me that OCR is
[being used]. Does OCR with this new Xerox run on [its] own? Or is
it like every other OCR [which] requires manual corrections?

A perfect OCR engine doesn't exist. OCR is a difficult problem [...]
So if you need 100% accuracy, you've gotta proof the results. If you
don't proof them, you get things like M turning into IVI,

Also, if you have to proof OCRed documents to 100% accuracy, there's
really no reason to keep the image (unless there are graphics on the
page). Text is smaller and easy to convert to other formats. PDFs
are write-once.

I've likely OCR'd more [documents] [than] you'll ever consider.

O RLY? From 2000..2005, I was the principal tester and fixer on a very
large document conversion project, and had to do code-monkey things on
that project as well. Hundreds of thousands of pages from the NYT, WSJ,
Boston Globe, Washington Post, and tons of smaller academic journals
were processed through code I was responsible for. And I had to
spot-check far too many of those pages for various operator errors.
And I've reverse-engineered large chunks of the file format of a certain
OCR engine for company purposes. Mostly extracting info that the engine
stores but DDE doesn't make available, but whatever. So I'd say I have
a fair idea of the ways that OCR engines can fail, and a lot more
experience than you credit me with.[0]

It's a much simpler and a less time consuming task to OCR "properly"
into a text editor or word [processing] software[,] as compared to
adding text behind an image.

If the first option is easier than the second option, then the software
you're using to do the second option is poorly designed. I worked with
another guy to modify the company's conversion software so that it could
produce PDFs with text-behind-image. It worked reasonably well. Too
bad the clients decided they didn't want that feature.

most beginners [in] scanning are under the impression that OCR and
most scanning is [a] uniform task[,] and nothing could be [further]
from the truth. Far too many [strange] circumstances exist on each
project. Each new and unrelated document may require a new
[configuration] or [a] revision of [your] [configuration].

If you can get a decent scan at 300 DPI with good contrast, the scan
isn't skewed, the fonts used are sane, there are no graphics or weird
layouts, and there's no page curl or broken type, OCR just might get 98
or 99% accuracy without much effort on your part. YDocumentsMV.
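
To put 98 or 99% in perspective, here is a rough back-of-the-envelope sketch. It assumes the figure is per-character accuracy, that errors are independent, and that a page holds about 2,000 characters; all three are simplifying assumptions for illustration, not measurements.

```python
def expected_errors(chars_per_page, char_accuracy):
    """Expected number of wrong characters on one page."""
    return chars_per_page * (1.0 - char_accuracy)

def word_intact_probability(word_length, char_accuracy):
    """Chance every character of a word is right, assuming independent errors."""
    return char_accuracy ** word_length

for accuracy in (0.98, 0.99):
    errors = expected_errors(2000, accuracy)      # 2,000 chars/page is assumed
    intact = word_intact_probability(8, accuracy)
    print(f"{accuracy:.0%} accuracy: about {errors:.0f} wrong characters per "
          f"page, {intact:.0%} chance an 8-letter word survives intact")
```

Under those assumptions, even a "good" unproofed scan leaves dozens of wrong characters per page, which is why a search for a longer word can miss a noticeable share of its occurrences.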

Simply piling a load of paper onto a sheet feeder and going about
other [tasks], while your scanner [proceeds] both un-monitored and
operating itself, will generally result in a pile of crap

True dat. OCR engines have improved a bit, but you still need human
intervention to get really good data. People are still much better at
grokking malformed text than computers are (as shown by the "captcha"
thing some webforums use.)

[0] Does "Proquest" ring a bell?
