Optimal Scanning into PDF

Terry Smythe

Can somebody point me to a web site containing advice on how to acquire a
good quality PDF document vs tolerable file size?

I'm starting a project to convert an association's monthly journals, going
back to 1964, into PDF, initially for web display and later for DVDs sent to
members.

There are 500+ issues. Each issue averages 50 pages, 8 1/2 x 11, with a
color cover and gray scale images and line drawings scattered throughout.
There is also text that I would ideally like to be able to capture.

I'm currently using PaintShopPro9 through an HP4670 "See-Thru" scanner at
300dpi, into JPG for most pages, and TIF (Fax-CCITT3) for text only pages.
I'm then using Acrobat 8 to create a PDF document by inserting all the
images, followed by Acrobat OCR, followed by Acrobat optimization.

All images come through at about 2400 pixels wide. Using these images with
a current 68 page issue, my PDF document emerges at 32 megs after OCR and
optimization.

If I go one step further and resize all images down to 1500 pixels wide,
file size shrinks to 7.3 megs after OCR and Optimization.

Before image resizing, the PDF image is crisp and sharp. After resizing,
image is somewhat fuzzy, but OCR capture appears unaffected and the document
prints out quite nicely.
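
As a rough check on those numbers, the effective resolution of a page image is its pixel width divided by the paper width. A minimal sketch in Python, assuming the image spans the full 8.5 inch width (the function name is made up for illustration):

```python
def effective_dpi(pixel_width: int, paper_width_in: float = 8.5) -> float:
    """Pixels per inch across the page for a given image width."""
    return pixel_width / paper_width_in

# The two image widths mentioned in the post.
for width in (2400, 1500):
    print(f"{width} px wide -> about {effective_dpi(width):.0f} dpi")
```

Under that assumption the resized pages land a little under 180 dpi, which is consistent with them looking softer on screen while still printing and OCRing acceptably.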

Have I found an optimal process that gives me the best I can hope for? Or
have others found a better process for this kind of application?

BTW, I'm working with my own personal unbroken 43 year journal collection,
and I'm not prepared to guillotine off the spines for auto-feed scanning.
This is why I'm using the HP 4670, well suited for this situation.

I'm also not prepared to OCR scan the original journals into Microsoft Word,
followed by extensive editing. If I am to take this route, I do have ABBYY
Fine-Reader OCR Professional 8, and have experimented with this approach.
Not at all swift, particularly with many of the earlier journals which are
not good quality.

Thoughts of others for this application?

Regards,

Terry Smythe
Winnipeg, Canada
(e-mail address removed)
 
I use an HP 5490C scanner, which has an ADF (automatic document feeder),
and Adobe Acrobat 6. If you do this without an ADF it will take
forever, but if you do it with an ADF, you will have to "unbind" the
material .... best done with a printer's paper shear. 5490's can be
bought on eBay really cheap these days, but be sure to get one with the
power supply and ADF feed tray. (Note: the 5490C is just a 5470C with a
C9866A document feeder, but the power supply has to be changed when you
add the ADF. The new power supply comes with a 5490C or a C9866A but is
different from the power supply that comes with a 5470C.) When I say
"cheap", I mean really cheap, like often under $20.

Make up a profile for both the cover and the content (2 separate
profiles) by scanning a single example of each manually (only to make
the profile ... set the size, gamma, highlight and shadow so that there
is no clipping (of blacks or whites) and no "bleed through").

Acrobat will do the scans just fine, scanning the odd pages (fronts)
first and the even pages (backs) separately, and then properly merging
them into a single document with pages in the proper sequence (if the
material is not double sided, it's even easier).
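
The merging step can be pictured as a simple interleave. The sketch below is only an illustration of the idea, not how Acrobat does it internally; the function name and the `backs_reversed` option are assumptions (back sides often come out last sheet first when the stack is flipped over and fed again).

```python
def merge_duplex(fronts, backs, backs_reversed=False):
    """Interleave front-side and back-side scans into reading order.

    backs_reversed: set True if the back sides were scanned last sheet
    first (an assumption about how the stack was flipped).
    """
    if len(fronts) != len(backs):
        raise ValueError("need one back for every front")
    if backs_reversed:
        backs = list(reversed(backs))
    merged = []
    for front, back in zip(fronts, backs):
        merged.extend([front, back])
    return merged

# Fronts scanned in order, backs scanned from the flipped stack.
print(merge_duplex(["p1", "p3", "p5"], ["p6", "p4", "p2"], backs_reversed=True))
```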

Do the scans all at 300 dpi unless there is specific reason to use a
higher resolution.

This doesn't do OCR, but Acrobat can do OCR (although I've never used
it), or you can feed the PDF files into Omni-Page later.
 
JPEG works fine for text as long as you don't get overly aggressive with
the compression. Trying to save an 8.5" x 11" page of monochrome text at
300 dpi in a 10k JPEG file is a recipe for disaster, but in 100K or a
bit more there is no problem even with subsequent OCR.
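
To put those file sizes in perspective, here is a small sketch of the compression ratio each one implies. It assumes the page is stored as 8-bit grayscale (one byte per pixel before compression) and treats "10k" and "100K" as 10,000 and 100,000 bytes; the function name is hypothetical.

```python
def jpeg_budget(file_bytes, dpi=300, width_in=8.5, height_in=11.0):
    """Return (compression ratio, bits per pixel) for an 8-bit grayscale page."""
    pixels = round(width_in * dpi) * round(height_in * dpi)
    raw_bytes = pixels  # 8-bit grayscale stores one byte per pixel
    return raw_bytes / file_bytes, file_bytes * 8 / pixels

for label, size in (("10K", 10_000), ("100K", 100_000)):
    ratio, bpp = jpeg_budget(size)
    print(f"{label}: about {ratio:.0f}:1 compression, {bpp:.3f} bits per pixel")
```

The 10K target asks for a ratio in the region of 840:1, ten times more aggressive than the 100K one, which helps explain why it falls apart.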
 
But it would be smaller and sharper as a PNG.


Barry Watzman said:
JPEG works fine for text as long as you don't get overly aggressive with
the compression. Trying to save an 8.5" x 11" page of monochrome text at
300 dpi in a 10k JPEG file is a recipe for disaster, but in 100K or a
bit more there is no problem even with subsequent OCR.
 
I'd rather have a JPEG any day; so would most people. It's just a more
universal format.

As to the merits of your jpeg vs. png argument, I'd argue that you can't
say it would be smaller AND sharper, since both formats simply exchange
sharpness for size .... all that you can say, at best, is that at the
same sharpness, one format may be smaller than the other on some
documents (since the compression ratio is a function of the image
contents). And without even going into the merits of an argument that
png is smaller (I don't necessarily accept that it is), the difference,
whatever it is, and whichever direction it leans towards, is simply not
material to most people for most purposes. But JPEG unquestionably
remains a far more universal format, accessible by more software on more
systems.
 
Hundreds or thousands of PDF pages is a size problem in the best of cases. :)

TIF G4 was always the smallest way to go for line art (assumed for text).
G4 will be as small or smaller than JPG (which must be grayscale), but G4 is
lossless and the quality is totally pristine, whereas tiny JPG becomes
garbage, esp for text. PNG handles line art well too, lossless and comparable
size to TIF G3 2D, and smaller than TIF G3 1D, but about half again larger
than G4.

Newer Acrobat can do JBIG2 (for line art), lossless and said to be "orders of
magnitudes smaller" than even G4, at
http://blogs.adobe.com/acrobatineducation/2007/02/optimizing_scanned_pages_part.html

This seems like it warrants looking into.
 
Re: "Hundreds or thousands of PDF pages is a size problem in the best of
cases. :)"

I have personally scanned over 20,000 pages and I have hundreds of
thousands of PDF pages on my PC. In a day where 320 Gigabyte drives are
$79 everyday at Sam's Club, and 500, 750 and 1TB drives are now in mass
production, It is simply NOT a problem.

In general, documents should not be scanned as Line Art, but rather as
Grayscale (8 bits per pixel) unless, of course, you need color. This
isn't intuitive, but after having scanned tens of thousands of pages,
and after doing a lot of experimentation, it's clear that line art
produces a FAR inferior scan of even just text documents with black
printing on white paper.
 
?? Opinions must vary I suppose, but I get a very different answer. Yes,
grayscale aliasing can help the very low resolution on a computer video screen
display. And if you have to use JPG, then there is no choice anyway. But if
you may want to reprint the documents, then it is no contest... the line art
is lossless, no artifacts, at 300 or 600 dpi it matches what the laser printer
can do, and of course the tiny file size becomes extremely important too.
Given sufficient resolution, line art quality is essentially perfect, and JPG
is really terrible.
 
FWIW, my experience seems to match more what Barry is saying, rather
than the experience of Wayne. I had to scan an old Genealogy book,
about 150 pages, mostly text but some photos. I experimented a bit
before setting a procedure. I found that scanning to line art directly
wasn't very good for me, even with the plain text pages. My best
results were obtained by scanning to grayscale tiff, then converting
the individual tiff files to B&W in Photoshop. This way I was able to
get the best contrast and really separate text from background. I then
imported the B&W images into Acrobat, and generated the PDF from
there. The photos were left as grayscale and imported into Acrobat.
This made for quite a bit larger PDF, but gave excellent photos in the
scanned document.

However, it was a LOT of effort. If I had thousands of pages to do,
I'd certainly try another method, even if it made for poorer results.
 
You must be speaking of the Photoshop Threshold control to convert to line
art mode, and of course the scanner (very many of them) has the same threshold
control in line art mode. But scanners do differ, and if line art threshold
is not available, then of course I would have to agree with you, since it is
an essential tool. But I don't agree because mine does a really fine job, with
trial threshold results judged in the Preview window (I only preview the first
page if all are alike). So IMO, the scanner offers the same control, equally
good, and of course it is both fewer steps and faster steps, by far.
 
Fewer steps and faster both seem tremendous advantages. I'll have to
check out what my scanner and its software can do. It's an Epson 3170
Photo, and I really haven't done much of this type of scanning. We
wanted to put a PDF copy of the Genealogy book on our Family CD, and
there was a time issue.... but it looks as if I'd have been better to
have spent more time looking into the optimum scanning method, and so
needed less time manipulating the images before going to the final
PDF.
 
Yes, it is a tremendous advantage in convenience and speed. And, I often
think, in results too.

http://files.support.epson.com/htmldocs/pr317p/pr317prf/howto_7.htm#improving%20character%20recognition%20b

Threshold is normally the only control available in line art mode.
Most scanners offer it, and many call it "threshold", but some share other
control buttons (which have no function in line art mode), maybe either the
Histogram or maybe Brightness (control by name), but it still does threshold,
and result is visible in the Preview window. The one that works in line art
mode is threshold.
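
What the threshold control does can be sketched in a few lines of Python. This is a simplified model, not any particular scanner driver: the function name is made up, the default of 128 is simply the midpoint of the 8-bit range, and treating a value equal to the cutoff as white is an assumption.

```python
def threshold(gray_rows, cutoff=128):
    """Convert 8-bit grayscale rows to bilevel: 0 = black, 1 = white."""
    return [[1 if value >= cutoff else 0 for value in row] for row in gray_rows]

# A faint stroke (values around 140) on light paper (around 230).
scan = [
    [230, 228, 140, 231],
    [229, 135, 142, 230],
]
print(threshold(scan))              # midpoint cutoff: the faint stroke vanishes
print(threshold(scan, cutoff=180))  # raised cutoff: the stroke is kept as black
```

Moving the cutoff up or down is the whole adjustment, which is why judging the result in the Preview window is enough.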
 
Not true. PNG is a non-lossy compression. It doesn't trade off sharpness. It trades off
number of colors, and if there's a very small number of colors (say, BLACK and WHITE, or even
a few dozen, like most logos), it does REALLY well. And just about everything supports PNG
these days. Give it a try.
 
I totally disagree. In my [extensive] experience, Gray Scale gives a
better image than line-art, even when the original is just black text on
white paper. The fact is that the pixels that you scan don't line up
with the pixels on the paper exactly; edges will intersect pixels, and
you get more accurate reproduction if you allow such pixels to be
intermediate shades of gray than if you force them to black or white.
Further, getting exactly the right transition point for line-art is
difficult if the document is imperfect. And the issue is not for
monitor display, it is for reprinting the page.
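
The point about edges crossing pixels can be illustrated with an idealized one-dimensional model, where each pixel reports the average of the ink and paper it covers. This is a sketch under that assumption only (real scanner optics blur more than this), and the function name is hypothetical.

```python
def sample_edge(edge_position, pixel_count=6):
    """Gray values (0 = black, 255 = white) for a black region that covers
    everything left of edge_position, measured in pixel units."""
    values = []
    for i in range(pixel_count):
        # Fraction of pixel i that lies under the black region.
        covered = min(max(edge_position - i, 0.0), 1.0)
        values.append(round(255 * (1.0 - covered)))
    return values

gray = sample_edge(2.25)                          # edge falls a quarter into pixel 2
bilevel = [0 if v < 128 else 255 for v in gray]   # forced to black or white
print(gray)     # pixel 2 holds an intermediate gray
print(bilevel)  # pixel 2 becomes white, so the edge snaps back to position 2
```

In this model the grayscale row still encodes where the edge sat inside the pixel, while the bilevel row can only place it on a pixel boundary. How much that matters depends on the scan resolution.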
 
A very special thanks for this discussion. You folks have given me some
very solid ideas, suggestions and guidance for which I am very grateful.
This project will likely take me several years to complete. It involves
scanning some 500+ issues of an association journal, with an average of 45
pages, which suggests some 23,000 pages overall.
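
For a rough sense of scale, the per-issue figures from the first post (32 megs and 7.3 megs for a 68 page issue) can be extrapolated to the whole run. This is a straight-line estimate only; older issues may compress differently, and the 4,700 MB figure is the nominal capacity of a single-layer DVD, used here as an assumption about the target disc.

```python
import math

PAGES = 500 * 45   # issue count and average page count from the post
DVD_MB = 4700      # nominal single-layer DVD capacity (assumption)

def archive_mb(sample_mb, sample_pages=68, total_pages=PAGES):
    """Scale one measured issue up to the whole collection."""
    return sample_mb / sample_pages * total_pages

for label, sample_mb in (("2400 px wide", 32.0), ("1500 px wide", 7.3)):
    total = archive_mb(sample_mb)
    print(f"{label}: about {total / 1000:.1f} GB, {math.ceil(total / DVD_MB)} DVD(s)")
```

On these numbers the resized version would fit on one disc with room to spare, while the full-size version would need about three.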

With all that has appeared in this discussion, I can follow through, develop
a procedure appropriate for an eventual DVD, then a parallel procedure for
appropriate smaller web display documents. One of the by-products of this
adventure will be a first ever total index of every article that has ever
appeared.

Rather than scan a page twice, for high and low resolution, in what format
would you suggest that the initial scan be saved in as raw data? I am
using an HP 4670 "See-Thru" scanner, WinXPHome, and I have PSP9 and
Photoshop CS2. I have a couple of very large hard drives to work with,
~950gigs in total.

For those inclined, have a look at my initial experiment - a recent 68 page
issue of my association journal, came through at 7.4 megs PDF file. Click
on:

http://mmd.foxtail.com/Smythe/AMICA-44-4_OCR-Optimized.pdf

I am very much aware of the magnitude of this initiative, which is why I'm
seeking advice on the front end. I'll do a number of experiments over the
next few weeks and report back later. Time is a minor issue as I'm fully
retired.

Regards,

Terry Smythe
(e-mail address removed)
 
ANY scanner without an ADF is the wrong scanner, and the scanner that
you have is one of the worst choices. Just the time to change the pages
will kill you.
 
Barry Watzman said:
ANY scanner without an ADF is the wrong scanner, and the scanner that you
have is one of the worst choices. Just the time to change the pages will
kill you.

I do not disagree with you about the labor. However, it is my own
personal 39 year unbroken collection of association journals that I am
prepared to scan one issue, one page at a time. I am simply not prepared
to guillotine the spines of these journals.

I am very much aware of ADF, but deliberately chose not to use a scanner
equipped with such a device. BTW, I do also have another HP scanner
equipped with an ADF.

The scanner I am using, an HP 4670, in my judgement, is well suited for
scanning publications and rare books, with an absolute minimum of handling.
This means using it in a manner not covered in the owner's manual. The only
handling of a book or magazine is page turning, nothing more. I don't use
its cradle at all. A conventional flat bed scanner can be damaging to
rare books by virtue of excess handling.

If I or my association could afford a gorgeous Bookeye book scanner, we
certainly would do so. What I'm doing is our Plan B, something we can
both live with.

Regards,

Terry Smythe
 
I totally disagree. In my [extensive] experience, Gray Scale gives a
better image than line-art, even when the original is just black text on
white paper. The fact is that the pixels that you scan don't line up
with the pixels on the paper exactly; edges will intersect pixels, and
you get more accurate reproduction if you allow such pixels to be
intermediate shades of gray than if you force them to black or white.
Further, getting exactly the right transition point for line-art is
difficult if the document is imperfect. And the issue is not for
monitor display, it is for reprinting the page.

Extensive or not, IMO, you're kidding yourself. The gray pixel aliasing is
false detail not present on the original (definition of aliasing). How is
that more accurate? :) Rhetoric, suit yourself, but obviously you are not
scanning at sufficient resolution for the mode. 200 dpi is probably all the
size we can tolerate if scanning thousands of pages of documents into JPG (JPG
must be at least grayscale - even if these documents are of course not
grayscale material).

However, for line art, 200 dpi is just fax quality (and yes, I agree,
aliasing could help its appearance, but 200 dpi was a very poor choice if we
care about quality). 300 dpi line art is very utilitarian, normally good
enough, if not real fussy. 600 dpi line art better matches what the next
output laser printer can do (assuming the original documents are high quality
to justify it). Quality documents have no issue with the default 128
threshold, but for problem documents, just use your head and tweak the
threshold. It does help to see difficult cases if the threshold tool shows a
histogram, but even if not, just move it back and forth and look at it. For
most of these difficult cases, I'd bet on the line art quality instead of the
grayscale.

Note that 8.5x11 inches of line art at 600 dpi is only 4MB uncompressed, and
G4 might be 1/12 of that size (varies with blank page area) - as small as or
smaller than JPG, but lossless and pristine. There is absolutely nothing shabby
about it. Not suitable for photo images of course.

Grayscale at 200 dpi is also near 4MB. However, JPG compression to 1/12 size
will look bad, esp. text, and even if not compressed, it won't print as well.

Point is, there is no size penalty for 600 dpi line art, and is it ever good.
Prints like a dream.
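
The size figures above are easy to verify. A minimal sketch, assuming a full 8.5 x 11 inch page and ignoring file headers and per-row padding (the function name is made up for illustration):

```python
def raw_bytes(dpi, bits_per_pixel, width_in=8.5, height_in=11.0):
    """Uncompressed size in bytes of a full-page scan."""
    pixels = round(width_in * dpi) * round(height_in * dpi)
    return pixels * bits_per_pixel // 8

line_art_600 = raw_bytes(600, 1)  # line art: 1 bit per pixel
gray_200 = raw_bytes(200, 8)      # grayscale: 8 bits per pixel

print(f"600 dpi line art:  {line_art_600 / 1e6:.2f} MB uncompressed")
print(f"200 dpi grayscale: {gray_200 / 1e6:.2f} MB uncompressed")
print(f"line art at 1/12:  {line_art_600 / 12 / 1e3:.0f} KB")
```

Both scans start from roughly 4 MB, so the comparison comes down to how well each one compresses and how it prints. The 1/12 figure is the poster's estimate for G4, not a measured result.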

There are a few exceptions when line art might not be the choice. Maybe an
old birth certificate that is very yellowed and faded, etc... a grayscale
scan probably shows its existing look and feel, which may be desirable. A
good line art scan probably shows it more as new and pristine again, often a
good thing to restore old documents, but maybe not every case.
 