Modifying pdf files produced by a document scanner?

Leonard Evens

I am in the process of scanning a large number of documents, mostly
text, but some handwritten notes. I'm using two scanners. One is
part of a commercial copier/scanner in my department, and the other is a
Canon DR-2050C. With the latter I can exercise some control to enhance
legibility, but that is harder with the department scanner since it
sends me e-mail with an attachment which must be examined later.

It seems plausible that one should be able to modify the resulting files
to enhance legibility when viewed with the Acrobat Viewer. But so far I
have not had much success with what I've tried. I've found software to
use to combine files, split them, rotate pages, and even to annotate
them. The Canon provided an OEM version of Acrobat Standard Edition,
which can do all the aforementioned things, but it can't enhance legibility.

Essentially everything I get is just black and white, but often the
black is just a light gray and hard to read. I would like to make it
darker. I've tried decomposing the pdf files into separate image files
and applying various image manipulation software to darken the blacks.
But generally the files produced are awful. If I export a page as a
JPEG or TIFF file, the text in that file looks very digitized. If
I enhance it and convert it back to pdf, Acrobat shows me the same crummy
thing as what I had in the image file.

I would like to use programs which run under Linux such as ImageMagick
which can convert back and forth and also enhance in a variety of ways,
but the results suffer from the problems I just described.

The one exception to this seems to be Photoshop under Windows, but I'm
not sure what's going on there. I don't have any greater luck using it
if the pdf file has been converted to JPEG or TIFF files by ImageMagick
or another such program. But if I use Acrobat Standard Edition to
export jpegs, one for each page, those look the same in Photoshop (but
not in other viewers) as the pdf file did in Acrobat. On the other
hand, something strange happens when I try to make adjustments.
The adjustments show up in the image window as long as the adjustment
tool is active, but revert to what they looked like before the
adjustment when I click OK in the adjustment window. I think the
adjustment may actually be partially effective, but the change is not
dramatic.

I think that when Acrobat views a pdf file produced by a scanner it does
something different than simply displaying the pixels as some other
image viewer would do. Some information about that might be helpful.
Pdf files can be examined in a text editor since they are just modified
postscript files, but those obtained by scanning are mostly
filled with binary coded data, i.e. image files. But as I noted above,
Acrobat is displaying them in some optimal way which other image
viewers don't use and perhaps Photoshop also employs the same viewing
mechanism.

In any case, I don't know how to use Photoshop in batch mode from a
command line to process a list of jpegs, so it would be impractical to
make modifications one image at a time.

Any suggestions would be appreciated.
 
I'm no expert in this, but pdf is not an image format. Pdfs contain an
image file (tif, jpg etc) stored within. The pdf engine settings
determine what kind of image format, resolution and compression are
stored. PDF wants to do the work for you, but is more difficult to
control. Be careful, screen views may be deceiving and may not be
representative of your image. Some programs resample for your screen
resolution, hence the difference in appearance (but not file content)
between programs. Try printing an image to judge readability and understand
what you have. If you need more control of the image, avoid the extra
layer of pdf and scan to tif where you can specify the settings needed
to get a readable image, can easily adjust levels, gamma, contrast, and
can later convert these to pdf if/as needed. Hopefully most of your
documents are similar and can be scanned at the same settings in batch.
The commercial copier/scanner should be able to do this fairly
automatically on normal documents.
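
To make the "adjust levels, gamma" step concrete, here is a minimal sketch of what a levels adjustment does to 8-bit grayscale pixel values. It is plain Python with no imaging library, and the function names and the black/white points (140 and 230) are made up for illustration; a real editor applies the same kind of lookup table to every pixel on the page.

```python
def build_levels_lut(black_point, white_point, gamma=1.0):
    """Return a 256-entry lookup table for 8-bit grayscale values.

    Values at or below black_point map to 0 (black), values at or above
    white_point map to 255 (white), and everything in between is stretched.
    With this formula, gamma > 1 darkens midtones and gamma < 1 lightens them.
    """
    if not 0 <= black_point < white_point <= 255:
        raise ValueError("need 0 <= black_point < white_point <= 255")
    span = white_point - black_point
    lut = []
    for value in range(256):
        t = (value - black_point) / span
        t = min(1.0, max(0.0, t))  # clamp to the 0..1 range
        lut.append(round(255 * t ** gamma))
    return lut


def apply_levels(pixels, lut):
    """Map each pixel value through the lookup table."""
    return [lut[p] for p in pixels]


# One row of a faint scan: paper near 250, "black" ink only around 150-160.
row = [255, 250, 200, 160, 150, 160, 250]
lut = build_levels_lut(black_point=140, white_point=230)
print(apply_levels(row, lut))
```

The light-gray ink values end up much closer to black while the paper stays white. Note that this only helps if the file really holds gray levels; on a 1-bit scan there is nothing in between to stretch.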
 
Don't scan in "black and white"; scan in 8-bit (256-level) grayscale.

If you really want to do much, you may have to export the pages, edit
them with a photo editor (Adobe Photoshop, Photoshop Elements, or any
other photo program) and re-import the improved JPEGs.

The key is scanning in gray scale NOT black and white, and doing this
with the exposure parameters (brightness, contrast and gamma ... the
names are usually different, however) set correctly, typically with a
"scanning profile" that has been created in advance on the same scanner
(but, usually, by scanning a single typical page in single-sheet mode).
Not all scanners will let you do this; it depends on the scanning
software. Also, you may need different profiles for different documents
(colored paper (white vs. yellow tint or green tint), very light, old
and yellowed, etc.).

I know how to do it all with my hardware and software (HP scanner &
Acrobat version 6), but your setup may be entirely different. Acrobat
does not do the scanning; it uses the TWAIN interface of the scanning
software from the scanner manufacturer. Within Acrobat, the page images
are stored as some format of graphics files (JPEGs usually). Modify
those ... either at the time of creation (by doing a gray scale scan and
having the parameters set correctly) or later (export, modify and
re-import) and you will modify the view that you see. If you do it
right, printing out the PDF file will be ABSOLUTELY indistinguishable
from the original, unless (and this is very possible) it's BETTER.
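
For the "export, modify and re-import" route on Linux, a rough batch sketch might look like the following. It assumes ImageMagick's legacy `convert` command is installed (with Ghostscript for reading pdf), and the file names, the 300 dpi figure and the 15%/85% levels are placeholders to tune, not recommendations. One thing worth checking: ImageMagick rasterizes pdf pages at 72 dpi unless `-density` is given before the input file, which may be one reason exported pages look coarse. The sketch writes PNG rather than JPEG to avoid adding compression artifacts.

```python
import subprocess
from pathlib import Path


def build_convert_command(pdf_path, out_dir, dpi=300, black="15%", white="85%"):
    """Build an ImageMagick command that exports and darkens every page.

    -density must come before the input file so it affects rasterization.
    -level black,white stretches the tonal range between the two points.
    """
    pdf_path = Path(pdf_path)
    out_pattern = Path(out_dir) / f"{pdf_path.stem}-%03d.png"
    return [
        "convert",
        "-density", str(dpi),
        str(pdf_path),
        "-colorspace", "Gray",
        "-level", f"{black},{white}",
        str(out_pattern),  # %03d numbers the pages: name-000.png, ...
    ]


def process_all(pdf_paths, out_dir, dry_run=True):
    """Print (dry run) or execute one command per pdf file."""
    commands = [build_convert_command(p, out_dir) for p in pdf_paths]
    for cmd in commands:
        if dry_run:
            print(" ".join(cmd))
        else:
            subprocess.run(cmd, check=True)
    return commands


if __name__ == "__main__":
    # Hypothetical file names; dry_run=True only prints the commands.
    process_all(["scan1.pdf", "scan2.pdf"], "pages")
```

Run it once as a dry run, try the printed command by hand on a single file, and only then switch to `dry_run=False`. On newer ImageMagick installs the command may be `magick` instead of `convert`.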
 
Bruce said:
I'm no expert in this, but pdf is not an image format. Pdfs contain an
image file (tif, jpg etc) stored within. The pdf engine settings
determine what kind of image format, resolution and compression are
stored. PDF wants to do the work for you, but is more difficult to
control. Be careful, screen views may be deceiving and may not be
representative of your image. Some programs resample for your screen
resolution, hence the difference in appearance (but not file content)
between programs. Try printing an image to judge readability and understand
what you have. If you need more control of the image, avoid the extra
layer of pdf and scan to tif where you can specify the settings needed
to get a readable image, can easily adjust levels, gamma, contrast, and
can later convert these to pdf if/as needed. Hopefully most of your
documents are similar and can be scanned at the same settings in batch.
The commercial copier/scanner should be able to do this fairly
automatically on normal documents.

Thanks for your helpful response; thanks also to Barry Watzman.

I don't believe our department scanner gives me the option of scanning
to anything but pdf, but I will check again. In any event, I've already
recycled many of the documents, so I don't have the option of doing them
over again. When I print from the pdf files, I get something which is
not too bad, and I could rescan printed copies using my home scanner
which does give me more options, but that would be terribly time consuming.

I understand that the pdf files from scans just store image data. But
what I don't understand is how the Acrobat reader manages to make those
images look decent, but can't do the same if I export the images, modify
them, say, with ImageMagick and then reconvert back to pdf. There
appears to be some additional information which the Adobe products seem
privy to which is more than just the image data.
 
There isn't. It's just displaying, exporting or importing whatever
format it's using (usually jpeg).
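
One way to check what is actually stored is to look at the stream filters named in the pdf itself. The sketch below is a crude byte-level scan, not a real pdf parser: it assumes the file is unencrypted and that the image dictionaries are not packed inside compressed object streams, and it only sees the first filter when several are chained. The function names and the sample bytes are invented for illustration; the filter names themselves are standard pdf ones.

```python
import re
from collections import Counter

# Standard pdf stream filters that commonly hold scanned page images.
IMAGE_FILTERS = {
    "DCTDecode": "JPEG",
    "JPXDecode": "JPEG 2000",
    "CCITTFaxDecode": "CCITT fax, 1-bit black and white",
    "JBIG2Decode": "JBIG2, 1-bit black and white",
    "FlateDecode": "zlib (also used for non-image streams)",
}


def count_stream_filters(pdf_bytes):
    """Count the first filter name that follows each /Filter key."""
    counts = Counter()
    for match in re.finditer(rb"/Filter\s*\[?\s*/(\w+)", pdf_bytes):
        counts[match.group(1).decode("ascii")] += 1
    return counts


def bits_per_component(pdf_bytes):
    """List every /BitsPerComponent value (1 = bilevel, 8 = grayscale or color)."""
    return [int(m.group(1))
            for m in re.finditer(rb"/BitsPerComponent\s+(\d+)", pdf_bytes)]


# Made-up fragments shaped like two image dictionaries.
sample = (b"<< /Type /XObject /Subtype /Image /Width 1700 /Height 2200 "
          b"/BitsPerComponent 1 /Filter /CCITTFaxDecode >>\n"
          b"<< /Type /XObject /Subtype /Image /BitsPerComponent 8 "
          b"/Filter/DCTDecode >>")

for name, count in count_stream_filters(sample).items():
    print(name, count, IMAGE_FILTERS.get(name, "other"))
print(bits_per_component(sample))
```

On a real file you would pass `open("scan.pdf", "rb").read()`. If the pages turn out to be 1-bit CCITT or JBIG2 images, the scan holds only pure black and pure white pixels, and any gray seen on screen comes from the viewer's scaling rather than from the file.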
 
The point of printing the image was to allow you to judge the quality of
the image without the complications of downsizing the resolution for
screen display. If it looks OK to you printed, then the actual file
content is OK. Different programs handle the screen display part
differently, and realizing how the low-resolution screen appearance relates
to the actual image content is crucial to getting what you want. Try
different methods of resizing the images for screen display if reading
on screen is important. The ideal screen view will only happen after
conversion of the image to text by OCR. Examine each step in your
export from pdf, manipulation and reconversion to pdf to determine where
you are going astray as this should work.
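
To illustrate how two viewers can show the same file differently, here is a toy comparison of two ways to shrink a black-and-white image for the screen. This is only a sketch in plain Python; I don't know which method any particular viewer uses, and the tiny "page" is invented. Nearest-neighbor keeps one pixel per block, so thin strokes can vanish or turn blocky, while box averaging turns partly covered blocks into gray, which tends to look smoother.

```python
def downsample_nearest(image, factor):
    """Keep only the top-left pixel of each factor x factor block."""
    return [row[::factor] for row in image[::factor]]


def downsample_box(image, factor):
    """Average each factor x factor block, which produces gray levels."""
    out = []
    for y in range(0, len(image) - factor + 1, factor):
        out_row = []
        for x in range(0, len(image[0]) - factor + 1, factor):
            block = [image[y + dy][x + dx]
                     for dy in range(factor) for dx in range(factor)]
            out_row.append(sum(block) // len(block))
        out.append(out_row)
    return out


# 4 rows x 8 columns, 255 = white paper, 0 = black ink.
# Column 1 holds a one-pixel-wide stroke, columns 4-5 a two-pixel-wide one.
page = [
    [255, 0, 255, 255, 0, 0, 255, 255],
] * 4

print(downsample_nearest(page, 2))
print(downsample_box(page, 2))
```

With nearest-neighbor the thin stroke disappears entirely, while box averaging keeps it as a mid-gray pixel. The file content is identical in both cases; only the screen rendering differs.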
 
Hi Leonard,

There is a product called Capio from Kofax that may help. It is available on
their support download page at
http://www.kofax.com/support/IP/Capio/1.5/downloads.asp.

This product is similar to PaperPort but does not produce a searchable PDF
file. However, it DOES include a version of Kofax's VRS (VirtualReScan)
product that does provide image enhancements and can produce clear, crisp
B&W images. In addition, you can configure it to keep a "master" image
around so that the Image Quality can be adjusted later.

If you need a searchable PDF file then you can use the standalone version of
VRS (http://www.kofax.com/products/virtualrescan/index.asp). This is the
same technology included in Capio but provided, essentially, as a scanner
driver. You can download a trial version (actually it is a full copy, but
without a license, images may be stamped after 30 days) by clicking on the
"How to Buy" link on the left side of the page and then clicking on the
"this site" link.

The standalone version of VRS does not keep a "master" copy of the document
for adjustment later. However, since you are using the Canon DR-2050C the
standalone version of VRS has a set of factory settings specifically for
that model. This means that your documents should come out looking great.
However, if necessary you can set VRS so that its unique interactive
real-time interface is displayed and you can adjust the settings and see the
changes right before your eyes.

I hope this helps.

Brian
 