Disaster with documents scanned into pdf. Please help


Larry

Summary of problem: I have pdf files I created by scanning which are
vastly too large. Though I had "OCR to scan to text" checked when I did
the scanning, apparently the resulting files are NOT text but graphics,
and so they are huge. Is there any way to change these graphic pdf
files to text pdf files?

Now the details:

Using the Lexmark all-in-one scanner, I scanned several hundreds of
pages of xeroxed text pages into pdf. I was doing this for an associate
friend to whom I had promised this job some time ago. The Lexmark user
interface is highly confusing and contradictory, and the Help material
is useless.

I checked Advanced Scanning features, and in that page I checked "OCR to
scan to text" and "Multiple document scan" or something like that. I
had to have the latter feature checked; otherwise, after each page
scanned, the destination application, e.g. Adobe or Word, would open
with that one page in it, making it impossible to scan multiple pages
into a single file.

Back at the main scanning page, if I then checked "Black text" I would
automatically lose the Advanced features I checked above. So I did not
check "Black Text." Rather, I proceeded with "Advanced scanning"
checked (which included having OCR checked and multiple documents
checked, as explained above).

However, the result was an enormous pdf file, so that, say, a pdf file
with 15 pages would be over 10 megabytes. A 100 page pdf was over 80
megabytes. Last night I e-mailed the person I was doing this for to
tell him how big these files were. At that point the job was not half
done. However, he did not get back to me until late today, after the
job was completely done to tell me that that size was unusable for him.

I can't see any other way I could have scanned this, given the options
the Lexmark user interface was offering me. Is there any way that I can
change the existing pdf files to characters so that they will become
much smaller?

Also, what did I miss? Was there some step for each created pdf file
where I was supposed to press a button for "Use OCR to convert to text"
and I didn't do it? I wasn't offered any such option. I had checked
"OCR scan to text" before doing the job, so I assumed that that step was
done automatically.

Larry
 
----- Original Message -----
From: "Larry" <>
Newsgroups: comp.periphs.scanners
Sent: Sunday, July 11, 2004 11:31 PM
Subject: Disaster with documents scanned into pdf. Please help

[snip]

Larry,
I'm sorry to say that I warned you about the file sizes in
previous replies :-(((

Each scanner's software is different. Unless I have the same scanner hardware
and software as you, I can't advise you on exact settings.

The only option you have is to save each PDF page as a TIF first.
Then open each TIF in another program (IrfanView works for me, Photoshop 7.0
does not) and save it as a new TIF.

Then point your scanner's OCR software at that file rather than at a fresh
scan, OCR'ing each page in the process.

This may or may not be effective. As I said previously, it all depends on
the quality of what you began with and the DPI you scanned at.
 
Larry,
I might add there is no such thing as a text-PDF. ALL PDFs are
images (or, as you prefer, graphics).
The difference between a Word or text document created as a PDF as compared
to an image, is that only the images of the text are scanned versus the
whole page image.
 
I might add there is no such thing as a text-PDF. ALL PDFs are images
(or, as you prefer, graphics).

This is not correct. It's possible to create a PDF that contains
instructions kind of like so:

(Palatino-Italic) findfont 20 scalefont setfont
10 20 moveto
(The quick brown dogcow jumped over the lazy fox) show
showpage

...and the resulting PDF will be much smaller in size than a PDF that
contains an image. The text string in the PostScript above is embedded
directly into the PDF and can be recovered with pdftotext or a similar
utility. The text PDF will also look good at any resolution, while an
image PDF will look grainy when you zoom in to 1600%.

The difference between a Word or text document created as a PDF as
compared to an image, is that only the images of the text are scanned
versus the whole page image.

That didn't make much sense. What did you really mean?
 
----- Original Message -----
From: "Dances With Crows" <>
Newsgroups: comp.periphs.scanners
Sent: Monday, July 12, 2004 9:41 AM
Subject: Re: Disaster with documents scanned into pdf. Please help

[snip]


Cows,
First you imply that you know more than I do about PDFs,
and then you want an expanded explanation of the difference between
scanned text and scanned images in "PDFs".

I have no desire to spend time on debates.
Larry previously asked for assistance. I offered an expanded and
worthwhile explanation, which he failed to heed, read or investigate, and
he took another's less detailed explanation instead.
Then he returns and adds insults. Of course, in all fairness, I did submit
an "I told you so!" Even my initial reply included the statement "You
OCR the pages individually and make the corrections," which would have saved
him all his wasted time.

His shortcomings, or his unwillingness to use an SE or something similar to
Webster's, are not my fault or problem.
Neither is your willingness to debate.

If you know so much and desire an argument,
then perhaps you can spend your time instructing (or attempting to instruct)
and communicating with Larry?

Besides, your written communication is so much easier to understand than
mine :-))

Back to scanning. . . .
 
Summary of problem: I have pdf files I created by scanning which are
vastly too large.

Larry, you had said the pages were "all text" and that the purpose was
for web display. PDF file size is a big problem, 100 pages of color can
be astronomical size.

So I have two guesses. The first is that you may be scanning in color or
grayscale mode, and an "all text" page should be scanned in
line art mode (certainly true if there is no color or grayscale to be
captured, as with black text). I am not familiar with your software,
but I'd guess its Black Text setting means line art mode, which is
appropriate for black text. Scanning software normally offers modes
of Color, Grayscale, or Line art, though some programs use different
words for them.

The second guess is that you might also be scanning at printing
resolution, which would be excessive for the web. Files sized right
for printing are large images, unsuitable for the web.
Try 100 dpi for pages to be displayed on the video screen (a rough
match for actual size), or maybe 150 dpi if you find 100 dpi
too small to read. I assume your software allows control of
scan mode and resolution.

In short, line art is 1/24 the size of a color image or 1/8 the size of a
grayscale image, and 100 dpi is 1/9 the size of a 300 dpi image.
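
Those ratios can be checked with a few lines of Python. This is only a sketch of the arithmetic for uncompressed data, assuming 1, 8 and 24 bits per pixel for line art, grayscale and color; the function name is made up for illustration.

```python
from fractions import Fraction

# Bits stored per pixel in each scan mode, before any compression.
BITS_PER_PIXEL = {"line art": 1, "grayscale": 8, "color": 24}


def relative_size(mode_a, dpi_a, mode_b, dpi_b):
    """Uncompressed size of scan A as a fraction of scan B (same page area)."""
    bits = Fraction(BITS_PER_PIXEL[mode_a], BITS_PER_PIXEL[mode_b])
    pixels = Fraction(dpi_a, dpi_b) ** 2  # pixel count grows with dpi squared
    return bits * pixels


print(relative_size("line art", 300, "color", 300))      # line art vs color
print(relative_size("line art", 300, "grayscale", 300))  # line art vs grayscale
print(relative_size("color", 100, "color", 300))         # 100 dpi vs 300 dpi
```

Compression changes the actual file sizes, so treat these as ratios of the raw image data only.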

100 dpi line art page images (no OCR) shouldn't be over 50KB per page, if
even that (assuming your PDF software does the expected file compression
for line art). Of course 100 pages may still run to 5 MB, and you may
instead want several smaller files for web access.

PDF files containing only text characters (no images) are vastly smaller
than page image files, but the normal way such text PDFs are created is to
"print" the source text document (from a word processor like Word or
similar) to a PDF driver, as if it were a printer. Your installation may
or may not include that PDF printer driver, I don't know. I am speaking
of real text characters in Word, NOT page images in Word. Since this
means printing the original text document, your scanned pages
must first go through OCR into Word. These text files are
far smaller than image files (for example, all the manuals we see in
PDF are the original text documents printed to PDF from a word processor
type of program - they are NOT scanned pages). OCR may lose the look and
feel of the original page, but PDF will retain the look and feel of the
Word page.

But if you were going to do the work of OCR, possibly you just
want to create regular HTML web pages instead of PDF. PDF would have
the advantage of keeping all the document pages together: all pages could be
downloaded or printed in one operation, the printed page would look the
same as the page on the screen, etc.
 
Dances said:
(Palatino-Italic) findfont 20 scalefont setfont
10 20 moveto
(The quick brown dogcow jumped over the lazy fox) show
showpage

For me this seems to be PostScript.

best regards,
Hubert
 
It is. PostScript can be much more human-readable than PDF, and it's
easy to turn PostScript into PDF with ps2pdf. This was an *example*
designed to show how it is possible to make a "text PDF" when
lostinspace said that no such thing existed. There's not a 1-to-1
mapping between PDF elements and PostScript elements, but for text, the
concepts are very similar. You can see this for yourself by running
ps2pdf on the PostScript above and then looking at the generated PDF in
a text editor. The text string will get compressed, so it'll be
illegible, but it's obvious what's going on when you look at the markup
elements in the PDF.
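
If you would rather not squint at compressed bytes in a text editor, a small Python sketch like the following can inflate the compressed streams so the text operators become readable. It assumes the streams use Flate (zlib) compression, which is common but not guaranteed, and it uses a crude regular expression instead of a real PDF parser. The function name is hypothetical.

```python
import re
import sys
import zlib

# Crude match for "stream ... endstream" sections; not a real PDF parser.
STREAM_RE = re.compile(rb"stream\r?\n(.*?)endstream", re.DOTALL)


def inflate_streams(pdf_bytes):
    """Yield the decompressed bytes of each Flate-compressed stream found."""
    for match in STREAM_RE.finditer(pdf_bytes):
        try:
            # decompressobj tolerates the end-of-line bytes before "endstream".
            data = zlib.decompressobj().decompress(match.group(1))
        except zlib.error:
            # Not Flate data (another filter, or not compressed); skip it.
            continue
        yield data


if __name__ == "__main__":
    # Usage: python inflate.py some.pdf
    for path in sys.argv[1:]:
        with open(path, "rb") as handle:
            for chunk in inflate_streams(handle.read()):
                print(chunk.decode("latin-1"))
```

In a text PDF the page stream typically shows text-drawing operators between BT and ET markers. Depending on how the font was embedded, the string itself may or may not be human-readable. A scanned page instead shows little more than an instruction to paint one big image.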
In order to use it inside PDF, you might need to create encapsulated
postscript.

Not really necessary. Use ps2pdf, or a library like PDFlib, to write a
"text PDF" directly.
 
Larry,
I might add there is no such thing as a text-PDF. ALL PDFs are
images (or, as you prefer, graphics).
The difference between a Word or text document created as a PDF as compared
to an image, is that only the images of the text are scanned versus the
whole page image.


From the scanner, yes, if that is what you meant, since scanners can only
create images. But otherwise no, certainly not in general. PDF is
PostScript inside, which is all about text characters. The idea of PDF is
that it allows viewing/printing without having a PostScript device (PDF
stands for Portable Document Format). The overhead is in the free Acrobat PDF
viewer program; basically it does the PostScript conversion so we mere
mortals can see or print it.

The PostScript page may also contain images, but images are not required.
A scanned full page image is simply a very large image on an otherwise blank
page. Most PDF files do contain mostly text characters, created by
"printing" the original text source document to a PDF printer driver
(selecting the word processor File - Print menu, and then selecting the PDF
driver as if it were a printer, which creates the PDF file). All the
manuals we see in PDF are done this way... the manual PDF file contains the
original text characters in PostScript page format (not an image of text
characters). The PDF manuals are NOT scanned pages, and never were. File
size would make that impossible.

The individual document page source content may also include a few small
images on the pages, and if so, such images are also included in the
PostScript. File size becomes huge if there are many large images however,
and astronomical if it is all full page images. You can see the
differences though. The text characters are searchable (images are not
searchable). When you resize the PDF page to say 300%, the images must be
resampled and become lesser quality, but the text characters are simply
larger, and still full quality at any size (same as making the text size
larger in a word processor). The text characters will both print and view
exceptionally well, but any images must be sized for one purpose or the
other, and will be far from optimum quality in the other.

Some of us do get the notion to scan full page images into PDF, and while a
few cases can work, it will be a major problem due to file size (all those
full page images). One full page scanned in color mode at 300 dpi (for
printing) might be 25 MB per page. JPG compression can reduce that one full
page color image to 1/2 MB (if extreme 50:1 compression), but the JPG
quality becomes poor (text image quality is esp poor due to the excessive
JPG artifacts). And the 1/2 MB per page file size is still huge (and poor).

Full pages scanned at low resolution for the video screen, and in line art
mode for text, can be almost acceptable size in some cases. Line art
compression is effective and lossless, so full quality, at least at design
size. That is, 300 dpi line art images for printing are often poor when
viewed on the screen at screen size. And 100 dpi images that view well on
the screen typically print poorly.

But real text characters "printed" to PDF from a word processor always give a
vastly smaller file and a much higher quality presentation, and that is
normally the way it is done.
 
Wayne,

Thanks for the help. I thought I had sent this message several hours
ago, but I had not.

First, I had one thing wrong. The person for whom I'm creating this job
wants to make the .pdf file available on line, not primarily for reading
online, but so that people can print selected pages from it. (It
consists of a selection of many short articles.)

Now, when I scan a single page of text into Adobe format, with "Scan as
Text (OCR)" checked, and with black and white checked, and with 300 dpi
checked, the resulting one-page .pdf document is 765 KB.

When I scan a single page into Lexmark Photo Editor, with Black Text
checked, and 300 dpi checked, and then save that resulting Photo Editor
document as a pdf file, the resulting one-page .pdf file is 1.05 MB.

So any way I try it, I'm stuck with these huge files. The next thing for
me to try, according to what you suggested, is to try 150 dpi, and maybe
that would radically bring it down.

But better would be some way to reduce the size of the existing pdf
files I created, so I wouldn't have to do the job all over again.

Larry
 
PDF is not designed to be able to get your data out of a PDF file for a
second try (at least Acrobat is not). PDF is designed for viewing or
printing, not for retrieving data, so think starting over if you want to
change it. Acrobat (full version) will permit adding or deleting
individual pages. There are third party solutions that can read PDF,
OmniPage Pro OCR is one.

The overall main problem with the huge files is not so much your
technique, as much as it is just wanting to print the 100 page files,
which requires enough resolution to make the scope of the job really
huge, I'd say unreasonable for the web. Maybe like wanting to train an
elephant to be a lap dog pet... no matter how much you may want it, there
simply are some size issues. None of us could help you much with that,
it is not a matter of being willing to try, it is the nature of the
beast. The 100 page full page image job you are trying to do is of huge
size, and its size seems prohibitive for the web (at least for scanned
page images).

What program is doing this OCR and writing PDF? Are these options in the
Lexmark scanning software, or their photo editor software, or other third
party software? Their user manual is online, but there is no description
in it. 765KB for one page of OCR text is not impressive, it didn't go
well at all. To investigate that, I would try repeating that scan, but
save it to a Word file once instead of to PDF, and then you can inspect
it in Word (the actual text characters), and should get a good clue about
the problem, about what is not happening (seems much of it must be
remaining as image instead of text). $99 all-in-one scanners don't come
with elegant OCR software, which normally costs that much or more alone.
You could add better OCR software, and other PDF software too.

I am not familiar with your software, so your settings don't have much
meaning to me. By Adobe format, I assume you mean PDF in this context.
I don't know that Black Text means line art mode, but it is still my guess
that it does. An 8.5x11 inch page at 300 dpi has dimensions of
(8.5 inches x 300 dpi) x (11 inches x 300 dpi) = 2550 x 3300 pixels,
which is 2550x3300 = 8.4 million pixels.

If the scan mode is line art, then 8 pixels per byte, or 1 MB.
If the scan mode is grayscale, then 1 byte per pixel, or 8 MB.
If the scan mode is color, then 3 bytes per pixel, or 25 MB.

Period. There are no other answers. 8.5x11 at 300 dpi is a big image.
This data can be compressed smaller in the file.
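
For anyone who wants to rerun that arithmetic for other page sizes or resolutions, here is a minimal Python sketch. It counts raw, uncompressed bytes only, assuming 1 bit per pixel for line art, 8 for grayscale and 24 (3 bytes) for color. The function name is invented for the example.

```python
def raw_page_bytes(width_in, height_in, dpi, mode):
    """Uncompressed image size in bytes for one scanned page."""
    bits_per_pixel = {"line art": 1, "grayscale": 8, "color": 24}[mode]
    pixels = round(width_in * dpi) * round(height_in * dpi)
    return pixels * bits_per_pixel // 8


# An 8.5 x 11 inch page at 300 dpi in each scan mode.
for mode in ("line art", "grayscale", "color"):
    size = raw_page_bytes(8.5, 11, 300, mode)
    print(f"{mode:<9} {size:>10,} bytes ({size / 1e6:.1f} MB)")
```

The compressed size inside the PDF depends on the page content and on the compression the software applies, so these figures are an upper bound for the image data, not a prediction of the file size.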

Your 1.05 MB PDF seems to have two possibilities.

1. If Black Text does mean line art, then you possibly get the right 1
MB answer, but it is not compressed, which seems wrong for PDF. If 300
dpi line art, one compressed PDF page should be closer to perhaps 115KB
(ballpark).

Or 2, possibly the 1 MB is compressed grayscale, maybe more
likely, but I can't tell from here. Looking carefully at the image can
tell, it either contains a few gray tone pixels somewhere (grayscale), or
absolutely no gray at all, all fully black or white pixels (line art).

Using lower scan resolution will reduce the file size too (formula
above), but it may no longer print well... Good fax quality is 200 dpi
line art, I'd stay at least there for printing. 200 dpi is about half
(4/9) the size of 300 dpi. Printing will be less crisp, but readable.

Regardless, 100 pages of line art images at 300 dpi or even 200 dpi is
going to be a huge file. Grayscale will be much larger, and line art
with compression will be your best bet for text-only pages in scanned
image form, but it is a huge job. This is simply the size of your data.

OCR into a word processor, and then writing a PDF with those text
characters will be the smallest way, probably the only viable way for 100
pages, but OCR is the most work by far, there can be errors which you
must proofread to correct. So you would want good OCR software for a
serious job, and there is some learning curve. Then I don't know what
software if any you have to write PDF from Word. Acrobat or PaperPort
Office are two ways to do that. OmniPage Pro will both do OCR and write
PDF but it is still a large job.

It is not my concern, but it seems a favor to mention that copyright
seems a probable problem too. You said you intend to post various
articles and book sections on the web for distribution to others. Unless
you own the copyright to those works, or have specific permission from
the owner(s), this is of course copyright violation, depending on the
owner's whim to sue you for damages. By owner, I don't mean who bought the
book or magazine, I mean the copyright owner that owns rights to the use
of the material.
 
Wayne,

First, thank you very much for being responsive to my questions and for
writing in English, unlike some other people around here.

Before I get into responding to your points, a general question: People
create pdf files by scanning hard printed pages all the time. The web
is filled with such documents. How are such documents created? There
must be a standard way of doing it. That's why I don't understand why
I'm having such a hard time finding out what that standard method is.
(But I think that is addressed below.)

PDF is not designed to be able to get your data out of a PDF file for a
second try (at least Acrobat is not). PDF is designed for viewing or
printing, not for retrieving data, so think starting over if you want to
change it. Acrobat (full version) will permit adding or deleting
individual pages. There are third party solutions that can read PDF,
OmniPage Pro OCR is one.

Ok, so you're saying there's no way that I can somehow alter the
existing pdf documents and make their size smaller.

The overall main problem with the huge files is not so much your
technique, as much as it is just wanting to print the 100 page files,
which requires enough resolution to make the scope of the job really
huge, I'd say unreasonable for the web.

No. No one would print a hundred pages. This huge pdf file (630 pages)
consists of many short articles of between five and twenty pages.
Interested readers would print just the articles they were interested
in.

What program is doing this OCR and writing PDF? Are these options in the
Lexmark scanning software, or their photo editor software, or other third
party software?

The OCR is an option in the Lexmark advanced scanning options. There is
no separate application mentioned. The OCR is done by the Lexmark
itself, or is supposed to be.

Their user manual is online, but there is no description
in it. 765KB for one page of OCR text is not impressive, it didn't go
well at all. To investigate that, I would try repeating that scan, but
save it to a Word file once instead of to PDF, and then you can inspect
it in Word (the actual text characters), and should get a good clue about
the problem, about what is not happening (seems much of it must be
remaining as image instead of text). $99 all-in-one scanners don't come
with elegant OCR software, which normally costs that much or more alone.
You could add better OCR software, and other PDF software too.

I gather what you're saying is that I need the full Adobe Acrobat in
order to create proper pdf files. Maybe that's the source of the
problem. The Lexmark presents the option of scanning to Adobe format,
but in fact it only does this extremely inefficiently.

I am not familiar with your software, so your settings don't have much
meaning to me. By Adobe format, I assume you mean PDF in this context.
I don't know that Black Text means line art mode, but it is still my guess
that it does. An 8.5x11 inch page at 300 dpi has dimensions of
(8.5 inches x 300 dpi) x (11 inches x 300 dpi) = 2550 x 3300 pixels,
which is 2550x3300 = 8.4 million pixels.

If the scan mode is line art, then 8 pixels per byte, or 1 MB.
If the scan mode is grayscale, then 1 byte per pixel, or 8 MB.
If the scan mode is color, then 3 bytes per pixel, or 25 MB.

Period. There are no other answers. 8.5x11 at 300 dpi is a big image.
This data can be compressed smaller in the file.

So then my huge result is what I ought to be getting? Is this what
other people get when they produce pdf files via scanning? That seems
impossible. How could they do anything with such huge files?

Also, what does this compression involve? How is that done?

Your 1.05 MB PDF seems to have two possibilities.

1. If Black Text does mean line art, then you possibly get the right 1
MB answer, but it is not compressed, which seems wrong for PDF. If 300
dpi line art, one compressed PDF page should be closer to perhaps 115KB
(ballpark).

Or 2, possibly the 1 MB is compressed grayscale, maybe more
likely, but I can't tell from here. Looking carefully at the image can
tell, it either contains a few gray tone pixels somewhere (grayscale), or
absolutely no gray at all, all fully black or white pixels (line art).

Using lower scan resolution will reduce the file size too (formula
above), but it may no longer print well... Good fax quality is 200 dpi
line art, I'd stay at least there for printing. 200 dpi is about half
(4/9) the size of 300 dpi. Printing will be less crisp, but readable.

Regardless, 100 pages of line art images at 300 dpi or even 200 dpi is
going to be a huge file. Grayscale will be much larger, and line art
with compression will be your best bet for text-only pages in scanned
image form, but it is a huge job. This is simply the size of your data.

OCR into a word processor, and then writing a PDF with those text
characters will be the smallest way, probably the only viable way for 100
pages, but OCR is the most work by far, there can be errors which you
must proofread to correct. So you would want good OCR software for a
serious job, and there is some learning curve. Then I don't know what
software if any you have to write PDF from Word. Acrobat or PaperPort
Office are two ways to do that. OmniPage Pro will both do OCR and write
PDF but it is still a large job.

So, I would need both a standalone OCR program (not the built-in OCR
feature that's a part of Lexmark) and either the full Adobe program or
the equivalent.

I simply undertook this job without the proper equipment. I assumed it
would be easier than it was. I saw that "OCR" feature in the Lexmark, I
saw the "scan into Adobe format" in the Lexmark, and I foolishly assumed
that this meant I could create a viable pdf file.

It is not my concern, but it seems a favor to mention that copyright
seems a probable problem too. You said you intend to post various
articles and book sections on the web for distribution to others. Unless
you own the copyright to those works, or have specific permission from
the owner(s), this is of course copyright violation, depending on the
owner's whim to sue you for damages. By owner, I don't mean who bought the
book or magazine, I mean the copyright owner that owns rights to the use
of the material.

Well, that's the concern of the person for whom I'm doing this. Most of
these writings are quite old, classic texts of political philosophy,
stuff like that.

Thanks again for the help. I really appreciate it.

Larry


 
Before I get into responding to your points, a general question: People
create pdf files by scanning hard printed pages all the time. The web
is filled with such documents. How are such documents created? There
must be a standard way of doing it. That's why I don't understand why
I'm having such a hard time finding out what that standard method is.
(But I think that is addressed below.)


Wayne is totally correct about scanning methods and although I am no expert
in this area, trial and error have been helpful to me. Perhaps your
procedure is not quite clear as a process.
I assume you are scanning each page <one at a time on your Lexmark> and the
result must arrive in a folder in your computer with a file name that you
have designated. A simple method is the document name and a number after a
dot for each page number. E.g. mybook.1, mybook.2 etc. and the suffix will
be in your case .pdf
I am surprised that a Lexmark all-in-one offers a .pdf construct in its
accompanying software, but that doesn't mean it does not exist. For example
Photoshop provides the facility of Save As in Photoshop .pdf and in my
experience this results in a very large file. If you have Photoshop and can
access your scanner through either a PlugIn or Twain Acquire, you could scan
via Photoshop at say 150 dpi and save as Photoshop .pdf with Zip compression
<available in Photoshop CS> and wind up with a relatively workable file.
Storing them in a folder and sequentially numbered allows you to complete
your .pdf via Adobe Distiller or Adobe Acrobat <full version not Reader> in
a MUCH smaller document than any other method. I do believe that the
Distiller process is the step you are missing. End result is one file in
.pdf with all 630 pages.
When you read many manuals etc created as .pdf files, you will notice how
clumsily they are put together, with the blank pages from the original
document and naturally deficiencies in numbering that throw the pages out
of sequence, so that index and contents are a rough guide at best.
I wish I knew the 'correct way' to do this process, but the above works for
me and I treat the new material as a completely new production in .pdf and
with a little patience, make the pages appear as they should.
By the way, copyright infringements are as much your problem as the owner of
the document. You will need an acknowledgement in writing that the material
is free of copyright regardless of the fact that you are just a cog in the
wheel of production.
 
Before I get into responding to your points, a general question: People
create pdf files by scanning hard printed pages all the time. The web
is filled with such documents. How are such documents created? There
must be a standard way of doing it. That's why I don't understand why
I'm having such a hard time finding out what that standard method is.
(But I think that is addressed below.)

They are going to have pretty much the same size problems, esp for 630
page files. In older versions of Acrobat, users kept asking "why is my
PDF size 4 MB per page?" But Acrobat 6 does have better compression
now, and it does pretty well. I just tried a real quick test, Acrobat 6
Standard, and 8.5x11 inches 300 dpi color with middle JPG Quality setting
(default) was 257KB for one page PDF, and 300 dpi line art was 64KB. That
is good, amazing actually (size will vary with content, and with selected
JPG Quality setting), but I don't think it changes much if you multiply
that by 630 pages.
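
As a rough check on the figures above, the per-page sizes from that quick test can simply be multiplied by the page count. This is only a back-of-envelope sketch: it assumes every page compresses like the single test page, and the function name is made up for illustration.

```python
# Back-of-envelope estimate: total PDF size if every page compressed
# like the single test page quoted above (an assumption; real pages vary).

def total_size_mb(kb_per_page: float, pages: int) -> float:
    """Return the estimated total size in MB (taking 1 MB = 1024 KB)."""
    return kb_per_page * pages / 1024

PAGES = 630
color_mb = total_size_mb(257, PAGES)     # 300 dpi color, default JPG quality
line_art_mb = total_size_mb(64, PAGES)   # 300 dpi line art

print(f"color:    about {color_mb:.0f} MB")
print(f"line art: about {line_art_mb:.0f} MB")
```

So even at these compression levels, 630 scanned pages would stay in the tens to hundreds of megabytes.
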
No one would print a hundred pages. This huge pdf file (630 pages)
consists of many short articles of between five and twenty pages.
Interested readers would print just the articles they were interested
in.

Then the notion of a 630 page PDF file seems wrong... Seems better to
instead have many short individual PDF files, each with one topic, all
as small as possible. The web page offers descriptions and links to each
one, and the user can select those few of interest and ignore the rest.
A 20 page PDF of scanned pages is still going to be large, but much
easier to handle than 630 pages. I doubt any of them would complete a
630 page download of scanned PDF pages.

I gather what you're saying is that I need the full Adobe Acrobat in
order to create proper pdf files. Maybe that's the source of the
problem. The Lexmark presents the option of scanning to Adobe format,
but in fact it only does this extremely inefficiently.

I am not familiar with what you have now for PDF, it may be fine, but
Acrobat is the mother of all PDF and very good at what it does, which is
all things PDF. Acrobat allows one to "print" text source documents to
PDF from any program that prints, which is the normal and very best way
to use PDF (but there are a few options which could be confusing). For
example, I print my tax 1040 form to PDF from the tax software, and print
other financial reports from Quicken to PDF, and archive it together on
CD each year. Or one can scan full page images into Acrobat which is
straightforward to use, but suffers the larger size problems we've been
discussing. For color or grayscale images, Acrobat has JPG compression
with a variable quality setting, for smaller worse files or larger better
files. For line art images, it uses a different method that keeps better
quality and is still very effective for line art. PaperPort Office also provides
both methods, but PaperPort Deluxe only scans into PDF. The printing
method produces a relatively small and clear PDF text file (but you need
the original text source document to be printed of course), and all of
the manuals we see in PDF are of this type. The scanning method is going
to be a relatively huge file, of less quality, however file size varies
with scan resolution (appropriate for printing, or for screen viewing),
and image mode (line art, color, grayscale), and of course the number of
pages.

Sorry, I misspoke, should have said 3 bytes per pixel, or 25 MB.
Also, what does this compression involve? How is that done?

Data compression is done when the file is saved, automatically by
selecting a compression method option. In Acrobat, it is selected when
the first scan box comes up, but in others it may be an Option at the
file Save As box, and a few cases have a Preferences menu. Help menu,
search for compression I suppose. The compressed data is stored more
compactly in the file. JPG is lossy compression, but variable, and using
much of it for a smaller file size can hurt the image quality.
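
To make the numbers concrete, here is a small sketch of the uncompressed size of one scanned page in each image mode, followed by a lossless round trip using Python's standard zlib module. It assumes an 8.5x11 inch page at 300 dpi. The synthetic "page" (white with an occasional black row) is invented for illustration and compresses far better than a real scan would; zlib merely stands in for whatever lossless method a given scanning program offers.

```python
import zlib

DPI = 300
WIDTH_PX = int(8.5 * DPI)    # 2550 pixels across
HEIGHT_PX = 11 * DPI         # 3300 pixels down
PIXELS = WIDTH_PX * HEIGHT_PX

# Uncompressed bytes per page for each scan mode.
raw_color = PIXELS * 3       # 24-bit color: 3 bytes per pixel
raw_gray = PIXELS            # 8-bit grayscale: 1 byte per pixel
raw_line_art = PIXELS // 8   # 1-bit line art: 8 pixels per byte

for name, size in [("color", raw_color),
                   ("grayscale", raw_gray),
                   ("line art", raw_line_art)]:
    print(f"{name:9s} {size / 1_000_000:5.1f} MB uncompressed")

# Synthetic 1-bit page (hypothetical data): white rows, every 50th row black.
row_bytes = (WIDTH_PX + 7) // 8          # rows padded to whole bytes
white_row = b"\xff" * row_bytes
black_row = b"\x00" * row_bytes
page = b"".join(black_row if y % 50 == 0 else white_row
                for y in range(HEIGHT_PX))

compressed = zlib.compress(page)
restored = zlib.decompress(compressed)

# Lossless means the restored data is identical to the original.
print("identical after round trip:", restored == page)
print(f"{len(page)} bytes -> {len(compressed)} bytes")
```

The color figure works out to about 25 MB, matching the 3 bytes per pixel estimate. A lossy method such as JPG would not pass the round-trip check, which is the price of its smaller files.
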
So, I would need both a standalone OCR program (not the built-in OCR
feature that's a part of Lexmark) and either the full Adobe program or
the equivalent.

One or the other depending on selected method, but I don't know how to
tell you what to do. The idea of this job takes my breath away for all
the reasons we've been discussing, and there doesn't seem to be a best way
to do that job. No point in saying, but it would be vastly better if
you had the original text source files instead of xerox paper copies. It
might be possible then (like those manuals we see). Doing OCR could
approximate that original source, but that seems a formidable job in
this case. It seems a safe bet you need better OCR software than came
free with a $99 scanner. This is true of all scanners, not just yours (we
don't get much for free). OmniPage Pro is good OCR that scans, and writes
PDF or almost any word processor file format. It could also read your
existing PDF as input if the scan quality was good enough, however
OmniPage Pro works best with poor quality printing at 300 dpi grayscale.
If the text is tiny, maybe 400 dpi.

All OCR makes some errors, especially if the original printing isn't too good.
Xerox copies are notoriously poor quality and so are more difficult to
OCR accurately than pristine laser printed text. 630 pages is a lot of
work, many pages to scan and proofread and correct, no doubt some editing
too. I am glad it is not me going to try that. I am not suggesting you
do it either, I just don't know a better way to suggest. Text has the
advantage of a smaller PDF file, and a better quality file, but much more
work to get there. One should always do some experimenting to research a
plan for such an operation before beginning the work.
I simply undertook this job without the proper equipment. I assumed it
would be easier than it was. I saw that "OCR" feature in the Lexmark, I
saw the "scan into Adobe format" in the Lexmark, and I foolishly assumed
that this meant I could create a viable pdf file.

The scanner itself ought to be fine to scan the page, but less so the
free software, esp the OCR software. I don't know what you have, and it
could be an embedded product instead of addon free software. The OCR
would likely do better now if you could scan good quality printing,
instead of xerox copies. All scanners come with free OCR software that
really isn't very useful, maybe workable for small infrequent casual jobs,
but not much for large serious frequent work. The good stuff must be
purchased, but it makes a big difference, however even it is better with
good printed originals.
 
Thanks for these further answers. There's a lot here. You wrote:

It could also read your existing PDF as input if the
scan quality was good enough, however OmniPage Pro works best with
poor quality printing at 300 dpi grayscale. If the text is tiny,
maybe 400 dpi.

I gather this means that with OmnipagePro I could scan the existing pdf
files I've created into more efficient pdf files, so that I wouldn't
need to physically scan all 630 sheets of paper again. Is that correct?

If the answer is yes, then with a $100 purchase, I might be on my way.

Thanks,
Larry
 
I gather this means that with OmnipagePro I could scan the existing pdf
files I've created into more efficient pdf files, so that I wouldn't
need to physically scan all 630 sheets of paper again. Is that correct?

If the answer is yes, then with a $100 purchase, I might be on my way.


Well, you will know much more after you try it <g> I am still hesitant
to consider this astronomical size job as very feasible.

Yes, OmniPage Pro 14 will read PDF and do OCR and output the text in
various formats, for example a Word document, or back to PDF (with the
effect of changing page images to text). There are many unknowns (to
me), the quality of your pages, how it was scanned, etc. OCR results
will be best if it is 300 dpi grayscale mode. If it was like say 150
dpi, then probably not so good. The best I can tell is that you will
know much more after you experiment with it.

There is a page editor in OmniPage Pro, although it ain't no Word, and
I've only done minimal stuff with it. It could come out fine
automatically, but probably won't. For example, if your xerox copies have
larger random black spots as xerox often does, the program will know this
is not text, but all it knows is to try to retain that area as a little
image on the page. Perhaps you don't care, but you probably want to clean
up that type of thing (by deletion) before going back to PDF (else it
increases file size too).

There is an acquired skill in operating good OCR packages, the first day
will not be your best day. You need to read its manual and to learn its
zone concepts. It is NOT trivial. It does have auto modes, and it will
mark the zones automatically, usually very well, but humans can see there
are sometimes other better plans, or you have different goals. You can
generally do area deletions there too, by clearing zones (the unwanted
image spots).

You actually have a 630 page PDF file now? Hard to imagine, but that
would seem a staggering load on a computer, and I would expect that total
load to crawl, esp since it is all scanned pages. I recently helped a
friend with OmniPage Pro, a 300 MHz laptop and 128 MB memory, to read a
128 page PDF that was text, only a few small images in it, and so there
wasn't any actual OCR effort, it was already text, and didn't need
validation. This is a few magnitudes less than your effort, and it took
a few hours to process it and write a Word file. The little laptop was
acceptably fast with only a few pages, but the total memory requirement
was staggering. It made it however.

As to OmniPage working with your scanner... OmniPage Pro should work with
any Twain scanner, so no problem would be expected. However your Lexmark
scanner docs and specs do not seem to mention the word Twain. I really
cannot imagine it isn't Twain, but I don't know about it.

Simple to determine however. What other photo editor do you have? If
none, you could download the free IrfanView from www.irfanview.com (it's a
great little viewing utility anyway) and try scanning from it (into it)
via its menu File - Select Scanner to select the Lexmark, and then File -
Acquire to scan a test image with it. If that works, then OmniPage should
work too.
 
"PDF is not designed to be able to get your data out of a PDF file for a
second try (at least Acrobat is not)."

So it seems you're saying that while Omnipage Pro can open a pdf file
and resave it at a smaller size, Adobe does not have this ability.

I just called both companies and their answers seem to confirm what you
said. Omnipage customer service told me I could open and re-save the
existing pdf at a smaller size. Adobe customer service wasn't sure and
they are getting back to me.

But what I'm thinking of doing is finding a computer service where you
can pay for the use of a computer by the hour that has the Omnipage
program on it, bring my pdf files there, and see if this will work.
 
Well said:
hesitant to consider this astronomical size job as very feasible.

Yes, OmniPage Pro 14 will read PDF and do OCR and output the text in
various formats, for example a Word document, or back to PDF (with the
effect of changing page images to text). There are many unknowns (to
me), the quality of your pages, how it was scanned, etc. OCR results
will be best if it is 300 dpi grayscale mode. If it was like say 150
dpi, then probably not so good. The best I can tell is that you will
know much more after you experiment with it.

The original photocopies I scanned from were very clean and dark. I
scanned in black and white, 300 dpi.

What I have is 10 pdf files, ranging between 15 pages and 132 pages,
totaling 630 pages. The total size for all of them is 545 MB.

Thanks again.

Larry
 
The original photocopies I scanned from were very clean and dark. I
scanned in black and white, 300 dpi.

What I have is 10 pdf files, ranging between 15 pages and 132 pages,
totaling 630 pages. The total size for all of them is 545 MB.

That is sounding good then. Start with a small one.
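
For what it is worth, the totals quoted above give a quick sanity check on how the existing files are stored. The sketch below assumes 8.5x11 inch pages and decimal megabytes, neither of which is stated in the thread, so treat the comparison as a rough guide only.

```python
# Figures quoted in the thread.
TOTAL_MB = 545
PAGES = 630

avg_mb_per_page = TOTAL_MB / PAGES

# Uncompressed size of one 1-bit (black and white) page at 300 dpi,
# assuming an 8.5x11 inch page: 2550 x 3300 pixels, 8 pixels per byte.
raw_1bit_mb = (2550 * 3300 / 8) / 1_000_000

ratio = avg_mb_per_page / raw_1bit_mb

print(f"average per page: {avg_mb_per_page:.3f} MB")
print(f"raw 1-bit page:   {raw_1bit_mb:.3f} MB")
print(f"ratio:            {ratio:.2f}")
```

If the pages really are stored as 1-bit images, a ratio this close to 1 would suggest that little compression was applied, but the thread does not say how the Lexmark software stored them.
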
 