How to scan text pages for smallest-size pdf?

  • Thread starter: Al

Al

Our office has a number of documents that are up to 10 pages long,
and we need to scan them and save them as pdfs. We want to make
these pdfs as small in file-size as possible, so that we don't
clog the email boxes of people we need to send these files to.

What are the most efficient settings to use when scanning with this
goal in mind? These pages are photocopies of abstracts from
scientific journals, so most pages are all text; a few have charts
or graphs.

We've tried scanning some pages at 72dpi but they're not readable
on screen. When we scan them at 100+ dpi the resulting pdf is
pretty large (a 7-page doc turned into a 2MB pdf, which seems
too big).

Meanwhile, someone sent us a 75pp document scan and the pdf
was only 1MB! Unfortunately they didn't create the pdf, so they
don't know why it has such a small file-size.

Any tips are appreciated.
 
Your scan was encoded in the pdf as a graphic, not as text. You need to use
the original document files, or run optical character recognition (OCR) on
your scans to recover the text portions as text, not graphics, and then
encode to pdf.
-Dave
 
----- Original Message -----
From: "Al" <>
Newsgroups: comp.periphs.scanners
Sent: Thursday, October 13, 2005 8:03 PM
Subject: How to scan text pages for smallest-size pdf?

[snip]

You have two options.
The one that gives the smallest files was explained by Dave: OCR the
scans.

The only other "reasonable" option is to scan from within Acrobat with
"Black and White/Line Art" (or whatever your software calls it) selected,
at 150 DPI, and then save as PDF.

I frequently scan small-font, two-column text at 400 DPI in
"Black and White/Line Art", up to 10 pages at a time, and the file size
rarely exceeds 600k.

I found 150 DPI to be the lowest setting that is still recognizable and
printable (your 72 is ineffective). However, even this may depend on the
quality of the printed materials that you are scanning from.

If you're attempting to scan in color, forget about it! It's just not
possible to get the file size down to anything reasonable.
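
To put rough numbers on the difference between scan modes, here is a minimal Python sketch of the raw (uncompressed) data a scanner produces per page. The US Letter page size, the 150 DPI setting, and the helper name raw_page_bytes are assumptions chosen for illustration; compression changes the final figures, but it starts from these amounts.

```python
# Raw (uncompressed) size of one scanned page for different scan modes.
# Page size and DPI are illustrative assumptions (US Letter at 150 DPI).

def raw_page_bytes(width_in, height_in, dpi, bits_per_pixel):
    """Uncompressed image size in bytes for a page scanned at the given DPI."""
    width_px = round(width_in * dpi)
    height_px = round(height_in * dpi)
    return width_px * height_px * bits_per_pixel // 8

MODES = {"line art (1-bit)": 1, "greyscale (8-bit)": 8, "colour (24-bit)": 24}

for name, bits in MODES.items():
    size = raw_page_bytes(8.5, 11, 150, bits)
    print(f"{name:18s} {size / 1024:8.0f} KiB")
```

At the same resolution a color scan starts with 24 times the raw data of a line-art scan, and 8-bit grey with 8 times.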
 
lostinspace said:
You have two options.

[snip]

A third option I can think of... don't know if something like this is
available inside Acrobat, but it's certainly possible in theory.

Convert the scan to vector graphics. That won't have the accuracy
problems of OCR, and it should still gain a good size advantage.

But anyway, what options does Acrobat offer for image compression? We're
talking black and white text: I thought standard (lossless or lossy)
compression methods could shrink such data considerably.


by LjL
 
LjL said:
A third option I can think of... Convert the scan to vector graphics.
That won't have the accuracy problems of OCR, and it should still gain
a good size advantage.

What? Scanners produce raster images, and I'd think that in the general
case, raster->vector would *add* size rather than subtract it.

LjL said:
But anyway, what options does Acrobat offer for image compression?
We're talking black and white text: I thought standard (lossless or
lossy) compression methods could shrink such data considerably.

Group 4 TIFF is lossless and extremely efficient at compressing
black-and-white pages. An 8.5x11" page scanned at 300 DPI in Group 4 will
be about 50-100K depending on image complexity and how many black pixels
you have. It'd be smaller if it were scanned at 150 DPI, of course. I
don't know whether Acrobat uses Group 4 automagically for black-n-white
source images, but it might. It might also do something stupid. Try it
and see.
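
For a feel of why 1-bit text pages shrink so well: the CCITT fax schemes are built around coding runs of white and black pixels rather than individual pixels (Group 4 additionally codes each line relative to the one above it). The Python sketch below shows only the run-length idea. It is not a Group 4 encoder, and the sample scan line is made up.

```python
from itertools import groupby

def run_lengths(row):
    """Return the lengths of alternating runs in a row of 0 (white) / 1 (black) pixels.

    By convention the first run is white, so a row that starts with black
    gets a leading white run of length 0.
    """
    runs = [len(list(group)) for _, group in groupby(row)]
    if row and row[0] == 1:
        runs.insert(0, 0)
    return runs

# Hypothetical 40-pixel scan line crossing two letter strokes.
row = [0] * 12 + [1] * 3 + [0] * 10 + [1] * 2 + [0] * 13
print(run_lengths(row))  # [12, 3, 10, 2, 13]
```

Forty pixels collapse to five numbers here, and a real text page is mostly long white runs, which is where the large savings come from.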

Of course, for text pages, OCRed ASCII/ISO-8859-15 is smaller than any
image format and you can grep it. OCR accuracy depends a lot on how
clean the source image is. Something that was printed on a decent
printer, scanned straight, and didn't have any dirt on it should give
pretty high accuracy with a recent commercial OCR engine. If you need
100% accuracy, though, you'll have to have a human proofread it and
correct it. This takes forever and is boring as hell.
 
Al said:
[snip]

I usually scan text documents at 300 dpi in black & white (CanoScan 5200F). Each A4 page then tends to come out at around 70-80 kB. I think 300 dpi B&W is readable enough; even 200 dpi scans (about 50 kB per A4 page) are readable, but probably not too good if printed.

Greyscale at 100 dpi (160-170 kB) is readable and 150 dpi (300-350 kB) is 'good enough', but then the files are obviously very much larger.

If I OCR the document and then pdf it, it gets down to about 25 kB, but then the process takes longer to complete and is much harder to automate (using the OCR software that came with the scanner, ScanSoft OmniPage 2.0 SE).
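
To see what these per-page figures mean for the 7-page document in the original question, here is a small Python sketch that multiplies them out. It uses the midpoints of the ranges quoted above, and it assumes the 25 kB OCR figure is per page, which the post does not actually state.

```python
# Per-page sizes in kB, taken as midpoints of the ranges quoted in the post.
PER_PAGE_KB = {
    "B&W 300 dpi": 75,    # 70-80 kB
    "B&W 200 dpi": 50,    # about 50 kB
    "grey 100 dpi": 165,  # 160-170 kB
    "grey 150 dpi": 325,  # 300-350 kB
    "OCR then PDF": 25,   # about 25 kB (assumed here to be per page)
}

def document_kb(per_page_kb, pages):
    """Estimated document size in kB, ignoring any fixed PDF overhead."""
    return per_page_kb * pages

PAGES = 7  # the 7-page document from the original question
for setting, kb in PER_PAGE_KB.items():
    print(f"{setting:14s} {document_kb(kb, PAGES):5d} kB")
```

On these numbers, seven greyscale pages at 150 dpi land in the same range as the 2MB reported in the question, while seven B&W pages at 300 dpi stay near half a megabyte.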

PerL
 
First of all, 2M is not big by today's attachment standards. It's only a
performance hit for dial-up users and people whose mail server admins
haven't caught up yet -- even Hotmail allows 250M and 5M attachments!

That said, why not examine the 1M pdf and see what you can figure out? (You
can send it to me at pdf at dolman period ca -- ca, not com -- if you
want.) Check the properties, e-mail Adobe, test various settings (including
those suggested here). You could also ditch pdf and scan to jpg (very
compatible and compressible). Does it have to be pdf? How about ZIPping the
pdf file before attaching it?
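
Whether ZIPping helps is easy to measure. The Python sketch below zips a single file with the standard zipfile module and reports the size before and after; the file name scan.pdf and the helper name zip_and_compare are placeholders. Scanned pages inside a pdf are usually stored already compressed, so the gain may turn out to be small.

```python
import os
import zipfile

def zip_and_compare(src_path, zip_path):
    """Zip one file with DEFLATE and return (original_bytes, zipped_bytes)."""
    with zipfile.ZipFile(zip_path, "w", compression=zipfile.ZIP_DEFLATED) as zf:
        zf.write(src_path, arcname=os.path.basename(src_path))
    return os.path.getsize(src_path), os.path.getsize(zip_path)

if __name__ == "__main__":
    # "scan.pdf" is a hypothetical file name; nothing happens if it is absent.
    if os.path.exists("scan.pdf"):
        before, after = zip_and_compare("scan.pdf", "scan.zip")
        print(f"{before} bytes -> {after} bytes")
```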

Lots of solutions, lots of options.

Good luck. :-)
 
The most size-efficient way I know is to scan in black and white
and compress with the CCITT Group 4 or LZW algorithm. I store
text scans in TIFF/Group 4 format (scanimage --mode lineart |
pnmtotiff -g4 >foo.tif) and the size is on the order of 100 kB
per A4 page at 600 DPI (50 kB at 300 DPI). If PDF supports Group
4 compression, and I think it does, you'll get similar figures.

For on-screen reading, it's often preferable to scan at lower
resolutions (around 70-100 DPI) but in greyscale, and either (a)
quantise to somewhere between 3 and 8 grey levels (pnmdepth 2)
and use a lossless compression algorithm like PNG (scanimage
--mode greyscale | pnmdepth 3 | pnmtopng >foo.png) or (b) use a
lossy algorithm like JPG.
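
A note on the quantising step: the argument to pnmdepth is the new maximum sample value, so pnmdepth N leaves N+1 grey levels (pnmdepth 2 gives 3 levels, pnmdepth 7 gives 8). The Python sketch below illustrates the idea on 8-bit grey values. The function names are made up, and the rounding is a plain round-to-nearest that may not match what netpbm does internally.

```python
def quantise(value, new_maxval, old_maxval=255):
    """Rescale a grey value from 0..old_maxval to 0..new_maxval, rounding to nearest."""
    return (value * new_maxval + old_maxval // 2) // old_maxval

def bits_per_pixel(new_maxval):
    """Bits needed to store grey values in the range 0..new_maxval."""
    return max(1, new_maxval.bit_length())

for maxval in (2, 3, 7):
    levels = sorted({quantise(v, maxval) for v in range(256)})
    print(f"maxval {maxval}: {len(levels)} levels, {bits_per_pixel(maxval)} bits/pixel")
```

Fewer levels mean fewer bits per pixel and longer runs of identical values, which is what lets a lossless format do well on a greyscale text scan.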
 
Andre said:
[snip]

[...]
quantise to somewhere between 3 and 8 grey levels (pnmdepth 2)
and use a lossless compression algorithm like PNG (scanimage
--mode greyscale | pnmdepth 3 | pnmtopng >foo.png) or (b) use a
lossy algorithm like JPG.

Heeey, someone using Unix, SANE and NetPBM! I almost thought I was alone
here :-)

What scanner are you using with SANE?


by LjL
 
Al said:
Our office has a number of documents that are up to 10 pages long,
and we need to scan them and save them as pdfs. We want to make
these pdfs as small in file-size as possible, so that we don't
clog the email boxes of people we need to send these files to.
[snip]
Any tips are appreciated.

scan as BW, or Line Art, or one bit

do NOT scan as grey scale, half tone, or any of the color options.
 
LjL said:
Heeey, someone using Unix, SANE and NetPBM! I almost thought I was alone
here :-)

That's two of us, then. :-) I don't use GUIs unless I have to.

LjL said:
What scanner are you using with SANE?

A small HP ScanJet C7670A with the automatic sheet feeder. The
colours are way off and it's rather more expensive than the
competition but it was the only sheet feeder I could find at the
time. A couple models from Epson et al. purported to have a
sheet feeder option but it turned out to be vapourware.

I'm happy with it but I'm thinking about a second, bigger,
scanner (A3 perhaps). Just testing the waters, you know. :-)
 
Dances said:
What? Scanners produce raster images, and I'd think that in the general
case, raster->vector would *add* size rather than subtract it.

Uh? In the case of "block graphics" (I mean black and white graphics
with large areas of black and large areas of white) vector will
definitely be smaller...

I'm not really sure about text: if you scan at a low resolution and then
compress decently (such as the way you've mentioned below, that I
snipped), I think you could possibly be better off with raster.

On the other hand, you could scan at a higher resolution and then
convert to vector; this would have the advantage of being "infinite
resolution" -- not really, but you can print it at any size and not see
jaggies.

by LjL
 
Andre said:
That's two of us, then. :-)

Make that two-and-a-half... ;o)

Andre said:
I don't use GUIs unless I have to.

Hear! Hear!

I only use Linux occasionally (not for scanning, though) but never use
a GUI with it. Indeed, I always log in as root which drives all my
Linux acquaintances nuts! ;o)

Don.
 