Guidelines on archival scanning of books?

Jon Noring

Everyone,

Since I have several older, public domain books I'd like to scan (and
the scans will be made public, such as donation to the Internet
Archive), I'd like to get feedback from the experts here regarding
what level of quality is recommended (e.g., optical resolution, color
depth, etc., etc.)

Obviously, the first reply to this inquiry is "it depends on what the
scans are to be used for."

The principle I would like to follow is that the scans are to become
a public digital representation of these books -- to have a
multiplicity of future uses -- and to be *reasonably* sufficient
replacements should, hypothetically speaking, all copies of the
original paper copies disappear.

The word "reasonably" is emphasized since there are extremes that
could be considered (and which will be rejected.) For example, one
extreme is to take apart the books and drum scan each individual page
at a whopping 2500 dpi or greater at 32-bit color depth -- a very slow
and laborious process which creates humongous image files exceeding
200 megs apiece even when losslessly compressed. (Such scans could be
used to make facsimile lithographic print copies that equal and, with
smart image processing, even surpass the original in print quality.
Note that I will not consider any non-digital technology, such as
film.)

I've been looking online for such guidelines, but haven't yet found
anything I consider substantive and authoritative -- the final word on
the topic (but then maybe I'm looking in all the wrong places.) The
few "archivist" forums I've found are either dead or have restricted/
exclusive access. If the information I seek is online, or if there is
a better online forum I can repost this message to, let me know. Feel
free to forward it to other forums.

Thanks.

Jon Noring

(p.s., the lossy, open standards 'djvu' format is intriguing because
it greatly compresses scans yet seems to preserve a high degree of the
scan quality. I'd like feedback on the suitability of using 'djvu' for
archiving the scans -- is it still recommended to store the masters
of the scans in some lossless compressed format, such as PNG?)

(p.p.s., please no discussion on the public domain/copyright aspects.
The books to be scanned are original printings, printed in the late
19th century, and which are Public Domain in both the U.S. and
world-wide -- I regularly consult Stephen Fishman's book "The Public
Domain", so am aware of the copyright aspects.)
 
Jon Noring said:
Since I have several older, public domain books I'd like to scan (and
the scans will be made public, such as donation to the Internet
Archive), I'd like to get feedback from the experts here regarding
what level of quality is recommended (e.g., optical resolution, color
depth, etc., etc.)

Project Gutenberg Distributed Proofreaders -- who you'll also donate the
scans to, right? :-) -- suggest "300dpi, black and white (not
grayscale), and average brightness unless the paper is very yellow.
Higher dpi doesn't necessarily make for better OCR unless the text is
extremely small. You want to end up with good, reasonably clean images
that the OCR software won't choke on."
( http://www.pgdp.net/c/faq/scan/submitting.php#scan )

Project Gutenberg itself suggests 300-600dpi, and points out "A further
paradox emerges when considering higher vs. lower resolutions: depending
on the paper and ink quality, you may see more errors start to appear on
very high resolution scans. These are caused by small imperfections in
the paper or ink spots that show up on the high-res scan, and that the
OCR tries to interpret as letters or punctuation."
( http://www.gutenberg.org/faq/S-10.php )

I'd generally expect that if it's good enough dpi that a computer can
'read' it, it's good enough for a human, so I'd think that 300-600dpi
should be fine for most purposes.

File type is another issue; Distributed Proofreaders uses .png but I
don't know how standard this is.

Zeborah
 
["Followup-To:" header set to comp.periphs.scanners.]
Project Gutenberg itself suggests 300-600dpi, and points out "A
further paradox emerges when considering higher vs. lower resolutions:
depending on the paper and ink quality, you may see more errors start
to appear on very high resolution scans. These are caused by small
imperfections in the paper or ink spots that show up on the high-res
scan, and that the OCR tries to interpret as letters or punctuation."
( http://www.gutenberg.org/faq/S-10.php )

AOL! Use 300 dpi if the text is normal-sized, 600 if the text is really
small ( < 8 point ) or really detailed (Japanese/Chinese text). If
you're not going to OCR the images and the type is normal-sized, you may
be able to get away with 150dpi.
I'd generally expect that if it's good enough dpi that a computer can
'read' it, it's good enough for a human, so I'd think that 300-600dpi
should be fine for most purposes.

Jon should say whether the originals are just English text (300dpi,
black-and-white), English text with lots of grayscale illustrations
(300dpi, grayscale), or English text with lots of color illustrations
(300dpi, 24-bit color).
File type is another issue; Distributed Proofreaders uses .png but I
don't know how standard this is.

PNG is an open standard, so you can't really go wrong there. The main
thing to be aware of is that Group4 TIFF provides much better
compression than PNG does if your image is black-and-white. If the
images are grayscale/color, LZW TIFF may provide better compression, or
it might not. It's always possible to read Group4 and LZW TIFF, but
writing LZW TIFF may be difficult and/or annoying depending on the
software package you're using (damn you, Unisys, damn you to hell.)
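
To compare the two for yourself, here's a minimal Python sketch using
the Pillow imaging library (the input file name "page.png" is made up;
any bilevel scan will do). It saves the same image as Group4 TIFF and
as PNG, then prints the resulting file sizes:

    import os
    from PIL import Image  # Pillow imaging library

    # Hypothetical input: a scan already reduced to 1-bit black-and-white.
    img = Image.open("page.png").convert("1")

    # Group4 (CCITT fax) compression is defined only for bilevel images.
    img.save("page_g4.tif", compression="group4")
    img.save("page_out.png", optimize=True)

    for name in ("page_out.png", "page_g4.tif"):
        print(name, os.path.getsize(name), "bytes")
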
HTH,
 
Zeborah said:
Project Gutenberg Distributed Proofreaders -- who you'll also donate
the scans to, right? :-) -- suggest "300dpi, black and white (not
grayscale), and average brightness unless the paper is very yellow.
Higher dpi doesn't necessarily make for better OCR unless the text
is extremely small. You want to end up with good, reasonably clean
images that the OCR software won't choke on."
( http://www.pgdp.net/c/faq/scan/submitting.php#scan )

Yep, DP is the other major recipient, probably of lower-resolution
(300dpi) B&W images batch converted from the archive versions, which
DP will use for OCR purposes.

DP's recommendations focus on OCR, while my focus is on multi-use
capability. It takes a lot of effort to scan a book, and sometimes a
person only has one shot at some copies, so it's better to err on the
side of higher resolution and more color depth (even if only B&W.) The
downside to this, of course, is much larger images.
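
To sketch the batch conversion mentioned above (a rough Python/Pillow
example -- the folder layout and the threshold value of 160 are made
up and would need tuning for each book):

    import glob
    from PIL import Image

    # Hypothetical layout: 600dpi grayscale masters in masters/,
    # 300dpi black-and-white derivatives for DP written to dp/.
    for path in glob.glob("masters/*.png"):
        img = Image.open(path).convert("L")
        # Halve the pixel dimensions: 600dpi -> 300dpi.
        img = img.resize((img.width // 2, img.height // 2), Image.LANCZOS)
        # Fixed threshold to 1-bit; preview a few pages to pick the value.
        bw = img.point(lambda p: 255 if p >= 160 else 0).convert("1")
        bw.save("dp/" + path.split("/")[-1])
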

File type is another issue; Distributed Proofreaders uses .png but I
don't know how standard this is.

PNG is an excellent open-standards lossless format. Very good
compression, and unlike GIF is capable of 24-bit color. All of today's
web browsers provide native PNG support. There's no longer any reason
to use GIF since PNG provides better lossless compression.

Jon Noring
 
Jon Noring said:
PNG is an excellent open-standards lossless format. Very good
compression, and unlike GIF is capable of 24-bit color. All of today's
web browsers provide native PNG support. There's no longer any reason
to use GIF since PNG provides better lossless compression.

Thanks for the information; I'm in the planning stages of a
digitalisation project at school and was just getting around to working
out whether to go for TIFF or PNG. It sounds, from what you and Dances
With Crows say, as if the latter will be better for us.

Zeborah
 
Zeborah said:
Thanks for the information; I'm in the planning stages of a
digitalisation project at school and was just getting around to
working out whether to go for TIFF or PNG. It sounds, from what you
and Dances With Crows say, as if the latter will be better for us.

As Dances With Crows noted in his reply to my message, for bilevel
(black and white) images CCITT Group 4 provides much better lossless
compression than PNG does. But then web browsers don't natively
support that compression standard used within TIFF files. PNG is a
good all-around lossless compression scheme (and if you plan to
compress greyscale or color scans, then PNG is about the best
general-purpose choice.) Anyway, one can always convert back and
forth between PNG and CCITT4 losslessly, so it's not as if one is
restricted to one or the other. It's when one deals with lossy
schemes (such as JPEG and DjVu) that one runs into conversion issues
and introduced artifacts.
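
That round-trip is easy to check. A small Python/Pillow sketch (file
names hypothetical) that converts a PNG master to Group4 TIFF and
back, then verifies that no pixel changed:

    from PIL import Image, ImageChops

    master = Image.open("page.png").convert("1")
    master.save("page.tif", compression="group4")

    # Read the TIFF back and diff it against the original;
    # getbbox() returns None when the difference image is all zero.
    back = Image.open("page.tif").convert("1")
    diff = ImageChops.difference(master.convert("L"), back.convert("L"))
    print("lossless round-trip:", diff.getbbox() is None)
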

Jon Noring
 
Zeborah said:
Project Gutenberg Distributed Proofreaders -- who you'll also donate the
scans to, right? :-) -- suggest "300dpi, black and white (not
grayscale), and average brightness unless the paper is very yellow.
Higher dpi doesn't necessarily make for better OCR unless the text is
extremely small. You want to end up with good, reasonably clean images
that the OCR software won't choke on."
( http://www.pgdp.net/c/faq/scan/submitting.php#scan )

In my experience with doing this sort of thing [1], 600dpi two-color B&W
scans produce printouts that look considerably cleaner to the eye than
300dpi scans do. And if file size is an issue (and you're scanning
black/white material), it pays to spend the file size on extra
resolution rather than on extra bit depth: the 600dpi two-color scans
produce better final printouts than 300dpi grayscale scans do.

If you've got photos or artwork or color in the book too, things may get
a bit more complicated. One solution that I've seen for some archival
scans is to do one set of scans of everything at settings that work well
for the text, and then rescan the photos and artwork at some higher
resolution and/or higher color depth.

- Brooks

[1] Turning handwritten solution sets into pdfs for a college class, and
scanning old copies of the TeX Users Group journal for online
publication.
 
Project Gutenberg Distributed Proofreaders -- who you'll also donate the
scans to, right? :-) -- suggest "300dpi, black and white (not grayscale),
and average brightness unless the paper is very yellow.

I'd at least scan in grayscale and convert to black and white later -- I've
had good experience with "mkbitmap", which is part of the "potrace" package
at <http://potrace.sourceforge.net/>:

Highpass filtering suppresses large-scale irregularities such as background
variations, while preserving small-scale detail such as lines. Filtering
depends on a parameter called the filter radius, which corresponds roughly
to the size of features that are preserved. The filter radius can also be
identified with line thickness.
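
Roughly the same idea -- estimate the slowly-varying background with a
wide blur, subtract it, then threshold -- can be sketched in Python
with Pillow if you'd rather not shell out to mkbitmap (the radius and
threshold values here are guesses you'd tune per book):

    from PIL import Image, ImageChops, ImageFilter

    img = Image.open("scan.png").convert("L")

    # The blur radius plays the role of mkbitmap's filter radius.
    background = img.filter(ImageFilter.GaussianBlur(radius=10))

    # Highpass: subtract the background, re-centered on mid-gray,
    # so text (darker than its surroundings) falls below 128.
    highpass = ImageChops.subtract(img, background, scale=1.0, offset=128)

    # Threshold to pure black and white.
    bw = highpass.point(lambda p: 0 if p < 110 else 255).convert("1")
    bw.save("scan_bw.png")
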

Andreas
 
(p.s., the lossy, open standards 'djvu' format is intriguing because
it greatly compresses scans yet seems to preserve a high-degree of the
scan quality. I'd like feedback on the suitability of using 'djvu' for
archiving the scans -- is it still recommended to store the masters
of the scans in some lossless compressed format, such as PNG?)

Hey Jon:

Long time no hear. The "open standards" part of DjVu applies only to
the reader. The format itself is patented -- originally by AT&T, as I
recall -- and is now licensed to LizardTech, who sells the compression
technology.

-Art
 
Andreas said:
I'd at least scan in grayscale and convert to black and white later -- I've
had good experience with "mkbitmap", which is part of the "potrace" package
at <http://potrace.sourceforge.net/>:

Highpass filtering suppresses large-scale irregularities such as background
variations, while preserving small-scale detail such as lines. Filtering
depends on a parameter called the filter radius, which corresponds roughly
to the size of features that are preserved. The filter radius can also be
identified with line thickness.

That is, indeed, a very nice piece of software. In my experience,
though, it's not really worth the trouble if you have good clean scans
to start with -- even when I was converting handwritten pencil marks on
yellowgreen "engineering" paper to black/white two-color images, I got
sufficient results by scanning in grayscale and using a manual
threshhold setting (with preview) to convert. On things that are actual
printed text on decent paper, I've had perfectly good results from
black/white scans.
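
(For reference, that plain fixed-threshold conversion is only a couple
of lines in, say, Python with Pillow -- the value 160 is the kind of
number you'd settle on by previewing:)

    from PIL import Image

    img = Image.open("scan.png").convert("L")  # grayscale scan
    # At or above the threshold becomes white; everything else black.
    bw = img.point(lambda p: 255 if p >= 160 else 0).convert("1")
    bw.save("scan_bw.png")
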

It might be useful for things where the paper has discolored, though,
particularly if the discoloration isn't uniform from one part of the
page to another. And it's a good tool to have around.

- Brooks
 