Cleaning up original Burton "Kama Sutra" page scans -- need advice/help

  • Thread starter: Jon

Jon

I'm now working to "clean up" the 182 page images from a recent scan
of a very rare and noteworthy public domain book. The cleaned-up scans
will be released to the public (such as given to the Internet Archive)
for free access. [For those interested, the book is the 1885 second
printing of the second edition of Sir Richard F. Burton's "Kama Sutra
of Vatsyayana".]

The scans were done at 600 dpi (optical) 256-level greyscale (there's
no color in the book), to capture sufficient fine-detail to aid in the
cleanup process. Of course, the book was chopped (the binding was
falling apart anyway) and each page scanned on a flat-bed, so there's
no page distortion caused by trying to scan a bound book. There are no
illustrations -- it's all black and white text.

I've already deskewed, cropped, centered and size-normalized all 182
pages. (For those interested, links to two sample partially-cleaned
pages are given below.)

In the cleanup process, I'd like to convert what I now have into
600-dpi *bitonal* (black and white) with uniform and nicely readable
character density, removal of "pepper", cleanup of larger blotches,
etc. I recognize there will be some handwork required, particularly to
remove larger "pepper" and blotches, and repair a few characters,
etc., but of course want to minimize handwork.

[Note that the purpose of the cleanup is for direct human-use of the
scans, and not solely for OCR purposes which doesn't require the
planned level of cleanup. For example, I plan to produce a DjVu
version for direct reading. For those who will probably ask, the raw
page scans have already been uploaded to Distributed Proofreaders for
conversion to structured digital text.]

Unfortunately, what complicates the clean-up process is that the
original book is in poor and variable condition. The paper is quite
yellowed and darkened, and many pages are quite faded. Were the
original in mint condition with good, uniform ink-to-paper contrast, I
wouldn't be posting this request for advice. But the overall poor
quality and page-to-page variation are taxing my graphics abilities to
produce a clean finished product with reasonably readable and uniform
character density (at 600-dpi bitonal).

Here are two sample pages, each about 4.5 megs in size (2550x3900
greyscale):

http://www.openreader.org/kamasutra/page031.png (good condition)
http://www.openreader.org/kamasutra/page106.png (poor condition)

I would assume that others have had similar needs and have come up
with various processing tricks and even built special tools to aid in
the clean-up process (e.g., how to auto-remove small "pepper", the
one to few pixel wide black spots on the white background?). I look
forward to your advice and even help if you are interested (I will
upload all the partially-cleaned images somewhere if you want to help
with the actual clean-up process -- the whole set of images totals 680
megs.)

[As a final note, I use Paint Shop Pro 9, but do not have Photoshop.
But since PSP9 is fairly powerful, I assume that many, if not all,
recommended Photoshop processes will map over to PSP9.]

Thanks!

Jon Noring
 


After examining both pages, I recommend you use the curves function in
Photoshop to adjust the contrast. This will eliminate the background tint
and darken the text in a couple of steps.

While in grayscale, change your shadow dot (upper right corner) to Input
90, Output 100, or whatever value you think works best. Change your
highlight to Input 10, Output 0, or whatever value gives you the best
result.
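For anyone without Photoshop, the same endpoint adjustment can be sketched as a lookup table. This is only a rough illustration, assuming the Input/Output numbers above are ink percentages (0% = white, 100% = black) and that the curve is a straight line between the two moved endpoints. The function name `levels_lut` and its parameters are made up for this example.

```python
def levels_lut(shadow_ink=90, highlight_ink=10):
    """Build a 256-entry lookup table for an 8-bit greyscale image.

    Assumption: ink percentages map to grey levels as
    0% ink = 255 (white) and 100% ink = 0 (black).
    """
    # Grey levels at or below black_point become pure black.
    black_point = 255 * (100 - shadow_ink) / 100
    # Grey levels at or above white_point become pure white.
    white_point = 255 * (100 - highlight_ink) / 100
    lut = []
    for level in range(256):
        scaled = (level - black_point) / (white_point - black_point)
        scaled = min(1.0, max(0.0, scaled))  # clamp to the 0..1 range
        lut.append(round(scaled * 255))
    return lut

lut = levels_lut()
```

If Pillow is installed, a table like this can be applied to a mode "L" image with `Image.point(lut)`. The 90/10 values are just the starting point suggested above and would likely need tuning per page.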

I would put the good and bad page files in separate folders, batch them,
and then do the finishing cleanup on the rough pages with Photoshop's
eraser and clone stamp tools.
 
Jon said:
[As a final note, I use Paint Shop Pro 9, but do not have Photoshop.
But since PSP9 is fairly powerful, I assume that many, if not all,
recommended Photoshop processes will map over to PSP9.]


My last PSP was 7, so I am unsure what might be in the current version.
The key in mapping grayscale to line art is the Threshold control. The
line art threshold is the dividing line between those pixels that will
become white and those that will become black. The key is being able to
set the threshold as appropriate to the data that you have. Controls
such as the PSP "Decrease color depth" to 2 colors are often poor in
comparison to seeing and using the optimum threshold.

On your "poor" example, use PSP menu Colors - Adjust - Threshold
and set the threshold (between what will be black and what will be
white) at about 212. Seems like a good result to me (in this one case).
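In code terms, the threshold step is just a per-pixel comparison. Here is a minimal sketch, assuming the page is available as a sequence of 8-bit grey levels (0 = black, 255 = white). Whether a pixel exactly at the cut-off goes black or white is my assumption, not something PSP documents here.

```python
def threshold(pixels, cutoff=212):
    """Convert 8-bit grey levels to bitonal values.

    Pixels darker than the cutoff become black (0); all others
    become white (255). Treating pixels equal to the cutoff as
    white is an assumption made for this sketch.
    """
    return [0 if p < cutoff else 255 for p in pixels]

# A tiny made-up row: dark ink, faded ink, paper noise, clean paper.
row = [35, 190, 211, 212, 230]
bitonal = threshold(row)
```

Raising the cut-off thickens faded characters but also lets more paper noise turn black, so 212 is a value for this one page rather than a general setting.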

You are working blind in PSP, however. You can see the result preview,
but you cannot see the data histogram to guide you. Photoshop has a
similar Threshold tool, but with the strong advantage that it also
shows the histogram of the data. On this one image (your "poor"
example), the histogram makes it obvious that the white background
peak (all the white pixels) occurs just above 212. Without seeing the
histogram, you will have to use trial and error, but most pages are
probably very similar.
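If the tool will not show a histogram, one can be computed outside it. The sketch below is only a heuristic of my own, not what Photoshop does: it assumes the paper is the most common tone on the page, finds that peak, and walks toward black until the counts fall to a small fraction of the peak height. The names `suggest_threshold` and `fraction` are hypothetical.

```python
def histogram(pixels):
    """Count how many pixels fall on each of the 256 grey levels."""
    counts = [0] * 256
    for p in pixels:
        counts[p] += 1
    return counts

def suggest_threshold(counts, fraction=0.01):
    """Suggest a cut-off just below the paper (background) peak.

    Assumption: the tallest histogram bar belongs to the paper.
    Walk darker from that peak until a bar is no taller than
    `fraction` of the peak, and return that grey level.
    """
    peak = max(range(256), key=lambda level: counts[level])
    level = peak
    while level > 0 and counts[level] > counts[peak] * fraction:
        level -= 1
    return level

# Made-up page: a paper peak around 221-225 plus some dark ink at 40.
pixels = ([225] * 1000 + [224] * 800 + [223] * 500 + [222] * 200
          + [221] * 50 + [220] * 5 + [40] * 300)
cutoff = suggest_threshold(histogram(pixels))
```

With Pillow, `Image.histogram()` on a mode "L" image returns the same kind of 256-entry list. Since the pages vary so much, the suggested value is best treated as a starting point to check by eye.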
 
Before going to gray scale you should do a color channel
separation. The blue channel is usually very noisy and
should be discarded. Choose between the red and green
channels for the cleanest page. Convert this best channel
to gray scale. Use the Levels command to remove the gray
background and blacken the letters as much as possible.
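For a color scan, the channel choice can be roughed out in code. This sketch assumes the page is a list of (R, G, B) tuples and uses the spread between a dark and a light percentile as a stand-in for judging the "cleanest" channel by eye. That measure and the function names are my own assumptions.

```python
def split_channels(rgb_pixels):
    """Separate (R, G, B) tuples into three lists of 8-bit values."""
    reds, greens, blues = zip(*rgb_pixels)
    return {"red": list(reds), "green": list(greens), "blue": list(blues)}

def contrast(values, low=0.05, high=0.95):
    """Spread between a dark and a light percentile of one channel."""
    ordered = sorted(values)
    last = len(ordered) - 1
    return ordered[round(high * last)] - ordered[round(low * last)]

def best_channel(rgb_pixels, candidates=("red", "green")):
    """Pick the candidate channel with the widest ink-to-paper spread.

    Blue is left out of the default candidates, following the advice
    above; the chosen channel's values can then be used as greyscale.
    """
    channels = split_channels(rgb_pixels)
    name = max(candidates, key=lambda c: contrast(channels[c]))
    return name, channels[name]

# Made-up yellowed page: 90% paper (bright in red/green), 10% ink.
page = [(230, 210, 150)] * 90 + [(40, 40, 40)] * 10
name, grey = best_channel(page)
```

The percentile spread only measures contrast, not noise, so it is worth comparing the candidate channels visually on a sample page as well.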

Rather than trying to retouch bad letters, find a good example
of that letter and copy and paste it over the poor letter.

 
Dave said:
Ray wrote:
He *scanned* them as grey scale. There ain't no color channels.

Thanks for the feedback.

This is very useful information to know, since I'm interested in the
more general area of archival quality scanning. One aspect of such
scanning is to be able to do image post-processing. Having full color
information allows one to work in various color channels which may
produce better post-processing results. It makes sense to filter out
the blue for page scan processing since yellowed paper is usually
"brighter" in the reds and greens, thus accentuating the difference
between the paper and the ink.

The downside to working in 24-bit color for page scans is that the
resulting images are much larger than the 8-bit greyscale, requiring
that much more space to store the raw scans.
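To put rough numbers on that, here is a small back-of-the-envelope calculation using the page size and page count from my original post (2550x3900 pixels, 182 pages). It counts uncompressed pixel data only, so actual PNG files will be smaller by an amount that depends on the content.

```python
WIDTH, HEIGHT, PAGES = 2550, 3900, 182

def raw_megabytes(bits_per_pixel, pages=1):
    """Uncompressed size in megabytes (10**6 bytes) of the pixel data."""
    return WIDTH * HEIGHT * bits_per_pixel / 8 * pages / 1_000_000

grey_page = raw_megabytes(8)     # 8-bit greyscale, one page
color_page = raw_megabytes(24)   # 24-bit color, one page
bitonal_page = raw_megabytes(1)  # 1-bit bitonal, one page

grey_book = raw_megabytes(8, PAGES)
color_book = raw_megabytes(24, PAGES)

print(f"per page: {grey_page:.1f} MB grey, {color_page:.1f} MB color")
print(f"whole book: {grey_book:.0f} MB grey, {color_book:.0f} MB color")
```

So before compression, color comes to roughly 29.8 MB per page against 9.9 MB for greyscale, which is three times the raw data for the whole book.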

But one lives and learns.

Thanks again.

Jon Noring
 
Jon said:
This is very useful information to know, since I'm interested in the
more general area of archival quality scanning. One aspect of such
scanning is to be able to do image post-processing. Having full color
information allows one to work in various color channels which may
produce better post-processing results. It makes sense to filter out
the blue for page scan processing since yellowed paper is usually
"brighter" in the reds and greens, thus accentuating the difference
between the paper and the ink.

I agree with you that you have a LOT more options if you scan in color
mode. Not only do you have the option of selecting a channel as you
suggested, but you can also use L*a*b space, blend channels, etc., to
achieve the best results.

In this case, the OP is already in greyscale mode, alas.
 