Tristan Miller
Greetings.
I've volunteered to help a publisher produce a digital archive of their
newspaper. The newspaper has been printed monthly since 1904 on A4 paper,
with about 20 pages per issue. Issues until about 1965 are
black-and-white, then spot colour until around 2003. My task will be to
scan the printed copies (up to about 1995; thereafter I have access to the
original electronic files) and produce OCR'd PDFs for distribution on
CD/DVD/Internet.
I thought I'd ask for some tips or recommendations on the following
aspects:
1) What sort of scanning DPI is typically used nowadays to archive
documents? I have two high-speed professional RICOH scanners which can do
up to 600 dpi.
2) The RICOH devices have a "Text OCR" setting with dropout colour, which I
presume is best for postprocessing the image with OCR software. (The
scanner does not do OCR itself.) The resulting image is a 1-bit TIFF.
There are also settings for grayscale and colour JPEGs.
Any suggestions on what scan settings I should use for the black-and-white
pages, and for the spot-colour pages?
I presume that for the spot-colour pages, I should scan once with the "Text
OCR" setting, for the purpose of OCR, and then once again with the
full-colour JPEG setting for presentation purposes. That is, the JPEG
images will be stitched together to form a PDF, with the OCR text captured
from the TIFF image "underneath".
For the black-and-white pages, would it make any sense to take a similar
approach? That is, should I make a grayscale scan of the page, or will
the 1-bit TIFF look good enough in a PDF?
3) Any recommendations for OCR software? I am working on a GNU/Linux
machine and have gocr and ocrad installed, but don't have much experience
with them. I would prefer to use free/open-source software, but can
obtain an MS-Windows machine and commercial OCR software if necessary. As
mentioned above, I will need the software to be able to make PDFs with
text "underneath" a TIFF or JPEG image. This way the user will see the
original scanned page in his PDF viewer, but will also be able to select
the text with the mouse or search for it with the Find tool.
Because of the huge volume of newspapers I have to process, my primary
criterion for the OCR software is that it should be as close to "batch
mode" as possible -- I want it to run with minimum user interaction.
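To put a rough number on that volume, here is a back-of-the-envelope sketch in Python. It assumes every monthly issue from 1904 through 1995 exists and has exactly 20 A4 pages, and it counts uncompressed image sizes only; the real TIFF and JPEG files will be compressed and much smaller. The function names are just for illustration.

```python
# Rough estimate of scan volume. All inputs are approximations taken from
# the description above; real files will be compressed and smaller.

A4_MM = (210, 297)   # A4 page size in millimetres
MM_PER_INCH = 25.4


def page_pixels(dpi):
    """Pixel width and height of one A4 page scanned at the given DPI."""
    width_mm, height_mm = A4_MM
    return (round(width_mm / MM_PER_INCH * dpi),
            round(height_mm / MM_PER_INCH * dpi))


def uncompressed_bytes(dpi, bits_per_pixel):
    """Approximate raw size of one page image, ignoring row padding."""
    width, height = page_pixels(dpi)
    return width * height * bits_per_pixel // 8


def total_pages(first_year=1904, last_year=1995,
                issues_per_year=12, pages_per_issue=20):
    """Pages to scan, assuming no missing issues (an assumption)."""
    return (last_year - first_year + 1) * issues_per_year * pages_per_issue


if __name__ == "__main__":
    pages = total_pages()
    print(f"Pages to scan: about {pages}")
    for dpi in (300, 400, 600):
        for label, bpp in (("1-bit", 1), ("8-bit grey", 8), ("24-bit colour", 24)):
            per_page_mb = uncompressed_bytes(dpi, bpp) / 1e6
            total_gb = per_page_mb * pages / 1e3
            print(f"{dpi} dpi, {label}: {per_page_mb:.1f} MB/page raw, "
                  f"{total_gb:.0f} GB raw in total")
```

Even allowing for compression, that is tens of thousands of pages, which is why I would rather not click through each one by hand.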
Regards,
Tristan