Scanning a large number of textual pages


Pavel Konev

I have collected more than 15,000 purely textual pages
(no images) over the years.
I want to digitize all of them and store them on my computer.

Could you please recommend an efficient scanner model,
as well as the software needed to do this job?

I am looking for a scanner that is not expensive
(in the range of, say, US$300-400).

Thank you very much in advance.
 
If you just want to store images of the documents, just about any of
today's flatbed scanners will do fine.

All you need is about 300-600 DPI for text documents.

If you store the scanned image as a TIFF (uncompressed, or with lossless
compression such as LZW), you will not lose any of the information in the
scan, no matter how many times the image is re-saved.
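
To get a feel for what lossless storage costs in disk space, here is a rough sketch in Python. It assumes A4 pages (210 x 297 mm) and estimates the raw, uncompressed size of each scan; the page size, the bit depths and the function names are my own assumptions for illustration. A losslessly compressed TIFF of plain text would normally be a good deal smaller than these figures.

```python
import math

MM_PER_INCH = 25.4


def page_pixels(width_mm, height_mm, dpi):
    """Return (width, height) in pixels for a page scanned at the given DPI."""
    return (round(width_mm / MM_PER_INCH * dpi),
            round(height_mm / MM_PER_INCH * dpi))


def uncompressed_bytes(width_px, height_px, bits_per_pixel):
    """Raw image size, with each row padded up to a whole number of bytes."""
    row_bytes = math.ceil(width_px * bits_per_pixel / 8)
    return row_bytes * height_px


if __name__ == "__main__":
    pages = 15_000  # page count from the original question
    for dpi in (300, 600):
        w, h = page_pixels(210, 297, dpi)  # assumed A4 page
        for label, bpp in (("1-bit black and white", 1), ("8-bit greyscale", 8)):
            per_page = uncompressed_bytes(w, h, bpp)
            total_gb = per_page * pages / 1e9
            print(f"{dpi} dpi, {label}: "
                  f"{per_page / 1e6:.1f} MB/page, {total_gb:.0f} GB total")
```

The point is only that resolution and bit depth drive the size: doubling the DPI roughly quadruples the raw data, and greyscale costs eight times as much as 1-bit black and white.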

If you want to have the documents in an editable form, you have to perform
OCR on the scanned documents. OCR only works on typed or computer-printed
documents, not handwritten ones.
OCR programs run from very poor to pretty good.

The best OCR programs are OmniPage and ABBYY FineReader.
http://www.scansoft.com/omnipage/
The best price for OmniPage is found at http://www.scantips.com

http://www.abbyy.com/finereader_ocr/
 
I did a test with the MS Office Document Scanning program and a Canon
D1250 scanner at 300dpi. A document with 1234567890 in Arial at sizes
12, 16, 24, 36 and 48 points was successfully scanned and converted to
digits. The resulting document used 12 points for each sample.
 
Your problems are 1) speed and 2) time, particularly if you need OCR.

Price...
o For 15,000 pages people tend to use document scanners
---- they are sheet-fed and take loose pages, not books or stapled sets
o Unfortunately US$300-400 rules out such scanners
---- eBay is one way, but repair and warranty are an issue

That leaves you with a flatbed scanner:
o You need to identify the fastest for your application
---- dpi - anything will do; you only need 200dpi, 300dpi is ideal
---- lps - lines per second is an important figure for speed
o Some scanners have a slow cycle time between scans
---- some Canon models have a very fast scanning speed in lps
---- however, they go through a light cycle before each scan
---- which reduces the pages per hour you can actually do
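
To illustrate the cycle-time point, here is a small Python sketch. The timings are made-up numbers, not measurements of any real scanner: the idea is just that pages per hour depends on the scan time plus everything that happens between scans.

```python
def pages_per_hour(scan_seconds, overhead_seconds):
    """Pages per hour when each page costs scan time plus per-page overhead.

    overhead_seconds covers everything between scans: lamp cycle,
    carriage return, swapping the sheet on the glass.
    """
    per_page = scan_seconds + overhead_seconds
    if per_page <= 0:
        raise ValueError("time per page must be positive")
    return 3600 / per_page


if __name__ == "__main__":
    # Hypothetical scanner A: fast pass, but a long cycle before each scan.
    print("A:", pages_per_hour(scan_seconds=4, overhead_seconds=14))
    # Hypothetical scanner B: slower pass, but little delay between scans.
    print("B:", pages_per_hour(scan_seconds=8, overhead_seconds=4))
```

With these invented figures the scanner with the slower pass still gets through more pages per hour, which is why lps alone is not enough to compare models.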

Time...
o For 15,000 pages, a typical fast flatbed manages 300 A4 pages/hour
o So you are looking at 50 hours, or 50 days at 1 hour a day
o Bear in mind that each hour will be intensely, numbingly boring
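
The arithmetic behind those figures, as a short Python sketch you can rerun with your own numbers (the function name is just for illustration):

```python
import math


def scanning_schedule(total_pages, pages_per_hour, hours_per_day):
    """Return (total hours, whole days needed) for a scanning job."""
    hours = total_pages / pages_per_hour
    days = math.ceil(hours / hours_per_day)
    return hours, days


if __name__ == "__main__":
    # Figures from the post: 15,000 pages at 300 pages/hour, 1 hour a day.
    hours, days = scanning_schedule(15_000, 300, 1)
    print(f"{hours:.0f} hours, {days} days")
```

Changing `hours_per_day` or the pages-per-hour figure shows quickly how much a faster scanner or longer sessions would shorten the job.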

OCR...
o OCR is still only about 89-92% accurate
---- that sounds good, until you realise how many words you need to correct
---- in reality you'll be correcting the equivalent of 500-1,500 pages
o OCR is very slow
---- it takes a good P4 to get a decent per-page cycle time
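
A rough way to turn an accuracy figure into a correction workload, sketched in Python. It makes the simplifying assumption that accuracy is the share of text that comes out correct, so the rest has to be fixed by hand; the post does not say how its own 500-1,500 page figure was worked out.

```python
def pages_worth_of_errors(total_pages, accuracy_percent):
    """Pages' worth of text needing correction at a given OCR accuracy.

    Assumes accuracy_percent is the share of text recognised correctly.
    Integer arithmetic keeps the result exact.
    """
    if not 0 <= accuracy_percent <= 100:
        raise ValueError("accuracy_percent must be between 0 and 100")
    return total_pages * (100 - accuracy_percent) // 100


if __name__ == "__main__":
    for accuracy in (89, 92):
        print(accuracy, "% ->", pages_worth_of_errors(15_000, accuracy), "pages")
```

Under that assumption, 89-92% accuracy on 15,000 pages works out to roughly 1,200-1,650 pages' worth of corrections, the same order of magnitude as the estimate above.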

You can also scan to *.pdf as an option, which can be useful.

If you can spread the job over a decent period of time, it will be
less painful - particularly if you are planning on doing OCR.

Research the various OCR options carefully if you go that route.
 
Dorothy Bradbury said:
Your problem is 1) speed & 2) particularly time if needing OCR.

[...]

Thanks to all for the help, particularly to you, Dorothy.
Now I have just one more question.
Which exact scanner models, in your opinion, are the most suitable
for the job (say, a couple of devices, not too expensive, from Canon, HP, etc.)?
 