> I have many scanned image files (PDFs at the moment but could change
> if needed) and would like to rename them based on a unique serial
> number that appears in the same place at the top right hand corner of
> each image. The serial number is printed on the original document so
> presumably there should be no trouble with software reading it
> reliably.

I see you haven't had any experience with OCR. There is *no* OCR engine
in existence that gets every single letter/number right 100% of the
time.[0] If you know that certain characters won't appear in this
unique number, you can generally tell your OCR engine to ignore those
characters, and accuracy will go up a bit--but it'll never be 100%.
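
With gocr, for example, that restriction goes in through the -C option. A rough sketch, assuming gocr is installed, that your serial numbers are digits only, and that the cropped image is already in a format gocr reads (PNM). The function name and the file name are made up for illustration:

```shell
#!/bin/sh
# Sketch: OCR one cropped image while telling gocr to consider digits only.
# Assumptions: gocr is on PATH; serial numbers are purely numeric.

ocr_digits_only() {
    # -C limits recognition to the listed characters ("0-9" is a range);
    # -i names the input image.
    gocr -C "0-9" -i "$1"
}

# Example call (serial_crop.pgm is a hypothetical file name):
# ocr_digits_only serial_crop.pgm
```

If your serials also contain letters or dashes, the string given to -C would have to list those as well.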

> Does anyone know of any software that could do this?

The approach outlined below assumes that the serial numbers are always
in a similar place, that no real skew is present, and that the numbers
are typeset. If the numbers are handwritten, skewed, and/or in
different places on each document, it will fail.

For each PDF:

1. Store the filename in OLDNAME.
2. Use Ghostscript to turn the first page of the PDF into a TIFF at
   300 DPI.
3. Use ImageMagick to crop out the area of the page where the serial#
   appears, making sure to add about 100 pixels of slop around the
   cropped area.
4. Feed the resulting TIFF to the OCR engine and store the textual
   result in NEWNAME.
5. Check NEWNAME for correctness; no multiline strings or empty
   strings are allowed.
6. if correct ; then mv "$OLDNAME" "$NEWNAME" ; else echo "$OLDNAME had
   OCR problems, leaving it as is" ; fi

A shell script that did all this with gocr or ocrad would be fairly easy
to put together.
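
Something along these lines, as an untested-on-your-data sketch rather than a finished tool. It assumes gs, ImageMagick's convert, and gocr are on the PATH. The crop box, the set of characters allowed in a serial, and the function names are all placeholders I picked for illustration. Two small departures from the steps above: the crop is written as PGM rather than TIFF, since gocr reads PNM files natively, and ".pdf" is appended to the new name.

```shell
#!/bin/sh
# Sketch of the rename pipeline described above. Assumes gs (Ghostscript),
# convert (ImageMagick) and gocr are on PATH.

# Hypothetical crop box for a US Letter page at 300 DPI (2550x3300 px):
# a 900x400 px region in the top right corner. Measure your own scans.
CROP_GEOMETRY="900x400+1650+0"

# Succeeds only if $1 is a non-empty, single-line string of letters,
# digits and dashes (the allowed set is an assumption; adjust as needed).
is_valid_serial() {
    case "$1" in
        "") return 1 ;;
        *[!A-Za-z0-9-]*) return 1 ;;   # also rejects newlines and spaces
        *) return 0 ;;
    esac
}

rename_by_serial() {
    OLDNAME=$1
    tmp=$(mktemp -d) || return 1

    # First page only, 8-bit grayscale TIFF at 300 DPI.
    gs -q -dNOPAUSE -dBATCH -dFirstPage=1 -dLastPage=1 \
       -sDEVICE=tiffgray -r300 -sOutputFile="$tmp/page.tif" "$OLDNAME"

    # Crop the serial number area; PGM because gocr reads PNM natively.
    convert "$tmp/page.tif" -crop "$CROP_GEOMETRY" +repage "$tmp/serial.pgm"

    serial=$(gocr -i "$tmp/serial.pgm")
    rm -rf "$tmp"

    # Rename only if the OCR result looks sane and would not overwrite
    # an existing file (two scans read as the same serial, for instance).
    NEWNAME="$(dirname "$OLDNAME")/$serial.pdf"
    if is_valid_serial "$serial" && [ ! -e "$NEWNAME" ]; then
        mv "$OLDNAME" "$NEWNAME"
    else
        echo "$OLDNAME had OCR problems, leaving it as is" >&2
    fi
}

# Usage: for f in *.pdf; do rename_by_serial "$f"; done
```

Run it on copies of a handful of files first and eyeball the results; a string that passes the check can still be a misread serial.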
Doing this using 'DozeXP would be slightly more difficult, since that
doesn't ship with a good shell and 'Doze OCR engines aren't quite as
amenable to scripting. You could probably patch something together
using ActiveState Perl and the apps I mentioned above (Ghostscript and
ImageMagick are Free for all OSes), though.

HTH,

[0] The account numbers on checks are printed using a special font that
was designed to be easily OCRed. The Post Orifice uses a gigantic
database of valid addresses; if the OCR results from 1 item don't match
any of those addresses, the item gets shunted aside and looked at by a
human.
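
If you happen to have a list of the serial numbers that were actually issued, the same trick works on a small scale: accept an OCR result only when it matches an entry in the list, and set everything else aside for a human. A minimal sketch, assuming a hypothetical file valid_serials.txt with one serial per line (the function name is also made up):

```shell
#!/bin/sh
# Sketch: accept an OCR result only if it appears in a list of known serials.
# Assumption: valid_serials.txt (hypothetical) holds one valid serial per line.

is_known_serial() {
    # -F: fixed string, -x: the whole line must match, -q: exit status only.
    # The -n test guards against an empty pattern, which grep -F would
    # otherwise match against every line.
    [ -n "$1" ] && grep -Fxq -e "$1" "${2:-valid_serials.txt}"
}

# Example: is_known_serial "$serial" || echo "needs a human: $serial"
```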