dskeeles
Hi,
For some years now, I've been scanning all my old documents for
archiving purposes. Bank statements, old conference notes, invoices,
stapled receipts, warranties, work logbooks... anything that needs
storing, or could be useful in the future, but not in the present. With
storage media ever shrinking, the scans take a lot less space than the physical
papers. Also, my reasoning was that at some point in the future, OCR
and computers would progress enough that I'd be able to turn them into
searchable documents. But over the years, I've never found a solution
that gave good text recognition, with no loss of information.
Now, finally, I seem to have found a solution, and surprisingly, it's
Microsoft's .MDI format.
It seems ideal. With .MDI and Office Imaging 2003, the focus is still
on the scanned image, hence the original image is retained and no
information is lost; however, using the OCR, Imaging recognises all the
text that it can, and stores this within the same file. Drag-selecting
over the image selects the underlying text, which can be copied and
pasted into documents, and is also searchable.
I've tried many other apps over the years. Acrobat Distiller into PDF
is OK, but where it recognises a word, including those it gets
wrong, it replaces the image with the text - the result, on anything
other than clear documents, is a mess. MS Imaging, on the other hand,
does not modify the original image.
ABBYY FineReader is excellent at its job, but is also focused on
converting, then discarding the original, and so losing any
unintelligible information. Other apps are similar.
With .MDI, the file size seems comparable to PaperPort MAX, which I
currently use, as it was the best non-lossy compression I could find at
the time (200 dpi greyscale is my standard archiving format). The text
'subchannel' seems to add little extra overhead to this. The emphasis
is on the original image, which renders quite quickly, but the text can
be searched by a Windows Explorer 'Containing text...' find. And after
a few comparisons, I've found that the Imaging OCR is closely comparable
with FineReader 7, apparently one of the best current OCR engines, and
usually has no problems with 'reasonably' clear text.
I just want to ask: am I missing something? Are there other commercial
apps out there (for home users) that provide this functionality, but
better? The .MDI format seems not to have taken off at all - no third-party
apps support it, and there's hardly any discussion of it. A key need
for me is to be able to index and search these files alongside my other
documents, which makes me wonder whether Google Desktop Search, Blinkx, etc.
do or will support it. (I use Deductus to index my hundreds of CDs,
DVDs, and data sources, and that doesn't support it either...)
Cheers,
Damian