Format for Searchable Document Storage - is .MDI the answer?

  • Thread starter: dskeeles

dskeeles

Hi,

For some years now, I've been scanning all my old documents for
archiving purposes. Bank statements, old conference notes, invoices,
stapled receipts, warranties, work logbooks... anything that needs
storing, or could be useful in the future, but not in the present. With
storage media ever-shrinking, the scans take a lot less space than the
physical papers. Also, my reasoning was that at some point in the future, OCR
and computers would progress enough that I'd be able to turn them into
searchable documents. But over the years, I've never found a solution
that gave good text recognition, with no loss of information.

Now, finally, I seem to have found a solution, and surprisingly, it's
Microsoft .MDI format....


It seems ideal. With .MDI and Office Imaging 2003, the focus is still
on the scanned image, hence the original image is retained and no
information is lost; however, using the OCR, Imaging recognises all the
text that it can, and stores this within the same file. Drag-selecting
over the image selects the underlying text, which can be copied and
pasted into documents, and is also searchable.

I've tried many other apps over the years. Acrobat Distiller into PDF
is OK, but where it recognises a word, including those where it's
wrong, it replaces the image with the text - the result on anything
other than clear documents, is a mess. MS Imaging, on the other hand,
does not modify the original image.

ABBYY FineReader is excellent at its job, but is also focussed on
converting, then discarding the original, and so losing any
unintelligible information. Other apps are similar.


With .MDI - the file size seems comparable to Paperport MAX, which I
currently use, as it was the best non-lossy compression I could find at
the time (200dpi Greyscale is my standard archiving format). The text
'subchannel' seems to add little extra overhead to this. The emphasis
is on the original image, which renders quite quickly, but the text can
be searched by a Windows Explorer 'Containing text...' find. And after
a few comparisons, I've found that the Imaging OCR is very comparable
with Finereader 7, apparently one of the best current OCR engines - and
usually has no problems with 'reasonably' clear text.


I just want to ask: am I missing something? Are there other commercial
apps out there (for home users) that provide this functionality, but
better? .MDI format seems not to have taken off at all - no third-party
apps support it, and there's hardly any discussion of it. A key need
for me is to be able to index and search these files alongside my other
documents; which makes me wonder if Google Desktop search, Blinkx, etc.
do or will support it? (I use Deductus to index my hundreds of CDs,
DVDs, and data sources, and that doesn't support it either...)
Cheers,


Damian
 
You have found a good solution, but you are missing one of the modes in
Acrobat PDF: there is a mode where the image is kept and the OCR'ed text
is stored alongside it in the same file.
http://www.adobe.com/products/acrobat/matrix.html
Acrobat Professional is the most expensive, but also does the most things.

If the text is not readable or incorrect the image is available to compare
to the OCR'ed result.

From Adobe Tips & Tutorials:
If you want to be able to edit the text you've scanned into Acrobat, you can
convert an Adobe PDF image-only file to one of three formats:

Formatted Text and Graphics, used for most standard PDF files, replaces
bitmapped text with editable text in actual fonts that look similar to the
ones in the original document.

Searchable Image (Exact) retains the bitmapped appearance of the original
document, and the searchable text is supplied on an invisible layer below
the bitmap.

Searchable Image (Compact) segments the original image to allow different
areas to be compressed, sacrificing image quality but resulting in a smaller
file.

The second option, Searchable Image (Exact), stores both the image and the
OCR'ed text.
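
As a rough illustration of what a searchable-image file holds, here is a
minimal Python sketch. It is not how Acrobat or Imaging store things
internally; the class names, fields, and coordinates are all made up for the
example. The point is only that the scan is kept untouched while the
recognised words sit beside it with their positions, so both searching and
drag-selecting work off the text layer.

```python
from dataclasses import dataclass


@dataclass
class Word:
    """One OCR'ed word and its bounding box on the page (hypothetical layout)."""
    text: str
    x0: float
    y0: float
    x1: float
    y1: float


@dataclass
class SearchablePage:
    """A scanned page: the original bitmap plus an invisible text layer."""
    image: bytes        # the scan itself, never modified by OCR
    words: list[Word]   # what the OCR engine managed to recognise

    def find(self, term: str) -> list[Word]:
        """Case-insensitive search over the text layer only."""
        needle = term.lower()
        return [w for w in self.words if needle in w.text.lower()]

    def select(self, x0: float, y0: float, x1: float, y1: float) -> str:
        """Text of every word whose box overlaps the dragged rectangle."""
        hits = [
            w for w in self.words
            if w.x0 < x1 and w.x1 > x0 and w.y0 < y1 and w.y1 > y0
        ]
        return " ".join(w.text for w in hits)
```

If the OCR gets a word wrong, only the entry in `words` is wrong; the bitmap
in `image` is still there to check against.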
 
Thanks - I'll take a look. This dovetails nicely with Scansoft
Paperport 10, which uses PDF as a native format, and the package is the
same as I used to use for managing these documents (albeit version 4, in
.MAX format). On top of that, Deductus supports PDF scanning and
indexing.

My one concern is file size - I'll see how PDF compares against MDI for
high quality image storage... and also see how the OCR compares between
Imaging 2003 and Paperport 10... assuming Scansoft have a trial
edition...


Damian
 
dskeeles said:
I just want to ask; am I missing something? Are there other commercial
apps out there (for home users) that provide this functionality, but
better? .MDI format seems not to have taken off at all - no third-party
apps support it, and there's hardly any discussion of it. A key need
for me is to be able to index and search these files alongside my other
documents; which makes me wonder if Google Desktop search, Blinkx, etc.
do or will support it? (I use Deductus to index my hundreds of CDs,
DVDs, and data sources, and that doesn't support it either...)

Aside from the fact that you've missed a .pdf mode that does what you
want, another question you're missing is "will I still be able to read
files of this format in the future when I need to access them?". This
is not only a question of whether the index-and-search applications will
be there, but of whether a basic reader itself will still be around
ten or twenty years in the future.

There are those who would claim that, in order to have some assurance of
being able to read the file in the future, you need to use a format
that's defined by openly-published specifications. I'm not sure of the
truth of this claim, but it's worth considering. Is .mdi an open
specification like .pdf?

- Brooks
 
That is a very good point, and I would admit that MDI doesn't seem to
have much apparent support yet, from looking through websites and
common Document Management Applications.

However - I'm reluctant to ditch it, because I've found the OCR in
Microsoft Imaging appears to be better than others (i.e. Acrobat 7,
Paperport) in certain circumstances - especially, when scanning
'non-standard' documents with various noise or clutter in the
background. For example: various receipts or boarding passes, stapled
in various orientations on a piece of A4 and then scanned. Since I have
a large number of these, it's useful to have this feature.

True, I could save in Multi-page LZW TIFF, which Imaging also supports,
but it raises the size 5- to 10-fold.
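
To put a fold increase on the same footing as a percentage saving: a file
that is 5 times larger means the smaller one saves 80%, and 10 times larger
means it saves 90%. A minimal sketch of the arithmetic (the function name is
made up for illustration):

```python
def percent_smaller(fold: float) -> float:
    """Saving, in percent, when the big file is `fold` times the small one."""
    if fold <= 0:
        raise ValueError("fold must be positive")
    return (1 - 1 / fold) * 100


for fold in (5, 10):
    print(f"{fold}-fold larger TIFF: the other file is "
          f"{percent_smaller(fold):.0f}% smaller")
```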

Anyway - in the end, I decided to run a full quantitative comparison
between Acrobat, Paperport, and Imaging, on 18 different samples of
typical types of documents that I have (technical notes, training
notes, shop receipts, letters, bank statements, boarding passes, etc),
and see how they fared. The main comparison was in picking out 5-6
words from each that I would usually like to search for (e.g. finding a
receipt by the item or company name, or technical keyword from a
manual) - and then trying to search for these using whatever OCR and
search engine the software provided. In the end, I concluded:

- For Training Manuals, Reference Guides, letters, contracts, and other
'organised' documents - Acrobat 7 OCR is most accurate. The resultant
PDF (lossy, but at maximum quality) is typically 50% smaller in size
than .MDI or (lossless) .MAX, and OCR is typically 30% faster than
Imaging.
- For Receipts, Notes, documents with convoluted backgrounds or very
faint text, and other miscellaneous documents - Microsoft Imaging is
most accurate. The MDI format (also lossy), with text, is typically
60-75% smaller than the original non-text LZW .TIF (actually not as
small as I was expecting).
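
For anyone wanting to repeat this kind of comparison, the scoring can be
sketched in a few lines of Python. This is a sketch under assumptions, not
necessarily the procedure used above: the function names, the engine labels
and the sample OCR strings are invented for illustration, and a "hit" is
taken to be a case-insensitive substring match on the OCR output.

```python
from collections import defaultdict


def keyword_hit_rate(ocr_text: str, keywords: list[str]) -> float:
    """Fraction of the chosen search words that appear in the OCR output."""
    text = ocr_text.lower()
    found = sum(1 for k in keywords if k.lower() in text)
    return found / len(keywords)


def compare_engines(samples: list[dict]) -> dict:
    """Average hit rate per document category and per OCR engine.

    Each sample is a dict with:
      'category' - e.g. 'receipt' or 'manual'
      'keywords' - the 5-6 words you would normally search for
      'ocr'      - {engine_name: text that engine recognised}
    """
    rates = defaultdict(lambda: defaultdict(list))
    for s in samples:
        for engine, text in s["ocr"].items():
            rates[s["category"]][engine].append(
                keyword_hit_rate(text, s["keywords"])
            )
    return {
        cat: {eng: sum(r) / len(r) for eng, r in engines.items()}
        for cat, engines in rates.items()
    }


# Invented sample: one receipt, two hypothetical engines.
samples = [
    {
        "category": "receipt",
        "keywords": ["kettle", "Acme"],
        "ocr": {
            "engine_a": "ACME STORES  kettle  19.99",
            "engine_b": "ACRNE ST0RES  kett1e  19.99",
        },
    },
]
results = compare_engines(samples)
```

Grouping by category is what makes a split result visible, where one engine
wins on tidy documents and another on cluttered ones.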

So - I'm planning to split my documents into one camp or the other (I
tend to handle them in different ways anyway due to their nature), and
process each accordingly. Still: before I process them, I will take
another look to see if there's anything that can handle the receipts
more effectively, and still store them in PDF.


Damian
 