To my mind it is a mistake to expect to get perfect OCR into Word. The
latter is such a pig in how it handles things.
Not if you understand how Word works. The Abbyy programmers don't. And
their PDFs are also pretty useless. In fact the reason I have to
export from Abbyy to Word rather than to PDF is so I can get it right
in Word prior to converting the Word file into a PDF. If they could
produce a perfect PDF from Abbyy I wouldn't need to export to Word
first.
If the OCR gets the words right, I would investigate other ways of
improving the file's format, like writing macros to find and replace
font codes, for example. Then there are the styles that you can create
in Word.
The macros would need access to the OCR output for the original
graphics file, otherwise they could not possibly know how the
original document was supposed to look. In fact that is exactly what
Abbyy *have* done: they have written a Word macro to do just what you
say, but they've made a mess of it. If I worked for Abbyy I could
certainly do a much better job of writing the macro than they have
done, but I don't work for them.
Don't forget, using a scanner is itself throwing in a lot of
variables. How does the software know whether or not it should use
11pt or 12pt?
By measuring the distance from the top to the bottom and from the left
to the right of the characters. I'm not saying it's trivial, but it
could be done, though only from within the OCR package because it would
have to analyse the original graphics file in great detail.
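To make the arithmetic concrete, here is a rough Python sketch of the
measurement. It is not how Abbyy does it. The function names, the list
of common sizes, and the 0.7 ratio of capital-letter height to point
size are all my own assumptions (the ratio varies by typeface). The
only firm fact used is that a point is 1/72 of an inch.

```python
# Rough sketch: estimate a font's point size from the pixel height of
# its capital letters in a scan. All names here are hypothetical.

# Assumed list of sizes a typical document might use.
COMMON_SIZES_PT = (8, 9, 10, 11, 12, 14, 16, 18, 20, 24)


def estimate_point_size(cap_height_px, dpi, cap_height_ratio=0.7):
    """Convert a measured capital-letter height in pixels to points.

    cap_height_ratio is an ASSUMPTION: the fraction of the nominal
    point size taken up by a capital letter. It differs per typeface.
    """
    if cap_height_px <= 0 or dpi <= 0 or cap_height_ratio <= 0:
        raise ValueError("all measurements must be positive")
    cap_height_pt = cap_height_px / dpi * 72.0  # 72 points per inch
    return cap_height_pt / cap_height_ratio


def snap_to_common_size(size_pt, candidates=COMMON_SIZES_PT):
    """Pick the candidate size closest to the raw estimate."""
    return min(candidates, key=lambda c: abs(c - size_pt))


if __name__ == "__main__":
    for px in (32, 35):
        raw = estimate_point_size(px, dpi=300)
        print(px, "px ->", round(raw, 2), "pt, snapped to",
              snap_to_common_size(raw))
```

Under these assumptions, at 300 dpi the capitals of 11pt and 12pt text
differ by only about 3 pixels in height, which is why the scan has to
be analysed so carefully.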
In fact they do generally get the point sizes right, but not the page
margins, line spacing, inter-paragraph spacing, and many other aspects
of the layout. And they use silly techniques that show they don't know
Word very well: using columns when they should use borderless tables,
drawing lines when they should use tables, and using exact line
spacing when they should use single or multiple line spacing. I could
give many other examples.
Dave