Is there any OCR software that can cope with complex page layouts?

dave.logan1

I have Abbyy FineReader Pro 7, and although it's good on text
recognition, it's awful at reproducing the original layout when you
save the results as a PDF or a Word file. Here are two examples of
documents it completely messes up when saving in PDF or Word format:
http://tinyurl.com/2fc3so and http://tinyurl.com/22parz

Are there any OCR packages out there that can cope with documents like
these, in terms of reproducing the original layout accurately when
saving in PDF or Word format?

Dave
 
Preserving format is the hardest bit of OCR. It is never perfect.

Recent versions of OmniPage seem to be moving towards better quality.

I doubt if you will ever get to a stage where you don't have to do
some tweaking of the output.

MK
 
Hi MK

On 6 Apr 2007 23:08:02 -0700, (e-mail address removed) wrote:
> Preserving format is the hardest bit of OCR. It is never perfect.
>
> Recent versions of OmniPage seem to be moving towards better quality.
>
> I doubt if you will ever get to a stage where you don't have to do
> some tweaking of the output.
>
> MK

Would you say the latest versions of OmniPage are sufficiently better
than Abbyy in this respect to justify the price of buying Omni when I
already have Abbyy?

With Abbyy, even an ordinary longish letter takes me half an hour to
tweak in Word after OCRing it (adjusting the page margins, font sizes,
space between paras, etc.), and I find that complex documents are
actually quicker to retype from scratch than to scan with OCR using
Abbyy. Nevertheless I don't have money to throw around and Omni would
have to improve things dramatically in this respect to justify the
price of switching over.

Dave
 
> Hi MK
>
> Would you say the latest versions of OmniPage are sufficiently better
> than Abbyy in this respect to justify the price of buying Omni when I
> already have Abbyy?

-----End Quoted (and cut) Message-----


No idea. I have only ever used OmniPage, and I am one version behind
on that.

To my mind it is a mistake to expect to get perfect OCR into Word. The
latter is such a pig in how it handles things.

If the OCR gets the words right, I would investigate other ways of
improving the file's format, like writing macros to find and replace
font codes, for example. Then there are the styles that you can create
in Word.

This might take time, but set up a global system like this and you
won't have to spend so much effort on individual files.
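
For what it's worth, a "global system" of this kind can be as small as one table of rules applied to every file. Here is a rough sketch in Python rather than Word's own macro language. The `Para` record, the style names and the size thresholds are all made up for illustration; none of this is something OmniPage, Abbyy or Word provides.

```python
from dataclasses import dataclass


@dataclass
class Para:
    """One paragraph as the OCR delivered it (hypothetical record)."""
    text: str
    font_pt: float      # font size the OCR assigned, in points
    bold: bool = False


# Hypothetical house rules, checked in order:
# (minimum size in points, must be bold?, style name to apply)
RULES = [
    (16.0, False, "Heading 1"),
    (13.0, True, "Heading 2"),
]
DEFAULT_STYLE = "Body Text"


def pick_style(p: Para) -> str:
    """Return the first style whose rule the paragraph satisfies."""
    for min_pt, needs_bold, style in RULES:
        if p.font_pt >= min_pt and (p.bold or not needs_bold):
            return style
    return DEFAULT_STYLE


def normalise(paras):
    """Replace each paragraph's ad hoc formatting with a named style."""
    return [(pick_style(p), p.text) for p in paras]
```

The point of doing it this way is that the rules live in one place, so a stray 11.3pt or 11.6pt from the OCR ends up as the same body style in every document.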

Don't forget, using a scanner is itself throwing in a lot of
variables. How does the software know whether or not it should use
11pt or 12pt?

MK
 
> To my mind it is a mistake to expect to get perfect OCR into Word. The
> latter is such a pig in how it handles things.

Not if you understand how Word works. The Abbyy programmers don't. And
their PDFs are also pretty useless. In fact the reason I have to
export from Abbyy to Word rather than to PDF is so I can get it right
in Word prior to converting the Word file into a PDF. If they could
produce a perfect PDF from Abbyy I wouldn't need to export to Word
first.

> If the OCR gets the words right, I would investigate other ways of
> improving the file's format, like writing macros to find and replace
> font codes, for example. Then there are the styles that you can create
> in Word.

The macros would have to have access to the OCR output code for the
original graphics file, otherwise it could not possibly know how the
original document was supposed to look. In fact that is exactly what
Abbyy *have* done, they have written a Word macro to do just what you
say, but they've made a mess of it. If I worked for Abbyy I could
certainly do a much better job of writing the macro than they have
done, but I don't work for them.

> Don't forget, using a scanner is itself throwing in a lot of
> variables. How does the software know whether or not it should use
> 11pt or 12pt?

By measuring the distance from the top to the bottom and from the left
to the right of the characters. I'm not saying it's trivial, but it
could be done, though only from within the OCR package because it would
have to analyse the original graphics file in great detail.
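
To put rough numbers on that: a point is 1/72 inch, so at a known scan resolution a measured height in pixels converts straight to points. The sketch below is only an illustration of the arithmetic. The `height_to_em` factor (how much of the nominal font size the measured ascender-to-descender height covers) differs between typefaces and is an assumption here, as is the list of candidate sizes.

```python
# Candidate sizes to snap to (an assumed list, not taken from any OCR package).
COMMON_SIZES = (8, 9, 10, 11, 12, 14, 16, 18, 24)


def px_to_pt(pixels: float, dpi: float) -> float:
    """Convert a length in scan pixels to points (1 pt = 1/72 inch)."""
    return pixels * 72.0 / dpi


def estimate_font_size(height_px: float, dpi: float,
                       height_to_em: float = 1.0) -> float:
    """Guess the nominal font size from a measured glyph height.

    height_px    -- top of ascenders to bottom of descenders, in pixels
    dpi          -- scan resolution
    height_to_em -- assumed fraction of the em that this height covers
    """
    raw_pt = px_to_pt(height_px, dpi) / height_to_em
    # Snap to the nearest common size to absorb scanning noise.
    return min(COMMON_SIZES, key=lambda size: abs(size - raw_pt))


print(estimate_font_size(50, 300))  # 12
print(estimate_font_size(46, 300))  # 11
```

At 300 dpi one point is only about four pixels, so telling 11pt from 12pt depends on a clean scan and a sensible value for the typeface-dependent factor.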

In fact they do generally get the point sizes right, but not the page
margins, line spacing, inter-paragraph spacing, and many other aspects
of the layout. And they use silly techniques that show they don't know
Word very well, like using columns when they should use borderless
tables, drawing lines when they should use tables, using exact line
spacing when they should use single or multiple line spacing, and I
could give many other examples.
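
Page margins, for instance, fall out of the bounding boxes of the recognised text almost for free. This is a minimal sketch, assuming the OCR engine can report each text block as a `(left, top, right, bottom)` box in pixels with the origin at the top-left of the page; the function name and data layout are mine, not Abbyy's.

```python
def page_margins_in(boxes, page_w_px, page_h_px, dpi):
    """Estimate page margins in inches from text bounding boxes.

    boxes -- iterable of (left, top, right, bottom) in pixels,
             origin at the top-left corner of the scanned page
    """
    boxes = list(boxes)
    if not boxes:
        raise ValueError("no text boxes on page")
    left = min(b[0] for b in boxes)
    top = min(b[1] for b in boxes)
    right = max(b[2] for b in boxes)
    bottom = max(b[3] for b in boxes)
    return {
        "left": left / dpi,
        "top": top / dpi,
        "right": (page_w_px - right) / dpi,
        "bottom": (page_h_px - bottom) / dpi,
    }
```

A real package would have to ignore headers, footers and specks of dirt before taking the extremes, but the principle is no harder than this.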

Dave
 
You know all the answers so why ask the questions?

MK

 
On 16 Apr, 17:57, (e-mail address removed) wrote:
> You know all the answers so why ask the questions?
>
> MK


The question I don't know the answer to is, is there any OCR software
that can cope with complex page layouts?

Dave

Hi, I just found this thread since I had the same problem. What I usually do as a workaround is create both a plain-text file and a searchable PDF or DjVu document from my scans. These documents preserve the original page structure. A good place to do so is ocrgeek.com, for example.
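
If you would rather do this locally, the same pair of outputs can be produced with the open-source Tesseract command-line tool, which nobody in this thread has mentioned, so treat it as my substitution. The sketch below assumes `tesseract` is installed and on the PATH; the helper names are made up. The searchable PDF keeps the scanned image as the visible page, with the recognised text as an invisible layer.

```python
import shutil
import subprocess


def tesseract_command(image_path: str, output_base: str, lang: str = "eng"):
    """Build the command line; 'txt' and 'pdf' select the two output formats."""
    return ["tesseract", image_path, output_base, "-l", lang, "txt", "pdf"]


def ocr_to_text_and_pdf(image_path: str, output_base: str, lang: str = "eng"):
    """Run Tesseract once and return the paths of the .txt and .pdf it writes."""
    if shutil.which("tesseract") is None:
        raise RuntimeError("tesseract not found on PATH")
    subprocess.run(tesseract_command(image_path, output_base, lang), check=True)
    return output_base + ".txt", output_base + ".pdf"
```

Calling `ocr_to_text_and_pdf("scan.png", "scan")` should leave `scan.txt` and `scan.pdf` next to each other, assuming the scan is in a format Tesseract can read.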
 