Is there any OCR software that can cope with complex page layouts?

dave.logan1 · Apr 7, 2007

I have ABBYY FineReader Pro 7, and although it's good at text
recognition, it's awful at reproducing the original layout when you
save the results as a PDF or a Word file. Here are two examples of
documents it completely messes up when saving in PDF or Word format:
http://tinyurl.com/2fc3so and http://tinyurl.com/22parz

Are there any OCR packages out there that can cope with documents like
these, in terms of reproducing the original layout accurately when
saving in PDF or Word format?

Dave

Guest · Apr 7, 2007

> I have ABBYY FineReader Pro 7, and although it's good at text
> recognition, it's awful at reproducing the original layout when you
> save the results as a PDF or a Word file. Here are two examples of
> documents it completely messes up when saving in PDF or Word format:
> http://tinyurl.com/2fc3so and http://tinyurl.com/22parz
>
> Are there any OCR packages out there that can cope with documents like
> these, in terms of reproducing the original layout accurately when
> saving in PDF or Word format?
>
> Dave

Preserving format is the hardest bit of OCR. It is never perfect.

Recent versions of OmniPage seem to be moving towards better quality.

I doubt if you will ever get to a stage where you don't have to do
some tweaking of the output.

MK

dave.logan1 · Apr 11, 2007

Hi MK

On 6 Apr 2007 23:08:02 -0700, (e-mail address removed) wrote:
> Preserving format is the hardest bit of OCR. It is never perfect.
>
> Recent versions of OmniPage seem to be moving towards better quality.
>
> I doubt if you will ever get to a stage where you don't have to do
> some tweaking of the output.
>
> MK

Would you say the latest versions of OmniPage are sufficiently better
than ABBYY in this respect to justify the price of buying Omni when I
already have ABBYY?

With ABBYY, even an ordinary longish letter takes me half an hour to
tweak in Word after OCRing it (adjusting the page margins, font sizes,
space between paras, etc.), and I find that complex documents are
actually quicker to retype from scratch than to scan with OCR using
ABBYY. Nevertheless, I don't have money to throw around, and Omni would
have to improve things dramatically in this respect to justify the
price of switching over.

Dave

Guest · Apr 15, 2007

> Hi MK
>
> Would you say the latest versions of OmniPage are sufficiently better
> than ABBYY in this respect to justify the price of buying Omni when I
> already have ABBYY?

-----End Quoted (and cut) Message-----

No idea. I have only ever used OmniPage, and I am one version behind
on that.

To my mind it is a mistake to expect to get perfect OCR into Word. The
latter is such a pig in how it handles things.

If the OCR gets the words right, I would investigate other ways of
improving the file's format, like writing macros to find and replace
font codes, for example. Then there are the styles that you can create
in Word.

This might take time, but set up a global system like this and you
won't have to spend so much effort on individual files.
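
As a rough illustration of the "find and replace font codes" idea, here is a minimal sketch. It swaps in a different technique from a Word macro: it assumes the OCR result has been saved as RTF and works on the raw text in Python. RTF stores font size in half-points, so \fs24 means 12pt. The list of allowed sizes and the function name are made up for the example.

```python
import re

# Point sizes the cleaned-up document may use (an assumption for this sketch).
ALLOWED_POINT_SIZES = (10, 11, 12, 14)


def snap_rtf_font_sizes(rtf_text, allowed=ALLOWED_POINT_SIZES):
    """Replace every RTF font-size code with the nearest allowed size."""

    def replace(match):
        # RTF font sizes are written in half-points: \fs24 is 12pt.
        half_points = int(match.group(1))
        nearest = min(allowed, key=lambda pt: abs(pt * 2 - half_points))
        return "\\fs" + str(nearest * 2)

    # Match a backslash, "fs", then the numeric size.
    return re.sub(r"\\fs(\d+)", replace, rtf_text)
```

Run once over each exported file, something like this would pull stray sizes such as 12.5pt or 13.5pt back to the sizes you actually want. It knows nothing about margins or paragraph spacing.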

Don't forget, using a scanner is itself throwing in a lot of
variables. How does the software know whether or not it should use
11pt or 12pt?

MK

dave.logan1 · Apr 15, 2007

> To my mind it is a mistake to expect to get perfect OCR into Word. The
> latter is such a pig in how it handles things.

Not if you understand how Word works. The ABBYY programmers don't. And
their PDFs are also pretty useless. In fact, the reason I have to
export from ABBYY to Word rather than to PDF is so I can get it right
in Word prior to converting the Word file into a PDF. If they could
produce a perfect PDF from ABBYY I wouldn't need to export to Word
first.

> If the OCR gets the words right, I would investigate other ways of
> improving the file's format, like writing macros to find and replace
> font codes, for example. Then there are the styles that you can create
> in Word.

The macros would have to have access to the OCR output code for the
original graphics file, otherwise they could not possibly know how the
original document was supposed to look. In fact that is exactly what
ABBYY *have* done: they have written a Word macro to do just what you
say, but they've made a mess of it. If I worked for ABBYY I could
certainly do a much better job of writing the macro than they have
done, but I don't work for them.

> Don't forget, using a scanner is itself throwing in a lot of
> variables. How does the software know whether or not it should use
> 11pt or 12pt?

By measuring the distance from the top to the bottom and from the left
to the right of the characters. I'm not saying it's trivial, but it
could be done, though only from within the OCR package, because it would
have to analyse the original graphics file in great detail.
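
To make the measuring idea concrete, here is a minimal sketch of the arithmetic. It is not how any particular OCR package does it. It assumes the scan resolution is known, that the height of a capital letter has been measured in pixels, and that capitals are about 0.7 of the nominal font size, a ratio that varies by typeface. The function name and the default ratio are illustrative only.

```python
def estimate_point_size(cap_height_px, scan_dpi, cap_height_ratio=0.7):
    """Estimate a nominal font size in points from a measured capital height.

    cap_height_ratio is an assumed typeface property (capital height
    divided by font size); 0.7 is only a typical value.
    """
    # Pixels to inches via the scan resolution, then inches to points.
    cap_height_pt = cap_height_px / scan_dpi * 72.0
    # Scale the capital height up to the nominal font size.
    font_size_pt = cap_height_pt / cap_height_ratio
    # Snap to the nearest half point.
    return round(font_size_pt * 2) / 2
```

Under these assumptions, a 35-pixel capital on a 300 dpi scan comes out at 12pt and a 32-pixel capital at 11pt. The two sizes are only about three pixels apart at that resolution, which is why a blurry or low-resolution scan makes them hard to tell apart.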

In fact they do generally get the point sizes right, but not the page
margins, line spacing, inter-paragraph spacing, and many other aspects
of the layout. And they use silly techniques that show they don't know
Word very well, like using columns when they should use borderless
tables, drawing lines when they should use tables, using exact line
spacing when they should use single or multiple line spacing, and I
could give many other examples.
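
Margins and line spacing can be recovered from the same kind of measurement. The sketch below assumes the OCR stage can supply a bounding box for each piece of text as (left, top, right, bottom) in pixels, plus the baseline position of each line. The function names and input format are invented for the example and do not come from any real OCR package.

```python
from statistics import median


def page_margins_in(boxes, page_width_px, page_height_px, dpi):
    """Return page margins in inches from text bounding boxes in pixels."""
    left = min(box[0] for box in boxes)
    top = min(box[1] for box in boxes)
    right = max(box[2] for box in boxes)
    bottom = max(box[3] for box in boxes)
    return {
        "left": left / dpi,
        "top": top / dpi,
        "right": (page_width_px - right) / dpi,
        "bottom": (page_height_px - bottom) / dpi,
    }


def line_pitch_pt(baselines_px, dpi):
    """Return the typical baseline-to-baseline distance in points."""
    ys = sorted(baselines_px)
    gaps = [lower - upper for upper, lower in zip(ys, ys[1:])]
    # The median ignores the occasional larger gap between paragraphs.
    return median(gaps) / dpi * 72.0
```

Dividing the line pitch by the font size gives a spacing multiple, which could then be mapped to single or multiple line spacing in Word instead of an exact value.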

Dave

Guest · Apr 16, 2007

You know all the answers so why ask the questions?

MK

dave.logan1 · Apr 16, 2007

> You know all the answers so why ask the questions?
>
> MK

The question I don't know the answer to is, is there any OCR software
that can cope with complex page layouts?

Dave

textix · Aug 24, 2014

(e-mail address removed) said:
> On 16 Apr, 17:57, (e-mail address removed) wrote:
> > You know all the answers so why ask the questions?
> >
> > MK
>
> The question I don't know the answer to is, is there any OCR software
> that can cope with complex page layouts?
>
> Dave

Hi, I just found this thread since I had the same problem. What I usually do as a workaround is create both a plain-text file and a searchable PDF or DjVu document from my scanned stuff. These documents preserve the original page structure. A good place to do so is ocrgeek.com, for example.
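
For anyone who prefers to do this locally, here is a minimal sketch using the open-source Tesseract command-line tool. That is a substitution on my part: it is not the service named above, and it produces PDF rather than DjVu. The sketch assumes Tesseract is installed and on the PATH, and the file names are placeholders.

```python
import subprocess


def build_ocr_command(image_path, output_base, language="eng"):
    """Build a Tesseract call that writes both plain text and a PDF."""
    # "txt" and "pdf" are Tesseract's bundled output configs. Listing both
    # requests output_base.txt and output_base.pdf from a single run.
    return ["tesseract", image_path, output_base, "-l", language, "txt", "pdf"]


def run_ocr(image_path, output_base, language="eng"):
    """Run Tesseract and raise an error if it exits with a failure code."""
    subprocess.run(build_ocr_command(image_path, output_base, language), check=True)
```

The PDF produced this way keeps the scanned page image and puts the recognised text in a hidden layer behind it, so the layout looks exactly like the original while the text stays searchable.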
