Macro to clean up hyphens in OCR documents

  • Thread starter: Steve Wright

Steve Wright

I am struggling with a problem that I suspect other people have had. I have
many large documents that were converted to Word via scanning and OCR. Most
of these documents contain numerous misplaced hyphens, due to hyphens at ends
of lines in the original source. So there are errors like "mecha-nism,"
"be-came," etc. I'm aware that this can be dealt with during the conversion
process but I wasn't involved in that.

I've asked around locally and people have advised me to either start over
from scratch and rescan everything or fix the errors one by one. However, I
am convinced that this could be solved in batch mode with a macro that looks
for words with embedded hyphens and invokes the spellchecker. I have never
written a macro before, so before I start learning I thought I would look to
see if anyone else had done this.

Has anyone dealt with this situation?
 
Why don't you just use Find/Replace to delete all the hyphens? (Press
Ctrl+H, type a hyphen in the upper box, leave the lower box empty, and
click Replace All.)
 
I did consider doing that, and tried it with one of the documents.
Unfortunately these documents also contain a lot of hyphens that are supposed
to be there -- phrases like "tape-recorded," "split-second," etc. There are
about as many "good" hyphens as "bad" ones.
 
A macro would have the same problem distinguishing between the required
hyphens and the ones to be discarded. It would be quicker to use Replace and
page through them one at a time.

--
<>>< ><<> ><<> <>>< ><<> <>>< <>><<>
Graham Mayor - Word MVP

My web site www.gmayor.com

<>>< ><<> ><<> <>>< ><<> <>>< <>><<>
 
I'm wondering if it would be possible with a macro that invokes the
spellchecker on hyphenated words. "Tape-recorded" would pass spellcheck;
"mecha-nism" would not, and the macro would then remove the hyphen.

I have over a thousand pages to do, so I'm not eager to fix these one at a
time.
 
Steve,

This macro should get you most of the way to what you want. I don't have a
big enough sample to test all the possible cases, and I suspect the macro
might be a bit overzealous.

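Sub ZapHyphens()
    Dim oDoc As Document
    Dim oRg As Range
    Dim spErrs As ProofreadingErrors
    Dim idx As Long
    Dim oldOpt As Boolean

    ' Remember the user's setting, then turn on check-as-you-type
    oldOpt = Options.CheckSpellingAsYouType
    Options.CheckSpellingAsYouType = True

    Set oDoc = ActiveDocument
    oDoc.SpellingChecked = False    ' force a fresh spelling check
    Set spErrs = oDoc.SpellingErrors

    ' Pass 1: strip hyphens that fall inside a flagged range.
    ' Loop backward so edits don't disturb errors not yet visited.
    For idx = spErrs.Count To 1 Step -1
        Set oRg = spErrs(idx)
        oRg.Text = Replace(oRg.Text, "-", "")
    Next idx

    ' Pass 2: delete a hyphen sitting just before or just after a flagged range
    For idx = spErrs.Count To 1 Step -1
        Set oRg = spErrs(idx)
        If (oRg.Characters.First.Previous = "-") Then
            oRg.Characters.First.Previous.Delete
        ElseIf (oRg.Characters.Last.Next = "-") Then
            oRg.Characters.Last.Next.Delete
        End If
    Next idx

    ' Restore the user's setting and mark the document for rechecking
    Options.CheckSpellingAsYouType = oldOpt
    oDoc.SpellingChecked = False
End Sub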

The second For loop is required because Word sometimes recognizes the
misspelled "words" on either side of a hyphen but doesn't include the hyphen
itself in the range of the spelling error. For example, it sees
'tape-recor-ded' as containing two errors for 'recor' and 'ded' instead of
one error for 'recor-ded'. The loop eliminates those hyphens.

The problem is that if there is an actual misspelling preceded or followed
by a legitimate hyphen, the macro removes the hyphen. In this example,
'tape-recoir-ded' is changed to 'taperecoirded'. It might be possible to fix
this with some additional complication of the macro.
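
One possible shape for that extra complication, as an untested sketch rather than a finished macro: visit each hyphen, look at the run of letters on either side, and delete the hyphen only when at least one half fails the spelling check and the joined word passes. The name ZapHyphensCarefully and the LETTERS constant are made up for illustration, and the sketch assumes plain unaccented English text.

```vba
Sub ZapHyphensCarefully()
    ' Hypothetical helper: only A-Z/a-z count as word characters here
    Const LETTERS As String = _
        "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz"
    Dim oRg As Range
    Dim oLeft As Range
    Dim oRight As Range

    Set oRg = ActiveDocument.Content
    With oRg.Find
        .ClearFormatting
        .Text = "-"
        .MatchWildcards = False
        .Forward = True
        .Wrap = wdFindStop
        Do While .Execute
            ' Letters immediately before the hyphen
            Set oLeft = oRg.Duplicate
            oLeft.Collapse wdCollapseStart
            oLeft.MoveStartWhile Cset:=LETTERS, Count:=wdBackward
            ' Letters immediately after the hyphen
            Set oRight = oRg.Duplicate
            oRight.Collapse wdCollapseEnd
            oRight.MoveEndWhile Cset:=LETTERS, Count:=wdForward

            If Len(oLeft.Text) > 0 And Len(oRight.Text) > 0 Then
                ' Leave the hyphen alone if both halves are real words
                If Not (Application.CheckSpelling(oLeft.Text) And _
                        Application.CheckSpelling(oRight.Text)) Then
                    ' Join only when the joined word is itself valid
                    If Application.CheckSpelling(oLeft.Text & oRight.Text) Then
                        oRg.Delete
                    End If
                End If
            End If
            oRg.Collapse wdCollapseEnd
        Loop
    End With
End Sub
```

With this rule 'tape-recoir-ded' should be left untouched, because neither 'taperecoir' nor 'recoirded' is a word, while 'tape-recor-ded' should come out as 'tape-recorded'. The trade-off is that a bad hyphen between two halves that are both valid words on their own (for example 'be-came') is also left in place, so those still need a manual pass.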

Another problem that you might run into with such a large document is that
the spelling checker may decide at some point that there are too many
corrections, or that the document is too complex, and throw an error
message. The only cure I know for that is to divide the document into
smaller parts and run the macro against each one separately.

--
Regards,
Jay Freedman
Microsoft Word MVP
Email cannot be acknowledged; please post all follow-ups to the newsgroup so
all may benefit.
 
WOW. This macro is amazing! It does exactly what I wanted. I just tested
it on a 150-page document, and it cleaned up the hyphen problems nicely!

Thanks so much! I really appreciate this!
 
You're welcome. But pay attention to the caveat I wrote after the macro; you
will have to rerun the spell-check and at least skim some of the document,
because the macro will delete some hyphens that it shouldn't.
 
I was wondering how you are getting on with that macro?

I have a different approach using multiple macros.
First I prep the page with a macro that adds a pipe character ("|") to the end of each line.
Then I run a macro that searches for each "-|" (hyphen-pipe) combination and replaces it with nothing, also deleting the space and line break, so the two halves of the hyphenated word are joined.

cheers
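
For anyone who wants to try the line-end idea, here is a rough sketch under one big assumption: every line of the scanned text ends in a paragraph mark. If that holds, the pipe step can be skipped and the search can look for a hyphen followed by the paragraph mark directly. The name JoinLineEndHyphens is made up for illustration. If the lines end in manual line breaks instead, "^l" would be the code to search for rather than "^p".

```vba
Sub JoinLineEndHyphens()
    ' Assumes each scanned line ends with a paragraph mark (^p)
    With ActiveDocument.Content.Find
        .ClearFormatting
        .Replacement.ClearFormatting
        .Text = "-^p"              ' hyphen immediately before the line end
        .Replacement.Text = ""     ' remove both, joining the two halves
        .MatchWildcards = False
        .Forward = True
        .Wrap = wdFindStop
        .Execute Replace:=wdReplaceAll
    End With
End Sub
```

This also joins a legitimate compound that happened to break at its hyphen (for example "tape-" at the end of one line and "recorded" on the next), so a spelling check afterwards is still worthwhile.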
 