Word Parsing based on Selection.Style

STeve

Hey guys,

I currently have a 100-page Word document filled with various
"articles". The articles are delimited by the style of the text
(e.g. Heading 1 for the titles), and each one will be converted
to HTML and saved. I want to write a parser in VB.NET that uses the
Word object model and was wondering how this could be achieved. The
problem I am running into is that I cannot test whether the selected
text has a certain style. My code so far in VB.NET:

Dim dc As Word.Document
Dim w As Word.Application
w = New Word.Application()

Dim arguments As String() = Environment.GetCommandLineArgs()

dc = w.Documents.Open("c:\static\test.doc")
w.Visible = True

' Select the whole document: jump to the end, then extend back to the start.
w.Selection.EndKey(Word.WdUnits.wdStory)
w.Selection.HomeKey(Word.WdUnits.wdStory, Word.WdMovementType.wdExtend)

Dim count As Integer
count = w.Selection.Range.ComputeStatistics(Word.WdStatistic.wdStatisticLines)

Dim i As Integer

For i = 0 To 2
w.Selection.HomeKey(Word.WdUnits.wdLine)

' This style test is the part I cannot get to work.
If w.Selection.Style = "Heading 3" Then
MsgBox("heading 3")
End If

w.Selection.MoveDown(Word.WdUnits.wdLine, 1)
Next



I currently have something like this as a macro in the Word document.
Note that this is just a prototype. saveHTML simply copies the selected
text from one document, pastes it into a new document and saves that
content as an HTML file, because Microsoft Word cannot "save to HTML"
a highlighted selection:

Sub ParseDocument()
Selection.HomeKey Unit:=wdStory
Selection.EndKey Unit:=wdStory
Selection.HomeKey Unit:=wdStory, Extend:=wdExtend
count1 = Selection.Range.ComputeStatistics(Statistic:=wdStatisticLines)

For x = 0 To count1
Selection.HomeKey Unit:=wdLine

If Selection.Style = "Heading 3" Then
saveHTML (file)
ElseIf Selection.Style = "Heading 2" Then
Selection.EndKey Unit:=wdLine, Extend:=wdExtend
z = Selection.Text
End If

If Selection.Style = "Heading 1" Then
file = path + y
saveHTML (file)

Selection.EndKey Unit:=wdLine, Extend:=wdExtend
y = Selection.Text

y = Replace(y, Chr(10), "")
y = Replace(y, Chr(13), "")
y = Replace(y, Chr(12), "")

y1 = y
End If

Selection.MoveDown Unit:=wdLine, count:=1
Next

file = path + y
saveHTML (file)
End Sub

Any help is appreciated, thanks guys
Steve
 
Hi Steve,

The easiest way to check the styles in the document is to iterate through
each paragraph in the document, rather than each line. I believe that a
style applies to the entire paragraph, so it shouldn't be possible for
different lines in the same paragraph to have different styles.

Also, although it's not too important, it's usually preferable to use the
Range object instead of the Selection object when using the Word object
model -- the Range object gives you greater options, and is invisible
(doesn't cause screen flicker.)

Here's some quick code that iterates through each paragraph and prints the
style name:

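Sub Test()

Dim p As Word.Paragraph

' w is the Word.Application instance from your code.
For Each p In w.ActiveDocument.Paragraphs
' Style comes back as an Object; cast it to Word.Style to read its name.
Debug.WriteLine(CType(p.Style, Word.Style).NameLocal & ": " & p.Range.Text)
Next p

End Sub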

Does this help? If not, let me know more specifically where you're having
trouble.
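
For the original question (testing for a particular style from VB.NET), a minimal sketch could look like the one below. It assumes a project reference to the Microsoft Word interop assembly and reuses the file path from the first post; the helper name StyleNameOf is made up for illustration. The idea is that the Style property is typed as Object in the interop, so the value is cast to Word.Style and its NameLocal is compared with the style name.

```vbnet
Imports System
Imports Word = Microsoft.Office.Interop.Word

Module StyleCheck

    ' Hypothetical helper: returns the style name for a value read from
    ' Selection.Style, Range.Style or Paragraph.Style (all typed as Object).
    Function StyleNameOf(ByVal styleValue As Object) As String
        Dim s As Word.Style = TryCast(styleValue, Word.Style)
        If s Is Nothing Then Return "" ' no single style could be read
        Return s.NameLocal
    End Function

    Sub Main()
        Dim w As New Word.Application()
        Dim doc As Word.Document = w.Documents.Open("c:\static\test.doc")

        ' Print every paragraph that uses the Heading 3 style.
        For Each p As Word.Paragraph In doc.Paragraphs
            If StyleNameOf(p.Style) = "Heading 3" Then
                Console.WriteLine("Heading 3: " & p.Range.Text)
            End If
        Next

        doc.Close(False)
        w.Quit()
    End Sub

End Module
```

Note that NameLocal is the name as shown in the local language version of Word, so "Heading 3" is an assumption that holds for an English installation.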





 
You might try this article. Although it transforms Word documents to XML
based on styles, it will be trivial to change it to output HTML instead.
Alternatively, consider outputting XML and using a stylesheet to transform
the result to HTML--the resulting XML document is reusable, and you'll find
that's a lot more flexible when you want to make changes to the HTML.

http://www.devx.com/dotnet/Article/17358
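
To give a rough idea of the "XML plus stylesheet" route, here is a small sketch that is independent of the linked article. The <article>, <title> and <para> element names and the stylesheet are invented for the example, and it uses XslCompiledTransform from System.Xml.Xsl, which needs .NET 2.0 or later.

```vbnet
Imports System
Imports System.IO
Imports System.Xml
Imports System.Xml.Xsl

Module XmlToHtml

    ' Hypothetical stylesheet for a made-up <article> format.
    Const Stylesheet As String = _
        "<xsl:stylesheet version='1.0' " & _
        "xmlns:xsl='http://www.w3.org/1999/XSL/Transform'>" & _
        "<xsl:output method='html'/>" & _
        "<xsl:template match='/article'>" & _
        "<html><body>" & _
        "<h1><xsl:value-of select='title'/></h1>" & _
        "<xsl:for-each select='para'><p><xsl:value-of select='.'/></p></xsl:for-each>" & _
        "</body></html>" & _
        "</xsl:template>" & _
        "</xsl:stylesheet>"

    ' Applies the stylesheet to one article and returns the HTML text.
    Function ToHtml(ByVal articleXml As String) As String
        Dim transformer As New XslCompiledTransform()
        Using styleReader As XmlReader = XmlReader.Create(New StringReader(Stylesheet))
            transformer.Load(styleReader)
        End Using

        Dim html As New StringWriter()
        Using source As XmlReader = XmlReader.Create(New StringReader(articleXml))
            transformer.Transform(source, Nothing, html)
        End Using
        Return html.ToString()
    End Function

    Sub Main()
        Dim sample As String = _
            "<article><title>Heading1</title><para>para1</para><para>para2</para></article>"
        Console.WriteLine(ToHtml(sample))
    End Sub

End Module
```

With this split, a change to the HTML layout only touches the stylesheet, not the code that reads the Word document.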
 
Hey Robert,

Thanks for the quick reply and assistance. Unfortunately this doesn't
help me much. I tried the code, but it simply parses the document
paragraph by paragraph, which doesn't necessarily work in my
situation. For example, say I have a document that looks like
this:

Heading1

para1
para2
para3

Heading3
Heading2

para1
para2

Heading1

para1
para2
para3

The parsing macro I wrote before simply parses this document line by
line, looking for the styles Heading 1 and Heading 3. When it comes
across a new heading, I do a Selection.HomeKey back to the beginning
of the document, cut that article out, paste it into a new document
and save that article as HTML. So the first article would be saved
with the filename Heading1:

Heading1

para1
para2
para3

The next article would be (filename would be Heading3_Heading2):

Heading3
Heading2

para1
para2

and so on...

What I am thinking of doing in VB.NET is using Selection.Find first
on the "Heading 1" style, cutting that entire "article" out and
pasting it into a new document, then doing another Selection.Find on
"Heading 3", cutting and pasting that article into a new document,
and finally saving it as HTML. Is there a more efficient or elegant
way of doing this? Thanks for your time, guys.

Thanks in advance
Steve
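
The splitting rule described above can be tried out without Word at all. The sketch below is only a model of it: paragraphs are reduced to a style name plus text, and the assumed rules are that Heading 1 or Heading 3 starts a new article and that a Heading 2 inside an article is appended to its file name. The Para, Article and SplitArticles names are made up for the example.

```vbnet
Imports System
Imports System.Collections.Generic

Module ArticleSplitter

    ' One paragraph reduced to the two things the splitter needs.
    Public Class Para
        Public StyleName As String
        Public Text As String
        Public Sub New(ByVal style As String, ByVal content As String)
            StyleName = style
            Text = content
        End Sub
    End Class

    ' One output article: its file name and the paragraphs it contains.
    Public Class Article
        Public FileName As String = ""
        Public Lines As New List(Of String)
    End Class

    ' Heading 1 or Heading 3 starts a new article; a Heading 2 inside an
    ' article is appended to its file name (assumed rules, see above).
    Public Function SplitArticles(ByVal paras As IEnumerable(Of Para)) As List(Of Article)
        Dim result As New List(Of Article)
        Dim current As Article = Nothing

        For Each p As Para In paras
            If p.StyleName = "Heading 1" OrElse p.StyleName = "Heading 3" Then
                current = New Article()
                current.FileName = p.Text
                result.Add(current)
            ElseIf p.StyleName = "Heading 2" AndAlso current IsNot Nothing Then
                current.FileName &= "_" & p.Text
            End If

            ' Text before the first heading belongs to no article and is skipped.
            If current IsNot Nothing Then current.Lines.Add(p.Text)
        Next

        Return result
    End Function

    Sub Main()
        Dim doc As New List(Of Para)
        doc.Add(New Para("Heading 1", "Heading1"))
        doc.Add(New Para("Normal", "para1"))
        doc.Add(New Para("Heading 3", "Heading3"))
        doc.Add(New Para("Heading 2", "Heading2"))
        doc.Add(New Para("Normal", "para1"))

        ' Prints "Heading1: 2 paragraph(s)" and "Heading3_Heading2: 3 paragraph(s)".
        For Each a As Article In SplitArticles(doc)
            Console.WriteLine(a.FileName & ": " & a.Lines.Count & " paragraph(s)")
        Next
    End Sub

End Module
```

A single pass like this avoids jumping back to the start of the document for every heading; the Word-specific part is then limited to reading the style names and saving each article.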
 
Well, there isn't a single best way to do this. In my experience, the Find
object is a little difficult to use. My approach would be a bit
different -- iterating through each paragraph, testing the paragraph's style
using the Style property, and using a Range object to copy a story. Here's
some pseudo code:


Dim StoryStart, StoryEnd As Integer
Dim CurrentParagraph As Word.Paragraph
Dim StoryRange As Word.Range

' Iterate through each paragraph in the document.
For Each CurrentParagraph In w.ActiveDocument.Paragraphs

If CurrentParagraph.Style = (the start of a new story) Then
' Define a Range object for the previous story and copy it to the clipboard.
StoryEnd = NextParagraph.Start - 1
StoryRange = w.ActiveDocument.Range(CObj(StoryStart), CObj(StoryEnd))
StoryRange.Copy()
(Paste the story into a new document)
' Reset the StoryStart counter to the start of the next story.
StoryStart = CurrentParagraph.Range.Start
End If

Next CurrentParagraph


This is just air code, but hopefully gives you the idea. The key line is
the "If" test -- you need to insert some code to detect whether the
paragraph is the start of a new story.

Hope this helps.
 
Correction... the line

StoryEnd = NextParagraph.Start - 1

Should be

StoryEnd = CurrentParagraph.Range.Start - 1
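
Filled in a little further, the air code above might end up like the sketch below. Treat it as a starting point under a few assumptions rather than tested production code: Heading 1 and Heading 3 are taken as the story starts, the output names (c:\static\story1.htm and so on) are placeholders, and the story is transferred with Range.FormattedText instead of the clipboard. As far as I know the end position of a Range is exclusive, so the sketch ends each story at the start of the next heading without subtracting 1, which keeps the last paragraph mark.

```vbnet
Imports Word = Microsoft.Office.Interop.Word

Module StorySplitter

    ' Assumption: these two styles mark the start of a new story.
    Function StartsStory(ByVal p As Word.Paragraph) As Boolean
        Dim s As Word.Style = TryCast(p.Style, Word.Style)
        If s Is Nothing Then Return False
        Return s.NameLocal = "Heading 1" OrElse s.NameLocal = "Heading 3"
    End Function

    ' Copies the characters from startPos up to endPos into a new
    ' document and saves that document as HTML.
    Sub SaveStory(ByVal w As Word.Application, ByVal source As Word.Document, _
                  ByVal startPos As Integer, ByVal endPos As Integer, _
                  ByVal fileName As String)
        If endPos <= startPos Then Return
        Dim story As Word.Range = source.Range(CObj(startPos), CObj(endPos))
        Dim target As Word.Document = w.Documents.Add()
        target.Content.FormattedText = story.FormattedText
        target.SaveAs(FileName:=CObj(fileName), _
                      FileFormat:=Word.WdSaveFormat.wdFormatHTML)
        target.Close(False)
    End Sub

    Sub Main()
        Dim w As New Word.Application()
        Dim doc As Word.Document = w.Documents.Open("c:\static\test.doc")
        Dim storyStart As Integer = -1 ' -1 means no story has started yet
        Dim n As Integer = 0

        For Each p As Word.Paragraph In doc.Paragraphs
            If StartsStory(p) Then
                ' A new heading closes the previous story, if there was one.
                If storyStart >= 0 Then
                    n += 1
                    SaveStory(w, doc, storyStart, p.Range.Start, _
                              "c:\static\story" & n & ".htm")
                End If
                storyStart = p.Range.Start
            End If
        Next

        ' The last story runs to the end of the document.
        If storyStart >= 0 Then
            n += 1
            SaveStory(w, doc, storyStart, doc.Content.End, _
                      "c:\static\story" & n & ".htm")
        End If

        doc.Close(False)
        w.Quit()
    End Sub

End Module
```

Building the file name from the heading text, as in the original macro, would replace the numbered placeholder names.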
 