Reading XML Documents

Just Me · Dec 31, 2006

Does anyone have code, or can anyone point me to useful snippets, that would
let me traverse the XML elements of an XmlDocument?

What I want to do is move through the entire document and, when I hit
<table>, <tr>, or <td>, clear the attributes of those elements, and also
remove all elements NOT nested inside a <table>.

I'm sure my head is going to explode soon :-(

Cheers

Stephany Young · Dec 31, 2006

Now that you're starting to get on the right track, you will find that the
documentation on the XmlDocument class has extensive examples of what you
are looking for.

Assuming (and that might be dangerous) that the HTML source you are
dealing with is 'well-formed' in terms of XML, you can quite happily
deal with the source as an XmlDocument.

If it is not 'well-formed', then you are back to square one.
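
One way to find out which case applies is simply to try loading the markup and catch the parser's exception. The following is a minimal sketch, not code from the thread: the module name, the helper name IsWellFormedXml, and the sample strings are made up for illustration. It relies on XmlDocument.LoadXml throwing XmlException for input that is not well-formed.

```vbnet
Imports System
Imports System.Xml

Module WellFormedCheck
    ' Hypothetical helper: returns True when the markup parses as XML.
    Function IsWellFormedXml(ByVal markup As String) As Boolean
        Dim doc As New XmlDocument()
        Try
            doc.LoadXml(markup)
            Return True
        Catch ex As XmlException
            ' The parser rejected the input, so it is not well-formed XML.
            Return False
        End Try
    End Function

    Sub Main()
        ' Properly nested and closed elements.
        Console.WriteLine(IsWellFormedXml("<table><tr><td>1</td></tr></table>"))
        ' The <td> is never closed, as often happens in hand-written HTML.
        Console.WriteLine(IsWellFormedXml("<table><tr><td>1</tr></table>"))
    End Sub
End Module
```

Typical HTML habits such as unclosed <br> tags or unquoted attribute values would also make the check fail, which is the "back to square one" case.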

Martin Honnen · Jan 1, 2007

Just Me said:
Does anyone have code, or can anyone point me to useful snippets, that would
let me traverse the XML elements of an XmlDocument?

What I want to do is move through the entire document and, when I hit
<table>, <tr>, or <td>, clear the attributes of those elements, and also
remove all elements NOT nested inside a <table>.

If you use System.Xml.XmlDocument and SelectNodes, then you have a
powerful tool to select the nodes you are looking for, and you can then use
the DOM methods to remove them. An example that removes all attributes on
table, tr, and td elements looks like this:

Imports System
Imports System.Xml

Module RemoveTableAttributes
    Sub Main()
        Dim XmlDoc As XmlDocument = New XmlDocument
        XmlDoc.Load("XMLFile1.xml")
        Console.WriteLine("Initial Document:")
        XmlDoc.Save(Console.Out)
        Console.WriteLine()
        ' Select every attribute on table, tr, and td elements.
        Dim AttributesToRemove As XmlNodeList = _
            XmlDoc.SelectNodes("//table/@* | //tr/@* | //td/@*")
        ' Walk the list backwards while removing the attributes.
        For I As Integer = AttributesToRemove.Count - 1 To 0 Step -1
            Dim Attribute As XmlAttribute = _
                CType(AttributesToRemove(I), XmlAttribute)
            Attribute.OwnerElement.RemoveAttributeNode(Attribute)
        Next

        Console.WriteLine("Changed document:")
        XmlDoc.Save(Console.Out)
    End Sub
End Module

Example output:

Initial Document:
<?xml version="1.0" encoding="ibm850"?>
<html lang="en">
  <head>
    <title>Example</title>
  </head>
  <body>
    <table border="1" class="some-class" id="t1">
      <tbody>
        <tr class="odd">
          <td id="cell1">
          </td>
        </tr>
      </tbody>
    </table>
  </body>
</html>
Changed document:
<?xml version="1.0" encoding="ibm850"?>
<html lang="en">
  <head>
    <title>Example</title>
  </head>
  <body>
    <table>
      <tbody>
        <tr>
          <td>
          </td>
        </tr>
      </tbody>
    </table>
  </body>
</html>

An XPath expression that selects all elements inside the document body
that are not nested in a table is, for example,
/html/body//*[not(ancestor-or-self::table)]
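
To turn that expression into the removal step, the same backwards loop as in the attribute example can be used. The following is only a sketch under stated assumptions: the module and method names and the sample markup are invented for illustration, and the elements are assumed to be in no namespace, as in the example document. It also adds one condition that is not in the expression above, not(descendant::table), because the expression as given also matches wrapper elements that contain a table, and removing such a wrapper would remove the table with it.

```vbnet
Imports System
Imports System.Xml

Module TableOnlyFilter
    ' Hypothetical helper: removes every element in the body that is neither
    ' a table, nor inside a table, nor a container of a table.
    Sub RemoveElementsOutsideTables(ByVal doc As XmlDocument)
        Dim candidates As XmlNodeList = doc.SelectNodes( _
            "/html/body//*[not(ancestor-or-self::table) and not(descendant::table)]")
        ' Reading Count first collects the full list; walking it backwards
        ' removes nested elements before the elements that contain them.
        For i As Integer = candidates.Count - 1 To 0 Step -1
            Dim node As XmlNode = candidates(i)
            If node.ParentNode IsNot Nothing Then
                node.ParentNode.RemoveChild(node)
            End If
        Next
    End Sub

    Sub Main()
        Dim doc As New XmlDocument()
        doc.LoadXml("<html><body><p>intro</p><div><span>x</span>" & _
                    "<table><tr><td><b>keep</b></td></tr></table></div></body></html>")
        RemoveElementsOutsideTables(doc)
        ' The p and span elements are gone; the div survives because it holds the table.
        Console.WriteLine(doc.OuterXml)
    End Sub
End Module
```

Only elements are removed here, as asked in the question; text sitting directly in the body or in a surviving wrapper element stays in place and would need a separate selection such as text() nodes.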

Just Me · Jan 1, 2007

Brilliant, Martin!

Thanks for this post. I have tried it out and it works a treat, so I can use
this.

If I may, one more question?

The approach I was taking was to read the XML document into a stream,
instantiate an XmlReader, and use the XmlReader.Read method to go through the
document. The problem I had was that although I could cycle through the
nodes, I couldn't work out how to read the current node into an XmlNode from
the XmlReader. It doesn't seem possible.

Any idea what would be the best approach for this?

Many thanks

Martin Honnen · Jan 2, 2007

Just Me said:
The approach I was taking was to read the XML document into a stream,
instantiate an XmlReader, and use the XmlReader.Read method to go through the
document. The problem I had was that although I could cycle through the
nodes, I couldn't work out how to read the current node into an XmlNode from
the XmlReader. It doesn't seem possible.

Any idea what would be the best approach for this?

It is not clear what you want to do. If you want to load your complete
XML document into a System.Xml.XmlDocument instance, then simply use the
Load method and pass in a file name or URL. There is no need to use an
XmlReader explicitly.

If you have both an XmlDocument instance and an XmlReader and you want
to import data from the reader into the document then you can use the
ReadNode method
<http://msdn2.microsoft.com/en-us/library/system.xml.xmldocument.readnode.aspx>
to create a node owned by the XmlDocument instance from the node the
reader is positioned on. The XmlNode returned from ReadNode can then be
inserted into the XmlDocument instance with e.g. AppendChild or
InsertBefore called on the intended parent node.
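
A minimal sketch of that ReadNode approach might look like the following. The module and function names, the <rows/> document, and the sample fragment are all assumptions chosen for illustration; only ReadNode, AppendChild, and the standard XmlReader members come from the framework.

```vbnet
Imports System
Imports System.IO
Imports System.Xml

Module ReadNodeSketch
    ' Hypothetical helper: reads the first element of the markup through an
    ' XmlReader and returns it as a node owned by the given document.
    Function ImportFirstElement(ByVal doc As XmlDocument, ByVal markup As String) As XmlNode
        Dim reader As XmlReader = XmlReader.Create(New StringReader(markup))
        Try
            ' Position the reader on the first content node (the element).
            reader.MoveToContent()
            ' ReadNode builds an XmlNode owned by doc from the reader's current node.
            Return doc.ReadNode(reader)
        Finally
            reader.Close()
        End Try
    End Function

    Sub Main()
        Dim doc As New XmlDocument()
        doc.LoadXml("<rows/>")
        Dim imported As XmlNode = ImportFirstElement(doc, "<tr><td>1</td></tr>")
        ' The node is owned by doc but not yet part of its tree; insert it.
        doc.DocumentElement.AppendChild(imported)
        Console.WriteLine(doc.OuterXml)
    End Sub
End Module
```

Because the node returned by ReadNode already belongs to the target document, it can be appended directly without a separate ImportNode call.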

Just Me · Jan 2, 2007

OK, thanks again for your help.
