XPath parsing of HTML files

  • Thread starter: Thomas Wieczorek

Thomas Wieczorek

Hello!

I have to extract various product data from online shops, e.g. price,
title, description. I planned to create a form with a combobox to
choose the shop from, a Microsoft WebBrowser control to display the
sites and some labels/textboxes to display the extracted data. It is
quite simple so far.
Every shop has a different structure, so I thought I could use XPath
expressions to point to the HTML tags where I can find the product
data. It sounded easy to me, but System.Xml.XPath doesn't work on
documents that are not well-formed XML. So I added the HTMLTidy DLL to
the project to convert the HTML file to XHTML. XPathDocument accepts
the XHTML file now, but the XPathNavigator does not return anything.
My code:

<code>
' Test site is at AxWebBrowser.LocationURL =
' https://www.secomp.de/ishopWebFront.../para/node/is//and/product/is/15.20.3024.html
Private Sub parseValue()
    Dim xpathDoc As XPathDocument
    Dim xpathNav As XPathNavigator
    Dim xpathNode As XPathNodeIterator
    Dim xmlfile As IO.StreamReader
    Dim tidyDoc As Tidy.Document
    Dim iErrCode As Integer
    Dim sXPath As String

    Try
        downloadFile(AxWebBrowser.LocationURL, "temp.xml")
        tidyDoc = New Tidy.Document
        iErrCode = tidyDoc.ParseFile("temp.xml")
        If iErrCode < 0 Then
            Throw New Exception("Couldn't parse the file.")
        Else
            ' Ask Tidy for indented XHTML output.
            tidyDoc.SetOptBool(TidyATL.TidyOptionId.TidyXhtmlOut, 1)
            tidyDoc.SetOptInt(TidyATL.TidyOptionId.TidyIndentContent, 2)
            tidyDoc.SetOptInt(TidyATL.TidyOptionId.TidyIndentSpaces, 4)
            iErrCode = tidyDoc.CleanAndRepair
            If iErrCode < 0 Then
                Throw New Exception("Couldn't repair the file.")
            Else
                tidyDoc.SaveFile("temp.xml")
                xmlfile = New IO.StreamReader("temp.xml")
                xpathDoc = New XPathDocument(xmlfile)
                ' Close the reader so the Finally block can delete the file.
                xmlfile.Close()
                xpathNav = xpathDoc.CreateNavigator
                sXPath = "/html/body/div/div[5]/div/span/text()"

                'xpathNode = xpathNav.Evaluate(sXPath)
                'Dim expr As XPathExpression
                'expr = xpathNav.Compile(sXPath)
                'xpathNode = xpathNav.Select(expr)

                xpathNode = xpathNav.Select(sXPath)

                If xpathNode.MoveNext() Then
                    MsgBox(xpathNode.Current.Value)
                End If
            End If
        End If

    Catch ex As Exception
        MsgBox("Error:" + ex.Message)
    Finally
        IO.File.Delete("temp.xml")
    End Try
End Sub
</code>

As you can see, I tried XPathNavigator's methods Compile, Evaluate and
Select, but none of them returns anything. I tried the same XPath
expression on the same document in jEdit with the XSLT/XPath plugin,
and it returns exactly what I want.
My downloadFile function looks like this:
<code>
Private Sub downloadFile(ByVal url As String, ByVal filename As String)
    Dim wr As System.Net.HttpWebRequest = _
        CType(System.Net.WebRequest.Create(url), System.Net.HttpWebRequest)
    Dim ws As System.Net.HttpWebResponse = _
        CType(wr.GetResponse(), System.Net.HttpWebResponse)
    Dim str As System.IO.Stream = ws.GetResponseStream()
    ' FileMode.Create truncates an existing file, so no stale bytes remain.
    Dim fstr As New System.IO.FileStream(filename, _
        System.IO.FileMode.Create, System.IO.FileAccess.Write)
    Dim inBuf(8191) As Byte
    Try
        ' Copy in chunks so pages larger than the buffer are not cut off.
        Dim n As Integer = str.Read(inBuf, 0, inBuf.Length)
        While n > 0
            fstr.Write(inBuf, 0, n)
            n = str.Read(inBuf, 0, inBuf.Length)
        End While
    Finally
        fstr.Close()
        str.Close()
        ws.Close()
    End Try
End Sub
</code>

I am a trainee programmer (still in vocational training) and would be
grateful for any answer.

Regards,

Thomas
 
Thomas said:
> tidyDoc.SaveFile("temp.xml")
> xmlfile = New IO.StreamReader("temp.xml")
> xpathDoc = New XPathDocument(xmlfile)
> xpathNav = xpathDoc.CreateNavigator
> sXPath = "/html/body/div/div[5]/div/span/text()"
>
> 'xpathNode = xpathNav.Evaluate(sXPath)
> 'Dim expr As XPathExpression
> 'expr = xpathNav.Compile(sXPath)
> 'xpathNode = xpathNav.Select(expr)
>
> xpathNode = xpathNav.Select(sXPath)

Just guessing what might happen: if Tidy creates an XHTML document, then
all XHTML elements are in the namespace http://www.w3.org/1999/xhtml. In
XPath 1.0 an unprefixed name such as "html" only matches elements in no
namespace, so your XPath needs to bind a prefix to the namespace to
address such elements, e.g.
Dim nsmgr As XmlNamespaceManager = _
    New XmlNamespaceManager(xpathNav.NameTable)
nsmgr.AddNamespace("x", "http://www.w3.org/1999/xhtml")
Dim nodeIterator As XPathNodeIterator = _
    xpathNav.Select("/x:html/x:body/x:div/x:div[5]/x:div/x:span/text()", nsmgr)

Try that. If it does not help, then please show us the relevant XHTML
Tidy creates, and we can suggest the proper XPath expressions.
 
Thank you for your reply!

> Just guessing what might happen: If Tidy creates an XHTML document then
> all XHTML elements are in the namespace http://www.w3.org/1999/xhtml and
> your XPath needs to bind a prefix to the namespace to address such
> elements e.g.

Right, a new DOCTYPE and a new namespace are added:
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN"
"http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">

> Dim nsmgr As XmlNamespaceManager = _
>     New XmlNamespaceManager(xpathNav.NameTable)
> nsmgr.AddNamespace("x", "http://www.w3.org/1999/xhtml")
> Dim nodeIterator As XPathNodeIterator = _
>     xpathNav.Select("/x:html/x:body/x:div/x:div[5]/x:div/x:span/text()", nsmgr)
>
> Try that, if it does not help then please show us the relevant XHTML
> tidy creates, then we can suggest the proper XPath expressions.

I tried it, but XPathNavigator.Select() accepts only one argument and
no XmlNamespaceManager. If I use your prefixed expression, I get the
error message "Namespace Manager or XsltContext needed", because I can
only call it this way, without the nsmgr:
Dim nodeIterator As XPathNodeIterator = _
    xpathNav.Select("/x:html/x:body/x:div/x:div[5]/x:div/x:span/text()")

When I searched for the error message, I found that you can use
XmlDocument.DocumentElement.SelectNodes(xpath As String, nsmgr As
XmlNamespaceManager) to select an XmlNodeList. So I tried to use
XmlDocument:


Dim reader As XmlTextReader = New XmlTextReader("temp.xml")
Dim xmlDoc As XmlDocument = New XmlDocument
Dim xnod As XmlNode
Dim nodeList As XmlNodeList
' Use the document's own name table for the namespace manager.
Dim nsmgr As XmlNamespaceManager = _
    New XmlNamespaceManager(xmlDoc.NameTable)

reader.WhitespaceHandling = WhitespaceHandling.None
nsmgr.AddNamespace("x", "http://www.w3.org/1999/xhtml")

'Takes 3 minutes
xmlDoc.Load(reader)
reader.Close()

xnod = xmlDoc.DocumentElement
sXPath = "/x:html/x:body/x:div/x:div[5]/x:div/x:span/text()"
nodeList = xnod.SelectNodes(sXPath, nsmgr)
For Each node As XmlNode In nodeList
    Console.WriteLine(node.Value)
Next

It works, but it takes up to 3 minutes to load the XHTML file.
Can you suggest a better way? Did I do something wrong?

Regards, Thomas
 
Thomas said:
> I tried it, but XPathNavigator.Select() accepts only one argument and
> no XmlNamespaceManager. If I use it your way with the nsmgr, I'll get
> the error message "Namespace Manager or XsltContext needed". So I can
> call it only that way:
> Dim nodeIterator As XPathNodeIterator = _
>     xpathNav.Select("/x:html/x:body/x:div/x:div[5]/x:div/x:span/text()")

Are you still using .NET 1.x? With .NET 2.0 or later the Select method
has an overload taking the namespace manager as a second argument:
<URL:http://msdn2.microsoft.com/en-us/library/System.Xml.XPath.XPathNavigator.Select.aspx>

If you are using .NET 1.x and want to use such an expression, then you
need to write a few more lines of code: compile the expression first
and attach the namespace manager to it, e.g.
Dim expression As XPathExpression = _
    xpathNav.Compile("/x:html/x:body/x:div/x:div[5]/x:div/x:span/text()")
expression.SetContext(nsmgr)
Dim nodeIterator As XPathNodeIterator = xpathNav.Select(expression)

The code for setting up the namespace manager remains as posted in my
earlier reply.
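
The namespace advice above can be tried in isolation. The following is a minimal sketch, not the poster's actual page: the XHTML string is a made-up stand-in for Tidy's output, and the module and variable names are hypothetical. It uses only the Compile/SetContext route, which should work on .NET 1.x as well as later versions.

```vb
Imports System
Imports System.IO
Imports System.Xml
Imports System.Xml.XPath

Module NamespaceDemo
    Sub Main()
        ' Hypothetical sample standing in for the XHTML that Tidy writes.
        Dim xhtml As String = _
            "<html xmlns=""http://www.w3.org/1999/xhtml"">" & _
            "<body><div><span>19.99</span></div></body></html>"

        Dim doc As New XPathDocument(New StringReader(xhtml))
        Dim nav As XPathNavigator = doc.CreateNavigator()

        ' Unprefixed names only match elements in no namespace,
        ' so this selects nothing in a namespaced XHTML document.
        Dim plain As XPathNodeIterator = nav.Select("/html/body/div/span/text()")
        Console.WriteLine("Unprefixed matches: " & plain.Count.ToString())

        ' Bind the prefix "x" to the XHTML namespace and attach the
        ' namespace manager to the compiled expression.
        Dim nsmgr As New XmlNamespaceManager(nav.NameTable)
        nsmgr.AddNamespace("x", "http://www.w3.org/1999/xhtml")
        Dim expr As XPathExpression = _
            nav.Compile("/x:html/x:body/x:div/x:span/text()")
        expr.SetContext(nsmgr)

        Dim prefixed As XPathNodeIterator = nav.Select(expr)
        While prefixed.MoveNext()
            Console.WriteLine("Prefixed match: " & prefixed.Current.Value)
        End While
    End Sub
End Module
```

The prefix "x" is arbitrary; only the namespace URI it is bound to has to match the one declared in the document.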


> It works, but it takes up to 3 minutes to load the XHTML file.
> Can you suggest a better way? Did I do something wrong?

The DTD is fetched from the W3C web server and processed, that takes
time. You can use an XmlTextReader where you set the XmlResolver
property to Nothing to avoid fetching the DTD.
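
A minimal sketch of that suggestion follows. The helper name LoadXhtml is hypothetical, and "temp.xml" is the Tidy output file from the earlier posts. It also assumes the file contains no named entities (such as &nbsp;) that are only declared in the DTD, since those declarations are no longer read once the DTD is skipped.

```vb
Imports System
Imports System.Xml

Module LoadWithoutDtd
    ' Loads an XHTML file without downloading the DTD named in its DOCTYPE.
    Function LoadXhtml(ByVal path As String) As XmlDocument
        Dim reader As New XmlTextReader(path)
        ' With no resolver the reader does not fetch the external DTD.
        reader.XmlResolver = Nothing

        Dim doc As New XmlDocument()
        Try
            doc.Load(reader)
        Finally
            reader.Close()
        End Try
        Return doc
    End Function

    Sub Main()
        ' "temp.xml" is assumed to exist in the working directory.
        Dim doc As XmlDocument = LoadXhtml("temp.xml")
        Console.WriteLine(doc.DocumentElement.Name)
    End Sub
End Module
```

The same reader can be passed to the XPathDocument constructor instead of XmlDocument.Load if the XPathNavigator approach is preferred.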

As a complete alternative to your current approach (i.e. using Tidy to
create a new XHTML document and then using XPath to access nodes) you
might want to look into the "HTML agility pack"
<URL:http://www.codeplex.com/htmlagilitypack>. That way you can use
"XPath over the HTML document" loaded from the server and don't need Tidy.
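
As a rough illustration of that alternative, here is a sketch that assumes a reference to the HtmlAgilityPack assembly has been added to the project. The markup is a made-up stand-in for a downloaded shop page, and the module name is hypothetical.

```vb
Imports System
Imports HtmlAgilityPack

Module AgilityPackDemo
    Sub Main()
        ' Hypothetical markup; the unclosed <p> means it is not well-formed
        ' XML, so it could not be loaded by XPathDocument without Tidy.
        Dim html As String = _
            "<html><body><div><p>Price:<span>19.99</span></div></body></html>"

        Dim doc As New HtmlDocument()
        doc.LoadHtml(html)

        ' The parsed HTML tree carries no namespaces, so no prefixes are needed.
        Dim nodes As HtmlNodeCollection = doc.DocumentNode.SelectNodes("//span")

        ' SelectNodes returns Nothing when the expression matches no node.
        If Not nodes Is Nothing Then
            For Each node As HtmlNode In nodes
                Console.WriteLine(node.InnerText)
            Next
        End If
    End Sub
End Module
```

Because the library parses the HTML as delivered, an XPath written against the original page structure may differ from one written against Tidy's repaired XHTML.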
 
> Are you still using .NET 1.x? With .NET 2.0 or later the Select method
> has an overload taking the second argument:
> <URL:http://msdn2.microsoft.com/en-us/library/System.Xml.XPath.XPathNaviga...>

Sorry, I forgot to post that I am using .NET 1.1.

> If you are using .NET 1.x and want to use such an expression then you
> need to write some more lines of code e.g.
> Dim expression As XPathExpression = _
>     xpathNav.Compile("/x:html/x:body/x:div/x:div[5]/x:div/x:span/text()")
> expression.SetContext(nsmgr)
> Dim nodeIterator As XPathNodeIterator = xpathNav.Select(expression)
>
> The code for setting up the namespace manager remains as posted in my
> earlier reply.
>
>> It works, but it takes up to 3 minutes to load the XHTML file.
>> Can you suggest a better way? Did I do something wrong?
>
> The DTD is fetched from the W3C web server and processed, that takes
> time. You can use an XmlTextReader where you set the XmlResolver
> property to Nothing to avoid fetching the DTD.

Great, thank you!

> As a complete alternative to your current approach (e.g. using Tidy to
> create a new XHTML document and using then XPath to access nodes) you
> might want to look into the "HTML agility pack"
> <URL:http://www.codeplex.com/htmlagilitypack>, that way you can use
> "XPath over the HTML document" loaded from the server and don't need Tidy.

I will look into it. Thanks!
 