Thomas Wieczorek
Hello!
I have to extract various product data from online shops, e.g. price,
title, description. I planned to create a form with a combobox to
choose the shops from, a Microsoft Browser to display the sites and
some labels/textboxes to display the extracted data. It is quite
simple so far.
Every shop has a different structure, so I thought I could use XPath
expressions to point to the HTML tags where the product data is
located. That sounded easy to me, but System.Xml.XPath does not work
on documents that are not well-formed XML. So I added the HTMLTidy DLL
to the project to convert the HTML file to XHTML. XPathDocument now
accepts the XHTML file, but the XPathNavigator does not return anything.
My code:
<code>
' Test site is at AxWebBrowser.LocationURL =
' https://www.secomp.de/ishopWebFront.../para/node/is//and/product/is/15.20.3024.html
Private Sub parseValue()
    Dim xpathDoc As XPathDocument
    Dim xpathNav As XPathNavigator
    Dim xpathNode As XPathNodeIterator
    Dim xmlfile As IO.StreamReader = Nothing
    Dim tidyDoc As Tidy.Document
    Dim iErrCode As Integer
    Dim sXPath As String
    Try
        downloadFile(AxWebBrowser.LocationURL, "temp.xml")
        tidyDoc = New Tidy.Document
        iErrCode = tidyDoc.ParseFile("temp.xml")
        If iErrCode < 0 Then
            Throw New Exception("Couldn't parse the file.")
        Else
            ' Ask Tidy for indented XHTML output.
            tidyDoc.SetOptBool(TidyATL.TidyOptionId.TidyXhtmlOut, 1)
            tidyDoc.SetOptInt(TidyATL.TidyOptionId.TidyIndentContent, 2)
            tidyDoc.SetOptInt(TidyATL.TidyOptionId.TidyIndentSpaces, 4)
            iErrCode = tidyDoc.CleanAndRepair
            If iErrCode < 0 Then
                Throw New Exception("Couldn't repair the file.")
            Else
                tidyDoc.SaveFile("temp.xml")
                xmlfile = New IO.StreamReader("temp.xml")
                xpathDoc = New XPathDocument(xmlfile)
                xpathNav = xpathDoc.CreateNavigator
                sXPath = "/html/body/div/div[5]/div/span/text()"
                'xpathNode = xpathNav.Evaluate(sXPath)
                'Dim expr As XPathExpression
                'expr = xpathNav.Compile(sXPath)
                'xpathNode = xpathNav.Select(expr)
                xpathNode = xpathNav.Select(sXPath)
                If xpathNode.MoveNext() Then
                    MsgBox(xpathNode.Current.Value)
                End If
            End If
        End If
    Catch ex As Exception
        MsgBox("Error: " & ex.Message)
    Finally
        ' Close the reader first; deleting a file that is still open fails.
        If Not xmlfile Is Nothing Then xmlfile.Close()
        IO.File.Delete("temp.xml")
    End Try
End Sub
</code>
As you can see, I tried XPathNavigator's Compile, Evaluate and Select
methods, but none of them returns anything. I tried the same XPath
expression on the same document in jEdit with the XSLT/XPath plugin,
and there it returns exactly what I want.
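One thing that may be worth checking is the namespace. This is only a guess: it assumes the XHTML written by Tidy declares the default namespace http://www.w3.org/1999/xhtml on the html element, which I have not confirmed for my file. If it does, unprefixed names such as /html/body would match nothing in System.Xml.XPath. The sketch below uses a made-up inline document instead of my real page, and the prefix "x" is an arbitrary choice:
<code>
Imports System
Imports System.IO
Imports System.Xml
Imports System.Xml.XPath

Module XhtmlXPathSketch
    Sub Main()
        ' Hypothetical stand-in for the Tidy output, not the real shop page.
        Dim xhtml As String = _
            "<html xmlns=""http://www.w3.org/1999/xhtml"">" & _
            "<body><div><span>19.90</span></div></body></html>"

        Dim doc As New XPathDocument(New StringReader(xhtml))
        Dim nav As XPathNavigator = doc.CreateNavigator()

        ' Unprefixed names only match elements that are in no namespace.
        Dim plain As XPathNodeIterator = nav.Select("/html/body/div/span/text()")
        Console.WriteLine("Matches without namespace: " & plain.Count)

        ' Bind a prefix to the XHTML namespace and use it in every step.
        Dim ns As New XmlNamespaceManager(nav.NameTable)
        ns.AddNamespace("x", "http://www.w3.org/1999/xhtml")

        Dim expr As XPathExpression = _
            nav.Compile("/x:html/x:body/x:div/x:span/text()")
        expr.SetContext(ns)

        Dim hits As XPathNodeIterator = nav.Select(expr)
        If hits.MoveNext() Then
            Console.WriteLine("Match with namespace: " & hits.Current.Value)
        End If
    End Sub
End Module
</code>
If this turns out to be the cause, the expression in parseValue would need the same prefix on every element step.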
My downloadFile function looks like this:
<code>
Private Sub downloadFile(ByVal url As String, ByVal filename As String)
    Dim wr As System.Net.HttpWebRequest = _
        CType(System.Net.WebRequest.Create(url), System.Net.HttpWebRequest)
    Dim ws As System.Net.HttpWebResponse = _
        CType(wr.GetResponse(), System.Net.HttpWebResponse)
    Dim str As System.IO.Stream = ws.GetResponseStream()
    ' FileMode.Create truncates an existing file, so no stale bytes remain.
    Dim fstr As New System.IO.FileStream(filename, _
        System.IO.FileMode.Create, System.IO.FileAccess.Write)
    Try
        ' Copy in chunks so pages larger than the buffer are not cut off.
        Dim inBuf(8191) As Byte
        Dim n As Integer = str.Read(inBuf, 0, inBuf.Length)
        While n > 0
            fstr.Write(inBuf, 0, n)
            n = str.Read(inBuf, 0, inBuf.Length)
        End While
    Finally
        fstr.Close()
        str.Close()
        ws.Close()
    End Try
End Sub
</code>
I am a trainee programmer in vocational training and would be
grateful for any answer.
Regards,
Thomas