Screen Scraper

kronecker · Aug 12, 2008

A screen scraper is a program that extracts only the text from a web page.
I pinched this one from the web:

Public Class Form1
    Private Sub Form1_Load(ByVal sender As System.Object, _
            ByVal e As System.EventArgs) Handles MyBase.Load
        Me.TextBox1.Multiline = True
        Me.TextBox1.ScrollBars = ScrollBars.Both
        'above only for showing the sample
        Dim Doc As mshtml.IHTMLDocument2
        Doc = New mshtml.HTMLDocumentClass
        Dim wbReq As Net.HttpWebRequest = _
            DirectCast(Net.WebRequest.Create( _
            "http://start.csail.mit.edu/startfarm.cgi?query=USA"), _
            Net.HttpWebRequest)
        Dim wbResp As Net.HttpWebResponse = _
            DirectCast(wbReq.GetResponse(), Net.HttpWebResponse)
        Dim wbHCol As Net.WebHeaderCollection = wbResp.Headers
        Dim myStream As IO.Stream = wbResp.GetResponseStream()
        Dim myreader As New IO.StreamReader(myStream)
        Doc.write(myreader.ReadToEnd())
        Doc.close()
        wbResp.Close()

        'the part below is not completely done for all tags.
        'it can (will) be necessary to tailor that to your needs.

        Dim sb As New System.Text.StringBuilder
        For i As Integer = 0 To Doc.all.length - 1
            Dim hElm As mshtml.IHTMLElement = _
                DirectCast(Doc.all.item(i), mshtml.IHTMLElement)
            Select Case hElm.tagName.ToLower
                Case "body" '"html" ' "head" ' "form"
                Case Else
                    If hElm.innerText <> "" Then
                        sb.Append(hElm.innerText & vbCrLf)
                    End If
            End Select
        Next
        TextBox1.Text = sb.ToString
    End Sub
End Class

The trouble is that the output is duplicated: the same information
appears on multiple lines.
I explored this in a separate thread, where I tried to fix it by
writing the output to a text file and looking for duplicates. However,
it would be far easier to fix the scraper itself.
I am unfamiliar with mshtml coding, but essentially it is looking
through tags such as body, html, head, etc. Any suggestions as to why
it duplicates would be great.

K.

Cor Ligthert[MVP] · Aug 12, 2008

Kronecker,

The HttpWebRequest gives you back only the HTML content of the document at
that URL; that is not the page as you see it.

If you want to do what I understand you want, you need to use the DOM
(Document Object Model), represented by MSHTML, and learn what MSHTML is (in
fact it has all the elements from DHTML).
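
On the duplication itself, a likely cause is that Doc.all returns every
element, parents as well as their children, and the innerText of a parent
already contains the text of everything nested inside it. The same text is
therefore appended once for each enclosing element. A minimal sketch of one
way around it, reading the text once from the body element, is below.
GetPageText and ScraperSketch are hypothetical names used only for this
illustration, and the sketch assumes a project reference to Microsoft.mshtml.

```vbnet
' Assumes a project reference to the Microsoft.mshtml interop assembly.
Imports mshtml

Module ScraperSketch
    ' GetPageText is a hypothetical helper, not part of MSHTML.
    ' It reads the text once, from the body element, instead of
    ' appending innerText for every element in doc.all.
    Public Function GetPageText(ByVal doc As IHTMLDocument2) As String
        Dim body As IHTMLElement = doc.body
        If body Is Nothing OrElse body.innerText Is Nothing Then
            Return String.Empty
        End If
        Return body.innerText
    End Function
End Module
```

In the posted code this would stand in for the For loop, for example
TextBox1.Text = GetPageText(Doc). The result may still need tidying,
since innerText keeps whatever line breaks the page layout produces.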

Once you know that, you can use the Document property of the WebBrowser
to get that HTML. Be aware that one page can be made up of several frames and
so-called IFrames. Because of that, you have to evaluate all the documents
(every frame contains a document). For this the AxWebBrowser has a
DocumentComplete event and a DownloadComplete event (for the WebBrowser
there is another way).
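
As a rough sketch of the WebBrowser route, assuming a Windows Forms form
that holds a WebBrowser control named WebBrowser1 and the TextBox1 from the
original post: the handler waits until the whole page reports that it is
complete, then collects the body text of the main document and of every
frame. AppendDocumentText is a hypothetical helper name, and the handling of
frames from other sites is an assumption about what you would want.

```vbnet
Imports System.Text
Imports System.Windows.Forms

Public Class Form1
    Private Sub Form1_Load(ByVal sender As Object, ByVal e As EventArgs) Handles MyBase.Load
        WebBrowser1.Navigate("http://start.csail.mit.edu/startfarm.cgi?query=USA")
    End Sub

    ' DocumentCompleted can fire once per frame, so wait until the
    ' control reports that the whole page is complete.
    Private Sub WebBrowser1_DocumentCompleted(ByVal sender As Object, ByVal e As WebBrowserDocumentCompletedEventArgs) Handles WebBrowser1.DocumentCompleted
        If WebBrowser1.ReadyState <> WebBrowserReadyState.Complete Then Return
        Dim sb As New StringBuilder
        AppendDocumentText(WebBrowser1.Document, sb)
        TextBox1.Text = sb.ToString()
    End Sub

    ' Hypothetical helper: appends the body text of a document,
    ' then does the same for the document inside each of its frames.
    Private Sub AppendDocumentText(ByVal doc As HtmlDocument, ByVal sb As StringBuilder)
        If doc Is Nothing Then Return
        If doc.Body IsNot Nothing AndAlso doc.Body.InnerText IsNot Nothing Then
            sb.AppendLine(doc.Body.InnerText)
        End If
        If doc.Window Is Nothing Then Return
        For Each frame As HtmlWindow In doc.Window.Frames
            Try
                AppendDocumentText(frame.Document, sb)
            Catch ex As UnauthorizedAccessException
                ' Frames loaded from another site can refuse access; skip them.
            End Try
        Next
    End Sub
End Class
```

Because the text is taken once per document rather than once per element,
this should not repeat lines the way a loop over every element does.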

If you look at the bottom of IE, you see that downloading is still going on,
because images and other things such as Flash are downloaded separately.

Working with MSHTML is not easy, because its classes often have to be cast,
sometimes very deeply, because the class you cast to exposes members which
must themselves be cast.
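
To give an idea of the casting, here is a small sketch that collects the
link targets of a document. The collection hands back plain objects, and
the general element interface has no href member, so each item is cast to
the anchor interface first. GetLinkTargets and LinkSketch are hypothetical
names for this illustration, and a reference to Microsoft.mshtml is assumed.

```vbnet
' Assumes a project reference to the Microsoft.mshtml interop assembly.
Imports System.Collections.Generic
Imports mshtml

Module LinkSketch
    ' GetLinkTargets is a hypothetical helper used only for illustration.
    Public Function GetLinkTargets(ByVal doc As IHTMLDocument2) As List(Of String)
        Dim result As New List(Of String)
        Dim links As IHTMLElementCollection = doc.links
        For i As Integer = 0 To links.length - 1
            ' item() returns Object, so cast to the anchor interface to
            ' reach href. TryCast yields Nothing for any element that
            ' does not expose that interface.
            Dim anchor As IHTMLAnchorElement = TryCast(links.item(i), IHTMLAnchorElement)
            If anchor IsNot Nothing AndAlso anchor.href IsNot Nothing Then
                result.Add(anchor.href)
            End If
        Next
        Return result
    End Function
End Module
```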

The last thing is that most web page creators are not always as careful as
they should be, and many pages, including those from very professional
companies, contain many errors. Often they are created on the principle: "If
it works on my screen, then it is correct".

Cor
