K
kronecker
A screen scraper is a program that extracts only the text from a web page.
I pinched this one from the web:
Public Class Form1
Private Sub Form1_Load(ByVal sender As System.Object, _
ByVal e As System.EventArgs) Handles MyBase.Load
Me.TextBox1.Multiline = True
Me.TextBox1.ScrollBars = ScrollBars.Both
'above only for showing the sample
Dim Doc As mshtml.IHTMLDocument2
Doc = New mshtml.HTMLDocumentClass
Dim wbReq As Net.HttpWebRequest = _
DirectCast(Net.WebRequest.Create( _
"http://start.csail.mit.edu/startfarm.cgi?query=USA"), _
Net.HttpWebRequest)
Dim wbResp As Net.HttpWebResponse = _
DirectCast(wbReq.GetResponse(), Net.HttpWebResponse)
Dim wbHCol As Net.WebHeaderCollection = wbResp.Headers
Dim myStream As IO.Stream = wbResp.GetResponseStream()
Dim myreader As New IO.StreamReader(myStream)
Doc.write(myreader.ReadToEnd())
Doc.close()
wbResp.Close()
'the part below is not completely done for all tags.
'it can (will) be necessary to tailor that to your needs.
Dim sb As New System.Text.StringBuilder
For i As Integer = 0 To Doc.all.length - 1
Dim hElm As mshtml.IHTMLElement = _
DirectCast(Doc.all.item(i), mshtml.IHTMLElement)
Select Case hElm.tagName.ToLower
Case "body" '"html" ' "head" ' "form"
Case Else
If hElm.innerText <> "" Then
sb.Append(hElm.innerText & vbCrLf)
End If
End Select
Next
TextBox1.Text = sb.ToString
End Sub
End Class
The trouble is that the output contains the same text duplicated
across multiple lines.
I explored this in a separate thread, where I tried to fix it by
writing the output to a text file and looking for duplicates. However,
it would be far easier to fix the scraper itself.
I am unfamiliar with mshtml coding, but essentially it is looking at
tags such as "body", "html", "head", etc. Any suggestions as to why it
duplicates would be great.
K.
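A likely explanation for the duplication: Doc.all enumerates every element in the page, nested ones included, and innerText on an element also returns the text of all its descendants. A sentence inside a table cell is therefore appended once for the cell, once for the row, once for the table, and so on up the tree. The sketch below assumes that only the page's visible text is wanted, so it reads innerText from the body element a single time instead of looping. ExtractBodyText is a hypothetical helper name, and the project still needs the Microsoft.mshtml reference used in the code above.

```vb
' Sketch only: assumes a project reference to Microsoft.mshtml.
Module ScrapeSketch
    ' Hypothetical helper: returns the text of an HTML string once,
    ' without walking every element in doc.all.
    Function ExtractBodyText(ByVal html As String) As String
        Dim doc As mshtml.IHTMLDocument2 = New mshtml.HTMLDocumentClass()
        doc.write(html)
        doc.close()
        ' body.innerText already includes the text of every descendant,
        ' so appending each child's innerText as well would repeat it.
        If doc.body Is Nothing Then Return String.Empty
        Dim text As String = doc.body.innerText
        If text Is Nothing Then Return String.Empty
        Return text
    End Function
End Module
```

In the form above, the whole For loop could then be replaced by something like TextBox1.Text = ExtractBodyText(myreader.ReadToEnd()). If per-tag filtering is still needed, another option is to keep the loop but append text only for elements that have no child elements, at the cost of losing text that sits directly in a parent alongside child tags.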