HTML ContentParser in vb.net

Seven Stars

Hey guys, I have the code for this HTML content parser, see below.
My program should go to Google and extract all links referencing
the link I enter.
So if I ask my app to find all links to www.mysite.com, then it should
go to Google with this parameter: "links: www.mysite.com".
The problem is that this parser always extracts links only from the
first page of Google results.
What if Google suggests over 1000 links? How do I switch from page to
page? How do I know when I've reached the end?
This is (believe it or not) a problem for me.
7*


Imports System
Imports System.Collections
Imports System.IO
Imports System.Net
Imports System.Text
Imports System.Text.RegularExpressions
Public Class HTMLContentParser
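    ' Downloads the page at sURL and returns its HTML as a string.
    ' If the request fails, the exception message is returned instead.
    Function Return_HTMLContent(ByVal sURL As String) As String
        Try
            Dim URLReq As HttpWebRequest = CType(WebRequest.Create(sURL), HttpWebRequest)
            Using URLRes As HttpWebResponse = CType(URLReq.GetResponse(), HttpWebResponse)
                Using sReader As New StreamReader(URLRes.GetResponseStream())
                    Return sReader.ReadToEnd()
                End Using
            End Using
        Catch ex As Exception
            Return ex.Message
        End Try
    End Function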
    ' Returns every href target found in sHTMLContent, resolved against sURL.
    Function ParseHTMLLinks(ByVal sHTMLContent As String, ByVal sURL As String) As ArrayList
        Dim aMatch As New ArrayList
        ' [^>]*? keeps each match inside one <a ...> tag. The greedy "a.*href"
        ' only caught the last link when several shared a line.
        Dim rRegEx As New Regex("<a\b[^>]*?href\s*=\s*(?:""(?<1>[^""]*)""|(?<1>\S+))", _
            RegexOptions.IgnoreCase Or RegexOptions.Compiled)
        Dim mMatch As Match = rRegEx.Match(sHTMLContent)
        While mMatch.Success
            aMatch.Add(ProcessURL(mMatch.Groups(1).ToString, sURL))
            mMatch = mMatch.NextMatch()
        End While
        Return aMatch
    End Function
    ' Returns every image src found in sHTMLContent, resolved against sURL.
    Function ParseHTMLImages(ByVal sHTMLContent As String, ByVal sURL As String) As ArrayList
        Dim aMatch As New ArrayList
        ' [^>]*? keeps each match inside one <img ...> tag. The greedy "img.*src"
        ' only caught the last image when several shared a line.
        Dim rRegEx As New Regex("<img\b[^>]*?src\s*=\s*(?:""(?<1>[^""]*)""|(?<1>\S+))", _
            RegexOptions.IgnoreCase Or RegexOptions.Compiled)
        Dim mMatch As Match = rRegEx.Match(sHTMLContent)
        While mMatch.Success
            aMatch.Add(ProcessURL(mMatch.Groups(1).ToString, sURL))
            mMatch = mMatch.NextMatch()
        End While
        Return aMatch
    End Function
    ' Turns a raw href/src value into an absolute URL based on sURL.
    Private Function ProcessURL(ByVal sInput As String, ByVal sURL As String) As String
        'Find out if the sURL has a "/" after the Domain Name.
        'If not, give a "/" at the end.
        'First, check for any slash after the double slashes of the http://
        'If there is NO slash, then end the sURL string with a slash.
        If InStr(8, sURL, "/") = 0 Then
            sURL += "/"
        End If
        'FILTERING
        'Filter down to the Domain Name Directory from the Right
        Dim iCount As Integer
        For iCount = sURL.Length To 1 Step -1
            If Mid(sURL, iCount, 1) = "/" Then
                sURL = Left(sURL, iCount)
                Exit For
            End If
        Next
        'Filter out the "&gt;" from the Left
        '("&gt;" is 4 characters, hence the length-4 Mid)
        For iCount = 1 To sInput.Length
            If Mid(sInput, iCount, 4) = "&gt;" Then
                sInput = Left(sInput, iCount - 1) 'Stop and take the char before
                Exit For
            End If
        Next
        'Filter out unnecessary Characters
        sInput = sInput.Replace("&lt;", Chr(39))
        sInput = sInput.Replace("&gt;", Chr(39))
        'sInput = sInput.Replace("&quot;", "")
        sInput = sInput.Replace("'", "")
        If (sInput.IndexOf("http://") < 0) Then
            If (Not (sInput.StartsWith("/")) And Not (sURL.EndsWith("/"))) Then
                Return sURL & "/" & sInput
            Else
                If (sInput.StartsWith("/")) And (sURL.EndsWith("/")) Then
                    Return sURL.Substring(0, sURL.Length - 1) + sInput
                Else
                    Return sURL + sInput
                End If
            End If
        Else
            Return sInput
        End If
    End Function
End Class
 
The only thing I can think of is to do it this way.

Do the first search:
http://www.google.ca/search?hl=en&ie=UTF-8&oe=UTF-8&q=links:+yoursite.com&meta=

The second page then looks like this:
http://www.google.ca/search?q=links...-8&oe=UTF-8&start=10&sa=N

So you see an additional parameter: start=10.
Then you'd go to start=20, and so on...
Once you get to a page where you count fewer than 10 extracted
links, you'll know that it is the last page, and you stop.

That should be the easiest way of doing it.
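
Here is a rough sketch of that paging idea, in case it helps. The names PagingSketch, BuildUrl, CollectAllLinks and fetchResultLinks are made up for illustration, and the page size of 10 is just the default described above. One assumption to watch: the callback has to return only the result links, because a parser that returns every href on the page will also pick up the search engine's own navigation links and never see a "short" page. Main uses a fake data source so the stop condition can be tried without hitting the network.

```vbnet
Imports System
Imports System.Collections.Generic

Module PagingSketch

    ' Builds the results URL for a given offset. The query string mirrors the
    ' first URL quoted above; start= is the only part that changes per page.
    Function BuildUrl(ByVal site As String, ByVal start As Integer) As String
        Return String.Format( _
            "http://www.google.ca/search?hl=en&ie=UTF-8&oe=UTF-8&q={0}&start={1}", _
            Uri.EscapeDataString("links: " & site), start)
    End Function

    ' fetchResultLinks is a hypothetical callback: given a start offset it
    ' downloads that page and returns only the result links found on it.
    Function CollectAllLinks(ByVal fetchResultLinks As Func(Of Integer, List(Of String)), _
                             ByVal pageSize As Integer, _
                             ByVal maxPages As Integer) As List(Of String)
        Dim collected As New List(Of String)
        For page As Integer = 0 To maxPages - 1
            Dim links As List(Of String) = fetchResultLinks(page * pageSize)
            collected.AddRange(links)
            ' A short page means there is nothing after it.
            If links.Count < pageSize Then Exit For
        Next
        Return collected
    End Function

    Sub Main()
        ' Stand-in data source: 25 fake links served 10 at a time.
        Dim fake As Func(Of Integer, List(Of String)) = _
            Function(start As Integer)
                Dim page As New List(Of String)
                For i As Integer = start To Math.Min(start + 9, 24)
                    page.Add("http://example.com/" & i.ToString())
                Next
                Return page
            End Function

        Console.WriteLine(BuildUrl("yoursite.com", 10))
        ' Three pages are fetched (10 + 10 + 5 links), so this prints 25.
        Console.WriteLine(CollectAllLinks(fake, 10, 100).Count)
    End Sub

End Module
```

The maxPages cap is there so a misbehaving callback cannot loop forever. In the real program the callback would wrap Return_HTMLContent(BuildUrl(site, start)) plus whatever filtering picks the result links out of the page.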

Does anyone else have something to say about this problem? Or would my
solution be at least OK?

Vjay

Backup at: http://www.dotnetboards.com/viewtopic.php?t=7621




 
On 3 Jan 2004 07:05:00 -0600,
> Problem is that this parser will always extract links only from the
> first page on google.
> What if google suggests over 1000 links? How do I switch from page to
> page? How do I know when it's the end?

For fun, I wrote a threaded program called "Google Domain Name
Harvester" that uses a text-file dictionary of words and repeatedly
submits searches to Google for each word. It uses the Google search
option of returning 100 links per page, and it processes up to 10
pages of links per search.

It then keeps a local cache of up to 5,000 unique domain names. Once
the local list of domain names gets larger than 5,000, it dumps the
list to an MS-SQL server table where it stores the "full" list of
names.
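
For anyone wondering what such a cache could look like, here is a minimal sketch. It is not the harvester's actual code: DomainCache, AddUrl and the flush callback are hypothetical names, and the database write is left to whatever the callback does. It also assumes the table itself rejects duplicates, since the in-memory set forgets everything after each flush.

```vbnet
Imports System
Imports System.Collections.Generic

' Hypothetical helper, not the poster's code: keeps unique host names in
' memory and hands them to a flush callback once the cache passes a limit.
Public Class DomainCache
    Private ReadOnly m_hosts As New HashSet(Of String)(StringComparer.OrdinalIgnoreCase)
    Private ReadOnly m_limit As Integer
    Private ReadOnly m_flush As Action(Of ICollection(Of String))

    Public Sub New(ByVal limit As Integer, ByVal flush As Action(Of ICollection(Of String)))
        m_limit = limit
        m_flush = flush
    End Sub

    ' Records the host part of url. Returns True if the host was not already cached,
    ' False if it was a duplicate or url is not an absolute URL.
    Public Function AddUrl(ByVal url As String) As Boolean
        Dim parsed As Uri = Nothing
        If Not Uri.TryCreate(url, UriKind.Absolute, parsed) Then Return False

        SyncLock m_hosts ' several worker threads may add at once
            Dim isNew As Boolean = m_hosts.Add(parsed.Host)
            If m_hosts.Count > m_limit Then
                m_flush(m_hosts) ' e.g. write the batch to the database table
                m_hosts.Clear()
            End If
            Return isNew
        End SyncLock
    End Function
End Class
```

With a limit of 5,000 this would be created as New DomainCache(5000, AddressOf SaveBatch), where SaveBatch is whatever routine writes the names to the server. The flush runs while the lock is held, which keeps the sketch simple but blocks the other threads during the write.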

The main UI keeps counters; I have it running now (it's been running
for over a week):

Total Pages: 1,101,007
Total Bytes: 72,087,957,614
Total Domains: 4,175,306

So it's grabbed 1.1 million pages from google, totalling a little over
67 gigabytes of html downloaded. That's about 64k per web page. In
those pages, it's found 4.175 million unique domain names.

Removing the chaff, the real meat is this routine:

Public Function ProcessItem(ByVal nIndex As Integer)

    Dim szWord As String = CType(m_WORDS.Item(nIndex), String)
    Dim szWebPage As String
    Dim szURL As String
    Dim nSearchPage As Integer

    ' Up to 10 result pages per word
    For nSearchPage = 0 To 9

        Try

            RaiseEvent StatusUpdate(String.Format("{0}/{1} '{2}' ({3}/{4}) [{5}]", _
                nIndex + 1, m_WORDS.Count(), szWord, nSearchPage, _
                10, m_DOMAINS.Count()), nIndex, m_nDomains, m_nPages, m_nBytes)

            ' num=100 asks for 100 results per page, so page N starts at N * 100
            szURL = String.Format( _
                "http://www.google.com/search?" _
                + "q={0}&num=100&hl=en&lr=&ie=UTF-8&oe=UTF-8" _
                + "&as_qdr=all&start={1}&sa=N&filter=0", _
                szWord, nSearchPage * 100)

            szWebPage = GetWebPage(szURL)
            If Not IsNothing(szWebPage) Then
                GetPageURLS(szWebPage)
                m_nPages = m_nPages + 1
            End If

        Catch ex As Exception
            'Ignored
        End Try

    Next

    'Update Word ID
    m_nWordID = m_nWordID + 1

    Return Nothing

End Function


That's how I did it..
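
The routine above calls a GetWebPage helper that isn't shown. Purely as a guess at its shape, here is a sketch that returns Nothing on a failed request, which is what the IsNothing check relies on. FetchSketch and m_nChars are made-up names, and the counter tracks characters rather than bytes on the wire.

```vbnet
Imports System
Imports System.Net
Imports System.Threading

Module FetchSketch

    Private m_nChars As Long ' running total of characters downloaded

    ' Hypothetical stand-in for GetWebPage: returns the page text, or Nothing
    ' when the request fails.
    Function GetWebPage(ByVal szURL As String) As String
        Try
            Using client As New WebClient()
                Dim szPage As String = client.DownloadString(szURL)
                ' Interlocked keeps the total correct when several threads download at once.
                Interlocked.Add(m_nChars, CLng(szPage.Length))
                Return szPage
            End Using
        Catch ex As WebException
            Return Nothing
        End Try
    End Function

End Module
```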

// CHRIS
 