K
Krakatioison
I realized I am an idiot. I wasted a whole Saturday on this crazy regex crap,
and I can't even look at it anymore... ugh.
Honestly, I am even willing to pay someone who is able to solve this.
Please help. I really used all my knowledge on this and I cannot get it to
work.
Let me explain:
I need to parse the URL, the URL text, and the description text from Google News.
Every headline in Google News has the same structure, and it looks like this:
<a class="y" href="http://news.google.com/url?ntc=05SA0&q=http://www.canada.com/sports/story.html%3Fid%3D5FBD7D23-AA7A-4E4C-AE1D-01CEBC350782"> Everyone expected Ullrich to show up - instead it was teammate Kloden</a><br><font size="-1" style="font-family: arial,sans-serif"><b>
<font color="#6f6f6f" style="font-family: arial,sans-serif">Canada.com -</font>15 minutes ago</b><br>
BESANCON, France (AP) - Look for a German cyclist to be on the Tour de France podium Sunday - just not the one most people expected. <br>
If needed, go to, let's say, http://news.google.com/news/en/us/sports.html and
look at the page source.
So, this is my parser function for grabbing links. It works (thank God), and
with it I am able to extract the URLs of all 20 major headlines:
Public Function ParseLinks(ByVal HTML As String) As System.Collections.ArrayList
    Dim objRegEx As System.Text.RegularExpressions.Regex
    Dim objMatch As System.Text.RegularExpressions.Match
    Dim arrLinks As New System.Collections.ArrayList
    ' Capture the href value that follows class "y"; the lookahead skips
    ' anchors, mailto, location, javascript, css and "this." links.
    objRegEx = New System.Text.RegularExpressions.Regex( _
        "(?:y [hH][rR][eE][fF]\s*=)(?:[\s""']*)(?!#|[Mm]ailto|[lL]ocation.|[jJ]avascript|.*css|.*this\.)(.*?)(?:[\s>""'])", _
        System.Text.RegularExpressions.RegexOptions.IgnoreCase Or _
        System.Text.RegularExpressions.RegexOptions.Compiled)
    objMatch = objRegEx.Match(HTML)
    ' Collect capture group 1 (the URL) from every match.
    While objMatch.Success
        Dim strMatch As String
        strMatch = objMatch.Groups(1).ToString
        arrLinks.Add(strMatch)
        objMatch = objMatch.NextMatch()
    End While
    Return arrLinks
End Function
Now, as you have probably guessed already, my problem is that I am simply not
able to write the same kind of function for extracting the URL text and the
description text.
Please help.
K.
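One possible direction for the other two fields, as a rough sketch only: a
single pattern with three capture groups that picks up the URL, the URL text,
and the description text in one pass. The names HeadlineSketch, ParseHeadlines,
strPattern, arrHeadlines and arrFields are hypothetical and not from the post.
The pattern assumes the page source matches the quoted sample (the source line
closed by </b><br>, the description ended by the next <br>), and it lets the
quotes around the attribute values be optional in case the live source omits
them. It has not been run against the live page.

```vbnet
Imports System.Collections
Imports System.Text.RegularExpressions

Public Module HeadlineSketch
    ' Hypothetical helper (not part of the original post).
    ' Returns one String() {url, urlText, description} per headline.
    Public Function ParseHeadlines(ByVal HTML As String) As ArrayList
        ' Group 1: href value, Group 2: link text, Group 3: description.
        ' Assumption: the source line always ends with </b><br> and the
        ' description runs up to the next <br>, as in the quoted sample.
        Dim strPattern As String = _
            "<a\s+class=[""']?y[""']?\s+href=[""']?([^""'\s>]+)[""']?[^>]*>" & _
            "(.*?)</a>.*?</b>\s*<br>(.*?)<br>"
        ' Singleline lets "." match line breaks inside one headline block.
        Dim objRegEx As New Regex(strPattern, _
            RegexOptions.IgnoreCase Or RegexOptions.Singleline)
        Dim arrHeadlines As New ArrayList
        Dim objMatch As Match = objRegEx.Match(HTML)
        While objMatch.Success
            Dim arrFields() As String = New String() { _
                objMatch.Groups(1).Value.Trim(), _
                objMatch.Groups(2).Value.Trim(), _
                objMatch.Groups(3).Value.Trim()}
            arrHeadlines.Add(arrFields)
            objMatch = objMatch.NextMatch()
        End While
        Return arrHeadlines
    End Function
End Module
```

Each entry can be read back with DirectCast(arrHeadlines(i), String()), where
index 0 is the URL, 1 the URL text, and 2 the description. Because the skips
between the groups are lazy, a headline that lacks the </b><br> part could make
a match run on into the next headline, so the result is worth checking against
the 20 links that ParseLinks returns.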