System.Text.RegularExpressions.Regex

Mike Labosh · May 19, 2004

Greetings:

I'm writing a utility that scrapes certain data and text from web pages, but
I'm having trouble with the pattern that I want to use to remove HTML tags.

"\<.+\>(\r\n)*" works _really_ well, but I'm having trouble with <a> tags.
For links, I want to keep the text of the link and discard the HTML, i.e., <a
href="someurl">Get this document</a> should become simply "Get this document".

I also tried several variations of "\<[^\>]+\>(\r\n)*" to see if I can drop
everything inside a set of <> that is not a >.
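
To see why the two patterns behave differently on links, here is a minimal sketch, assuming the link example above as input. The module and function names are made up for illustration, and the patterns are written without the backslashes before < and >, which .NET does not need.

```vbnet
Imports System
Imports System.Text.RegularExpressions

Module GreedyVersusNegatedClass

    ' Hypothetical sample input, taken from the link example above.
    Public Const Sample As String = "<a href=""someurl"">Get this document</a>"

    ' Greedy: .+ runs on to the last > on the line, so the whole link,
    ' including its text, is matched as if it were one big tag.
    Public Function StripGreedy(ByVal html As String) As String
        Return Regex.Replace(html, "<.+>(\r\n)*", "")
    End Function

    ' Negated class: [^>]+ cannot cross a >, so the opening and closing
    ' tags are matched separately and the text between them survives.
    Public Function StripPerTag(ByVal html As String) As String
        Return Regex.Replace(html, "<[^>]+>(\r\n)*", "")
    End Function

    Sub Main()
        Console.WriteLine("[" & StripGreedy(Sample) & "]")   ' prints []
        Console.WriteLine("[" & StripPerTag(Sample) & "]")   ' prints [Get this document]
    End Sub

End Module
```

A lazy quantifier ("<.+?>") would probably also work for this input, but unlike [^>]+ the dot does not match line breaks by default, so a tag split across lines would be missed.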

Any help or thoughts?

Mike Labosh · May 20, 2004

I GOT IT! YAY!

In the app I call the respective Replace() method. rxDrop throws away all
HTML tags, keeping only the "stuff" between tags of the form
<tag>stuff</tag>, and also removes any CRLFs that directly follow a tag.
rxKeep does the same thing but leaves those CRLFs in place. Feel free to
reuse, and happy Regex'ing to all.

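' Needed at the top of the source file:
Imports System.Text.RegularExpressions

' Strips HTML tags from html and trims the result.
' keepCRLF = True leaves the line breaks after tags in place;
' keepCRLF = False also removes any CRLFs that directly follow a tag.
Private Function stripHTML( _
    ByVal html As String, _
    ByVal keepCRLF As Boolean _
) As String

    ' [^\>]+ stops at the first >, so each tag is matched on its own.
    Dim rxDrop As New Regex("(\<[^\>]+)\>(\r\n)*")
    Dim rxKeep As New Regex("(\<[^\>]+)\>")

    If keepCRLF Then
        Return rxKeep.Replace(html, "").Trim()
    Else
        Return rxDrop.Replace(html, "").Trim()
    End If

End Function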
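
For anyone reusing this, a small usage sketch of what the keepCRLF flag changes in practice. The module name and the two-paragraph sample are assumptions for illustration, not part of the original app; the patterns are the same as in stripHTML above, minus the capture group and the backslashes before < and >, neither of which affects what gets matched.

```vbnet
Imports System
Imports System.Text.RegularExpressions
Imports Microsoft.VisualBasic

Module StripHtmlUsage

    ' Same matching behavior as rxDrop and rxKeep in stripHTML above.
    Private ReadOnly rxDrop As New Regex("<[^>]+>(\r\n)*")
    Private ReadOnly rxKeep As New Regex("<[^>]+>")

    ' Hypothetical sample: two paragraphs, each followed by a CRLF.
    Public ReadOnly Sample As String = _
        "<p>First line</p>" & ControlChars.CrLf & _
        "<p>Second line</p>" & ControlChars.CrLf

    Public Function StripHtml(ByVal html As String, ByVal keepCRLF As Boolean) As String
        If keepCRLF Then Return rxKeep.Replace(html, "").Trim()
        Return rxDrop.Replace(html, "").Trim()
    End Function

    Sub Main()
        ' CRLFs after tags are dropped, so the two lines run together.
        Console.WriteLine(StripHtml(Sample, False))   ' prints First lineSecond line

        ' CRLFs are kept, so the text stays on two lines
        ' (Trim() removes only the trailing one).
        Console.WriteLine(StripHtml(Sample, True))
    End Sub

End Module
```

Note that with keepCRLF = False, text from adjacent block elements ends up joined with no separator, so it may be worth replacing with a space instead of an empty string if word boundaries matter.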
