Mike Labosh
Greetings:
I'm writing a utility that scrapes certain data and text from web pages, but
I'm having trouble with the pattern that I want to use to remove HTML tags.
"\<.+\>(\r\n)*" works _really_ well, but I'm having trouble with <a> tags.
For links, I want to keep the text of the link and discard the HTML, i.e., <a
href="someurl">Get this document</a> should become simply: Get this document
I also tried several variations of "\<[^\>]+\>(\r\n)*" to see if I can drop
everything from a < up to the first > that follows it.
Any help or thoughts?
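
For anyone comparing the two patterns, here is a minimal sketch of how they behave on a sample line. It uses Python's re module, which is an assumption (the post does not say which regex engine is in use), and the sample HTML string is made up for illustration. The greedy ".+" runs from the first < to the last > on the line, so the link text goes with it, while the negated class stops at the first >.

```python
import re

# Hypothetical sample line; only the <a> element comes from the post.
html = '<p>See <a href="someurl">Get this document</a> today.</p>\r\n'

# Greedy pattern: ".+" extends to the last ">" on the line,
# so the text between the tags is removed along with the tags.
greedy = re.compile(r"<.+>(\r\n)*")

# Negated class: each match ends at the first ">" after a "<",
# so only the tags (and trailing CRLFs) are removed.
per_tag = re.compile(r"<[^>]+>(\r\n)*")

greedy_result = greedy.sub("", html)
per_tag_result = per_tag.sub("", html)

print(repr(greedy_result))
print(repr(per_tag_result))
```

Note that a pattern like this is only a rough tool: a > inside an attribute value, an HTML comment, or a script block can still confuse it, so an HTML parser may be safer for arbitrary pages.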