remy rakic
Hi all, I was trying to parse some HTML and ran into trouble with
some regex processing (which I have never done before).
What I am trying to do is get the content between two tags, including any
HTML code it contains. I can do something like this:
"<a>([\w\s]*)</a>" on "<a>Not cool</a><a>Absolutely not</a>", but that obviously only
gets plain text content and no HTML tags. I wonder if someone could
enlighten me on which regex to use in order to get the results "<really>Really
not<cool/><at>all</at>" and "Absolutely not" from the string
"<tag><tag2><a><really>Really
not<cool/><at>all</at></a></tag2>...<tag3><a>Absolutely
not</a></tag3></tag>"? (Note that I can't use XPath, since I'm not sure whether
the site is XHTML compliant or not; the example above is not valid XML.)
Should I process the content twice, or give up the regex approach in favor of
regular 'string index' parsing?
Thanks in advance
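
To make the question concrete, here is a minimal sketch of the kind of result I am after. It assumes Python's re module and a lazy ".*?" match with the DOTALL flag, neither of which is part of my original attempt, and the names pattern, html and matches are just for illustration. It only covers plain "<a>" tags with no attributes and no "<a>" nested inside another "<a>", so it is not a general HTML parser.

```python
import re

# Lazy match: capture everything between <a> and the nearest </a>,
# including any tags in between. re.DOTALL lets '.' also match newlines.
pattern = re.compile(r"<a>(.*?)</a>", re.DOTALL)

# The sample string from the question, written on one line.
html = (
    "<tag><tag2><a><really>Really not<cool/><at>all</at></a></tag2>..."
    "<tag3><a>Absolutely not</a></tag3></tag>"
)

# findall returns the contents of the single capture group for each match.
matches = pattern.findall(html)

for content in matches:
    print(content)
```

The literal "<a>" and "</a>" in the pattern do not match "<at>" or "</at>", which is why the inner tags end up inside the captured text instead of ending the match early.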