Sriram Krishnan
I'm doing some search-engine-related work and want to match the actual
content of an HTML page (i.e. any character which is not between a < and a >).
I first wrote
(?:\<.*?>) (?<content>.*?) (?:\<.*?>)
which basically says: match any text between an opening and a closing tag. The
problem is that you almost always have nested tags, and this expression
chokes on them. For example, given something like
<a href=""><img/></a>
it would match the '<img/>' part.
So I came up with
(?![\<|>].*?>)
But the problem with this negative lookahead is that it doesn't advance
beyond the first negation; it just stops there. I have a feeling that saying
what I *don't* want is the way to go.
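To make the "say what I don't want" idea concrete, here is a minimal sketch. It assumes Python's `re` module rather than the engine used above, and the name `text_outside_tags` is hypothetical. It describes a tag as `<`, any run of characters that are not `>`, then `>`, and splits the input on that, so only the text outside the tags is left. It is not a full HTML parser: comments, script bodies, and a `>` inside an attribute value would all confuse it.

```python
import re

# What we do NOT want: anything from '<' up to the next '>'.
# [^>]* is a negated character class, so the match cannot run past
# the closing '>' of the tag it started in.
TAG = re.compile(r"<[^>]*>")


def text_outside_tags(html):
    """Return the non-blank text runs that lie outside <...> pairs."""
    # Splitting on the tag pattern leaves only the text between tags.
    pieces = TAG.split(html)
    # Drop the empty or whitespace-only pieces that adjacent tags produce.
    return [piece for piece in pieces if piece.strip()]


if __name__ == "__main__":
    print(text_outside_tags('<p>Hello <b>world</b>!</p>'))
    print(text_outside_tags('<a href=""><img/></a>'))
```

Because the pattern consumes each tag as a whole, nesting does not matter here: the tags are removed one after another and nothing has to be matched in pairs.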
I'm a bit of a newbie to regex, and I'm trying to write a regex which says
something like 'match any text that doesn't match this expression'. Or is
there any way to do recursive regex matching, that is, within a pattern,
match the pattern itself? In that case, the first pattern could be made to
work, as I could have a recursive call inside the (?<content>) pattern which
keeps going down until there are no more nested tags.
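For comparison, nested tags can also be handled without any recursion in the pattern by letting a parser walk the markup. The sketch below swaps in a different technique from the regex approach above: it assumes Python's standard `html.parser` module, and the names `TextCollector` and `visible_text` are hypothetical. Note that it would also return the contents of script and style elements, since those are reported as data too.

```python
from html.parser import HTMLParser


class TextCollector(HTMLParser):
    """Collect the text that appears between tags, at any nesting depth."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        # Called for each run of text outside a tag; skip blank runs.
        if data.strip():
            self.chunks.append(data)


def visible_text(html):
    """Return the non-blank text runs found in the given HTML string."""
    parser = TextCollector()
    parser.feed(html)
    parser.close()  # flush any text still buffered at end of input
    return parser.chunks


if __name__ == "__main__":
    print(visible_text('<div><a href="x"><img/>link</a> tail</div>'))
```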
Thanks in advance