Regex doubt


Sriram Krishnan

I'm doing some search-engine related work and want to match the actual
content of an HTML page (i.e. any character which is not between a < and a >).
I first wrote

(?:\<.*?>) (?<content>.*?) (?:\<.*?>)

which basically says: match any text between an opening and a closing tag. The
problem with this is that you almost always have nested tags. This expression
is braindead, as it chokes on nested tags. So on something like <a
href=""><img/></a>/ it would match the '<img/>' part.

So I came up with

(?![\<|>].*?>)

But the problem with this negative lookahead is that it doesn't advance
beyond the first negation - it just stops there. I have a feeling that saying
what I *don't* want is the way to go.

I'm a bit of a newbie to regex, and I'm trying to write a regex which says
something like 'match any text that doesn't match this expression'. Or is
there any way to do recursive regex matching, that is, within a pattern,
match the pattern itself? In that case, the first pattern could be made to
work, as I could have a recursive call inside the (?<content>) pattern which
keeps going down until there are no more nested tags.

Thanks in advance
 
A good friend of mine recently posted an article on his blog regarding using
regular expressions to match HTML. His article can be found at:

http://haacked.com/archive/2004/10/25/1471.aspx

Hope this helps.

--
Ben Lucas
Lead Developer
Solien Technology, Inc.
www.solien.com


 
Nice article - but it doesn't do what I want. His article is on how to match
tags - my doubt is how to match all the other non-tag content. And I really
don't want to use the HTML Agility Pack - I'm learning regex and want to
figure out how to do this. So what I'm looking for is something like "match
all the text that doesn't match that expression".

--
Sriram Krishnan

http://www.dotnetjunkies.com/weblog/sriram


 
Sriram,

I am not a Regular Expressions expert myself, but I ran this by Phil, the
author of the article I sent you. This was his response:

"Simplest option is to do a Regex.Replace with my expression and replace
with empty string. Then what you have left is non-tag content. Sometimes
the best use of Regexp is to match what you don't want and get rid of it."
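
As a rough illustration of that replace-with-empty-string idea, here is a minimal sketch in Python rather than .NET (re.sub plays the role of Regex.Replace). The tag pattern <[^>]*> and the name strip_tags are assumptions made for the example; this is not the expression from the article, and it will misbehave on things like a ">" inside an attribute value.

```python
import re

# Assumption: a deliberately simple tag pattern, NOT the expression from the article.
SIMPLE_TAG = re.compile(r"<[^>]*>")


def strip_tags(html: str) -> str:
    """Replace every tag match with the empty string, leaving the non-tag text."""
    return SIMPLE_TAG.sub("", html)


print(strip_tags('<a href=""><img/></a>Some link text'))
```

Note that text from adjacent elements is concatenated with nothing in between, so replacing with a single space instead of the empty string may suit search indexing better.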

--
Ben Lucas
Lead Developer
Solien Technology, Inc.
www.solien.com


 
Thanks :-) Sometimes the answer stares you in the face and you don't
bother to see it. I'll use this for now.

Out of academic interest, is there a way to do that - to say 'match all
text that doesn't match this pattern'?

Thanks once again

--
Sriram Krishnan

http://www.dotnetjunkies.com/weblog/sriram


 
Hi, This is Phil.

The problem with trying to specifically match HTML content is that HTML is a
nested structure. So imagine the following HTML string:

<p><p><p><p>text</p>blah</p></p></p>

Trying to match text there is very difficult. Regular expressions aren't
well suited for matching nested structures. Microsoft did innovate a syntax
for matching nested structures in .NET (balancing group definitions), but
it's not standard regex (not that anything is standard) and is difficult to
understand.
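
To make the nesting problem concrete, here is a small sketch in Python (an assumption on my part; the thread itself is about .NET, but a lazy quantifier behaves the same way here). The pattern <p>(.*?)</p> is a naive illustration, not anyone's recommended expression.

```python
import re

html = "<p><p><p><p>text</p>blah</p></p></p>"

# Naive attempt: capture whatever sits between an opening and a closing <p>.
# The lazy .*? stops at the FIRST </p>, which belongs to the innermost <p>,
# while the match itself started at the outermost <p>.
m = re.search(r"<p>(.*?)</p>", html)
inner = m.group(1)
print(inner)  # <p><p><p>text
```

The captured group pairs the outermost opening tag with the innermost closing tag, so it still contains three stray opening tags.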

If you try a negation approach, you can try using negative lookahead.
Something like:

(?!expression).*

This basically says look ahead without consuming any characters and if the
following sequence does NOT match the expression, then match any characters.

In fact, one thing I just thought of to try is to match:

(?!HtmlTagRegularExpression).*?

If that works, let me know.
 
Yeah. As I was driving home I realized that it wouldn't work, but it's too
late.

Roughly what that is doing is checking at every character to make sure it's
not followed by an HTML tag. However, if you're inside an HTML tag, then the
negative lookahead wouldn't stop you from continuing to match. I'll think
about this more.
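
A quick way to see this is the following Python sketch (an assumption; the thread is about .NET, but lookahead works the same way for this pattern). The expression <[^>]*> is a simplified stand-in for a real HTML tag expression.

```python
import re

# Assumption: "<[^>]*>" stands in for a real HTML tag expression.
# Each character is consumed only if a tag does not START at that position.
NOT_AT_TAG_START = re.compile(r"(?:(?!<[^>]*>).)+")

matches = NOT_AT_TAG_START.findall("<b>hi</b>")
print(matches)  # ['b>hi', '/b>']
```

Only the "<" that opens each tag is rejected. Once the engine is one character inside the tag, no tag starts at the current position, so the rest of the tag is matched along with the text.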

I've written an HTML parser using regexes by keeping track of the index of
each match and then grabbing everything between matches. However, in
general, using something like the HTML Agility Pack is a better solution as
XPath is great at finding specific nodes.
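
The index-tracking approach might look roughly like this in Python (a sketch under assumptions: the simple tag pattern <[^>]*> and the function name text_between_tags are made up for illustration and are not the parser described above).

```python
import re

# Assumption: a simple stand-in tag pattern, not a full HTML tag expression.
TAG = re.compile(r"<[^>]*>")


def text_between_tags(html: str) -> list[str]:
    """Collect the non-empty stretches of text that lie between tag matches."""
    pieces = []
    last_end = 0  # index just past the previous tag match
    for m in TAG.finditer(html):
        if m.start() > last_end:
            pieces.append(html[last_end:m.start()])
        last_end = m.end()
    if last_end < len(html):
        pieces.append(html[last_end:])  # trailing text after the last tag
    return pieces


print(text_between_tags("<p><p><p><p>text</p>blah</p></p></p>"))  # ['text', 'blah']
```

This is effectively "everything that does not match the pattern", obtained by matching what you don't want and keeping the gaps.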

But for regex learning purposes, you picked a great problem.
 
Hi
I want to match all strings not containing a specific expression, so when I
found your reply and the regex "(?!expression).*" I thought: great, this is
what I want. But I can't get it to work.
It matches everything as far as I can see. I tried with the following VB.NET
code:

Imports System.Text.RegularExpressions

' Inside a Sub or Function:
Dim TestRegex As Regex = New Regex("(?!exp).*")
Dim S As String = "exp"
Dim M As Match = TestRegex.Match(S)

If M.Success Then
    Debug.Write("Match")
Else
    Debug.Write("No Match")
End If
 
I was waiting for someone else to answer this but it has been long enough
with no response so here is my try.

If you want to match an entire string, then you must specify that in the
regular expression. Your expression "(?!exp).*" is not anchored, so the engine
starts at the beginning of the string and looks ahead for "exp". In the traces
below, "|" marks the current position of the regex search.

"|exp some other stuff" - the attempt fails here since it sees "exp" ahead
When an attempt fails, the engine advances by one character and looks ahead
for "exp" again.
"e|xp some other stuff" - the attempt succeeds since it sees "xp " ahead
"xp some other stuff" - returned value (.*)

Now look at this again with "exp" not at the start of the string.

The regex starts at the beginning and looks ahead for "exp".
"|some other exp stuff" - the attempt succeeds since it sees "som" ahead
"some other exp stuff" - returned value (.*)

So you see, you cannot use "(?!exp).*" on its own to match only strings that
do not contain "exp".
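
The traces above can be reproduced with a short Python sketch (an assumption; the question used VB.NET, but these constructs behave the same way). The anchored pattern ^(?!.*exp).*$ is one common way to add the missing bounds for single-line input; it is my suggestion, not something proposed earlier in the thread.

```python
import re

unanchored = re.compile(r"(?!exp).*")

# Fails at position 0, then succeeds one character later.
first = unanchored.search("exp some other stuff").group()
# "exp" is not at the start, so the very first attempt succeeds.
second = unanchored.search("some other exp stuff").group()
# The string from the VB.NET test: still a match, just starting at "x".
third = unanchored.search("exp").group()

# Anchored variant: from the start, refuse if "exp" occurs anywhere ahead.
no_exp = re.compile(r"^(?!.*exp).*$")
rejected = no_exp.search("some other exp stuff")
accepted = no_exp.search("some other stuff")

print(first, "|", second, "|", third)
print(rejected, accepted.group())
```

With the anchors in place, the lookahead is tested only once, at the start of the string, and it scans the whole string for "exp" before anything is matched.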

Hope this helps

--Robby
 