Regex doubt

Sriram Krishnan · Oct 26, 2004

I'm doing some search-engine related work and want to match the actual
content of an HTML page (i.e. any character which is not between a < and a >).
I first wrote

(?:\<.*?>) (?<content>.*?) (?:\<.*?>)

which basically says: match any text between an opening and a closing tag.
The problem is that you almost always have nested tags, and this expression
is braindead: it chokes on them. So on something like <a
href=""><img/></a> it would match the '<img/>' part.
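
To see concretely what happens, here is a minimal sketch. It uses Python's `re` module rather than .NET (an assumption made so the snippet is easy to run, which is why the named group is written `(?P<content>...)`). It also assumes that the spaces in the pattern are not meant literally and that the last group was intended to be `(?:\<.*?>)`.

```python
import re

# Assumed reading of the first pattern: no literal spaces, third group repaired.
FIRST_TRY = re.compile(r"(?:<.*?>)(?P<content>.*?)(?:<.*?>)")

def content_of(html):
    """Return the 'content' group of the first match, or None if nothing matches."""
    m = FIRST_TRY.search(html)
    return m.group("content") if m else None

flat = content_of("<b>hello</b>")              # plain text between two tags
nested = content_of('<a href=""><img/></a>')   # a tag nested inside another tag
print(repr(flat), repr(nested))
```

On the nested input the lazy `content` group comes back empty, because the group meant for the closing tag consumes `<img/>` instead.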

So I came up with

(?![\<|>].*?>)

But the problem with this negative lookahead is that it doesn't advance
beyond the first negation - it just stops there. I have a feeling that saying
what I *don't* want is the way to go.

I'm a bit of a newbie to regex, and I'm trying to write one which says
something like 'match any text that doesn't match this expression'. Or is
there any way to do recursive regex matching, that is, within a pattern,
match the pattern itself? In that case, the first pattern could be made to
work, as I could have a recursive call inside the (?<content>) pattern which
keeps going down until there are no more nested tags.

Thanks in advance

Ben Lucas · Oct 26, 2004

A good friend of mine recently posted an article on his blog regarding using
regular expressions to match HTML. His article can be found at:

http://haacked.com/archive/2004/10/25/1471.aspx

Hope this helps.

--
Ben Lucas
Lead Developer
Solien Technology, Inc.
www.solien.com

Sriram Krishnan · Oct 26, 2004

Nice article, but it doesn't do what I want. His article is on how to match
tags; my doubt is how to match all the other non-tag content. And I really
don't want to use the HTML Agility Pack - I'm learning regex and want to
figure out how to do this. So what I'm looking for is something like "match
all the text that doesn't match that expression".

--
Sriram Krishnan

http://www.dotnetjunkies.com/weblog/sriram

Ben Lucas · Oct 26, 2004

Sriram,

I am not a Regular Expressions expert myself, but I ran this by Phil, the
author of the article I sent you. This was his response:

"Simplest option is to do a Regex.Replace with my expression and replace
with empty string. Then what you have left is non-tag content. Sometimes
the best use of Regexp is to match what you don't want and get rid of it."

--
Ben Lucas
Lead Developer
Solien Technology, Inc.
www.solien.com
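
As a rough illustration of this remove-what-you-don't-want approach, here is a sketch in Python, where `re.sub` plays the role of `Regex.Replace`. The pattern `<[^>]*>` is a simplified stand-in for the tag expression from the article, not the article's actual expression, and `strip_tags` is a hypothetical helper name.

```python
import re

# Simplified stand-in for a tag-matching expression (assumption): "<", then
# anything that is not ">", then ">". It ignores comments, scripts and any
# ">" that appears inside an attribute value.
TAG = re.compile(r"<[^>]*>")

def strip_tags(html):
    """Replace every tag with the empty string, leaving the non-tag text."""
    return TAG.sub("", html)

print(strip_tags('<p>Hello <a href="">world</a>!</p>'))
```

Nested tags are not a problem here, since each tag is removed on its own and nothing has to be paired up.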

Sriram Krishnan · Oct 26, 2004

Thanks

Sometimes the answer stares you in the face and you don't
bother to see it. I'll use this for now.

Out of academic interest, is there a way to do that, i.e. to say 'match all
text that doesn't match this pattern'?

Thanks once again

--
Sriram Krishnan

http://www.dotnetjunkies.com/weblog/sriram

Guest · Oct 26, 2004

Hi, This is Phil.

The problem with trying to specifically match HTML content is that HTML is a
nested structure. So imagine the following HTML string:

textblah

Trying to match text there is very difficult. Regular expressions aren't
well suited for matching nested structures. Microsoft did innovate a syntax
for matching nested structures in .NET (balancing group definitions), but
it's not standard regex (not that anything is standard) and is difficult to
understand.

If you try a negation approach, you can try using negative lookahead.
Something like:

(?!expression).*

This basically says look ahead without consuming any characters and if the
following sequence does NOT match the expression, then match any characters.
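
A small sketch of that behaviour, using Python's `re` module as a stand-in for .NET (an assumption; the lookahead syntax is the same in both). `re.match` tries the pattern at the start of the string only, which corresponds to the description above. The literal `exp` stands in for the expression.

```python
import re

PATTERN = re.compile(r"(?!exp).*")

def match_at_start(text):
    """Try the pattern at index 0 only; return the matched text or None."""
    m = PATTERN.match(text)
    return m.group() if m else None

blocked = match_at_start("exp and more")   # lookahead sees "exp": no match here
allowed = match_at_start("other text")     # lookahead passes, .* takes the rest
print(blocked, allowed)
```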

In fact, one thing I just thought of to try is to match:

(?!HtmlTagRegularExpression).*?

If that works, let me know.

Sriram Krishnan · Oct 26, 2004

In fact, one thing I just thought of to try is to match:

(?!HtmlTagRegularExpression).*?

Already tried that :-)

But unfortunately it doesn't seem to work.

Guest · Oct 26, 2004

Yeah. As I was driving home I realized that it wouldn't work, but it's too
late.

Roughly what that is doing is checking at every character to make sure it's
not followed by an HTML tag. However, if you're inside an HTML tag, then the
negative lookahead wouldn't stop you from continuing to match. I'll think
about this more.
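
A sketch of that failure mode, in Python's `re` module (assumed here in place of .NET), with the simplified tag pattern `<[^>]*>` standing in for HtmlTagRegularExpression. The lookahead is applied before every character, which is one way to read the description above.

```python
import re

# Consume one character at a time, as long as a tag does not start right here.
NOT_AT_TAG = re.compile(r"(?:(?!<[^>]*>).)+")

matches = NOT_AT_TAG.findall("<b>text</b>")
print(matches)
```

The lookahead only rejects the position of each `<`. One character later the engine is already inside the tag, nothing tag-like starts there, and matching carries on, so pieces of the tags end up in the results.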

I've written an HTML parser using regexes by keeping track of the index of
each match and then grabbing everything between matches. However, in
general, using something like the HTML Agility Pack is a better solution as
XPath is great at finding specific nodes.
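
The index-tracking idea can be sketched roughly as follows. This is Python rather than .NET and is not Phil's parser; the tag pattern `<[^>]*>` is a simplified assumption and `text_between_tags` is a hypothetical name.

```python
import re

# Simplified tag pattern (assumption); a real page needs something sturdier.
TAG = re.compile(r"<[^>]*>")

def text_between_tags(html):
    """Collect the stretches of text that lie between consecutive tag matches."""
    pieces = []
    last_end = 0  # index just past the previous tag match
    for m in TAG.finditer(html):
        if m.start() > last_end:
            pieces.append(html[last_end:m.start()])
        last_end = m.end()
    if last_end < len(html):
        pieces.append(html[last_end:])  # text after the final tag
    return pieces

print(text_between_tags("<p>Hello <b>world</b>!</p>"))
```

In effect this is 'match everything that does not match the pattern', done in code around the regex instead of inside it.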

But for regex learning purposes, you picked a great problem.

Guest · Nov 29, 2004

Hi,
I want to match all strings not containing a specific expression, so when I
found your reply and the regex "(?!expression).*" I thought: great, this is
what I want. But I can't get it to work.
It matches everything as far as I can see. I tried the following VB.NET
code:

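' Assumes Imports System.Text.RegularExpressions and System.Diagnostics
Dim TestRegex As Regex = New Regex("(?!exp).*")
Dim S As String = "exp"
Dim M As Match = TestRegex.Match(S)

' Report whether the pattern matched anywhere in S
If M.Success Then
    Debug.Write("Match")
Else
    Debug.Write("No Match")
End If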

Robby · Dec 1, 2004

I was waiting for someone else to answer this but it has been long enough
with no response so here is my try.

If you want to match an entire string then you must specify that in the
regular expression. Your expression "(?!exp).*" has no bounds in the string,
so it starts at the beginning of the string and looks ahead for "exp". Below,
"|" marks the current position of the regex search.

"|exp some other stuff" - match fails since it sees "exp" ahead
If "exp" is there, the match fails, the index is incremented by 1, and
it looks ahead for "exp" again.
"e|xp some other stuff" - match successful since it sees "xp " ahead
"xp some other stuff" - returned value (.*)

Now look at this again with "exp" not at the start of the string.

The regex starts at the beginning and looks ahead for "exp".
"|some other exp stuff" - match successful since it sees "som" ahead
"some other exp stuff" - returned value (.*)

So you see, you cannot use "(?!exp).*" to match strings that do not contain
"exp".

Hope this helps

--Robby
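
The walkthrough above can be reproduced with a short sketch, written here in Python's `re` module rather than VB.NET (an assumption; the lookahead behaves the same way). The second pattern, `^(?!.*exp).*$`, is one common way to add the missing bounds. It is not taken from the thread.

```python
import re

unanchored = re.compile(r"(?!exp).*")
# Anchored variant (assumption): from the start of the string, "exp" must not
# occur anywhere ahead, and the match must run to the end.
anchored = re.compile(r"^(?!.*exp).*$")

def found(pattern, text):
    """Return the text of the first match found by scanning, or None."""
    m = pattern.search(text)
    return m.group() if m else None

print(found(unanchored, "exp some other stuff"))   # search moves on to index 1
print(found(unanchored, "some other exp stuff"))   # lookahead passes at index 0
print(found(anchored, "some other exp stuff"))     # rejected: "exp" lies ahead
print(found(anchored, "some other stuff"))         # accepted: no "exp" anywhere
```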
