RegEx substring

  • Thread starter: bryanmig

bryanmig

OK, I am new to RegEx, and what I am trying to do is find a substring.

I have a string that constantly changes. This string is pulled from an
Atom feed from a blog. I need to strip the HTML formatting from this
string and just grab the inner text.

If this is my string: "<div>Hello my name is bryan and I am learning
regex!</div>"

I need to be able to grab just what is between <div> and </div>.

I thought this would work, but the match still includes the div tags
themselves:

Regex: <div>.*?</div>

How can I modify this expression to exclude the div tags?


Thanks
Bryan
 
<[^>]*>

This matches all HTML markup. It is the opposite of what you want. If you
remove all text matched by this regular expression, what's left over is what
you want.
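
To illustrate the "remove everything the pattern matches" idea, here is a minimal Python sketch. It assumes the plain tag pattern `<[^>]*>` (a simplification: it does not handle comments or a '>' inside an attribute value), and `strip_tags` is a hypothetical helper name, not something from this thread.

```python
import re

# Simplified tag pattern (an assumption for this sketch): "<", then any
# characters other than ">", then the closing ">".
TAG = re.compile(r"<[^>]*>")


def strip_tags(html: str) -> str:
    """Remove everything that looks like a tag; keep the text between tags."""
    return TAG.sub("", html)


sample = "<div>Hello my name is <b>bryan</b> and I am learning regex!</div>"
print(strip_tags(sample))
# Hello my name is bryan and I am learning regex!
```

Note that this removes every tag, including inline formatting such as <b>, so it only fits the case where plain text is all that is wanted.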

--
HTH,

Kevin Spencer
Microsoft MVP
Chicken Salad Surgery

It takes a tough man to make a tender chicken salad.
 
This may not help me, because the text I am parsing is code from a blog
and is likely to include formatting tags. I want to keep all the
formatting markup, whether it be styles, fonts, line breaks, etc. I
just need to eliminate the first div and the last div.


 
Not a problem.

(?<=<div[^>]*>).*?(?=</div>)

I'll explain:

This uses a positive lookbehind and a positive lookahead. Both are
zero-width assertions: they require that the match be preceded by (lookbehind)
or followed by (lookahead) a certain pattern, but the text they match is not
part of the match itself. So only the text between the two tags is returned.

In addition, a div may have attributes, so I added an expression to the
lookbehind, indicating that the opening div tag can contain any characters
other than '>' before its closing '>' character.
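
For readers who want to try the lookaround idea outside .NET, here is a hedged Python sketch. The pattern above puts `[^>]*` inside the lookbehind, which needs an engine that allows variable-length lookbehind (.NET does). Python's `re` module only accepts fixed-width lookbehind, so the first pattern below is limited to a bare `<div>`, and the second swaps in a capturing group to allow attributes. The function names and the `styled` sample string are made up for illustration.

```python
import re
from typing import Optional

# Fixed-width lookbehind: accepted by Python only because "<div>" has a
# fixed length. DOTALL lets "." cross line breaks in the content.
LOOKAROUND = re.compile(r"(?<=<div>).*?(?=</div>)", re.DOTALL)

# Capturing-group variant, swapped in for the variable-width lookbehind:
# the opening tag may carry attributes, and group 1 holds the inner markup.
GROUP = re.compile(r"<div[^>]*>(.*?)</div>", re.DOTALL)


def inner_lookaround(html: str) -> Optional[str]:
    """Return the text between <div> and </div>, or None if there is no match."""
    m = LOOKAROUND.search(html)
    return m.group(0) if m else None


def inner_group(html: str) -> Optional[str]:
    """Return the markup inside the first div, keeping any inner tags."""
    m = GROUP.search(html)
    return m.group(1) if m else None


plain = "<div>Hello my name is bryan and I am learning regex!</div>"
styled = '<div class="post">Hello <b>bryan</b><br/>bye</div>'

print(inner_lookaround(plain))  # Hello my name is bryan and I am learning regex!
print(inner_group(styled))      # Hello <b>bryan</b><br/>bye
```

Both patterns are lazy, so they stop at the first </div>. If the content itself contains nested divs, a greedy `.*` or a real HTML parser would be a better fit.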

--
HTH,

Kevin Spencer
Microsoft MVP
Chicken Salad Surgery

What You Seek Is What You Get.

 
Thanks a million, Kevin

That line of code was golden!
I appreciate your time and effort very much.

Thanks again,
Bryan
http://www.staga.net
