Regex question


Du Dang

Text:
=====================
<script1>
***stuff A
</script1>

***more stuff

<script2>
***stuff B
</script2>

=====================

Regex:
<script>[\s\S]+</script>

I use "[\s\S]" instead of "." because there are newline characters within the text.


The regex above gives me a single match running from <script1> to </script2>
instead of two separate matches.

How do I extract <script1> ... </script1> and <script2> ... </script2> as
separate matches?

Thanks,

Du
 
Du,

You can use the "." character to match a newline if you use the
RegexOptions.Singleline option.

Try this:


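using System;
using System.Text.RegularExpressions;

string inputText = @"
<script1>
***stuff A
</script1>

***more stuff

<script2>
***stuff B
</script2>";

// .*? is a non-greedy match, so each match ends at the first closing tag.
string regex = @"<script\d>(?<contents>.*?)</script\d>";

// Singleline lets "." match newline characters as well.
MatchCollection mc = Regex.Matches(inputText, regex,
    RegexOptions.Singleline |
    RegexOptions.IgnoreCase |
    RegexOptions.IgnorePatternWhitespace);

// Print the text captured between each pair of tags.
foreach (Match m in mc)
    Console.WriteLine(m.Groups["contents"].Value);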


Hope this helps.

Chris.
 
In addition, you can use a named backreference to make sure that you don't
match anything like "<script1>....</script2>":

<script(?<num>\d+)>(?<contents>.*?)</script\k<num>>
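
To illustrate how the backreference changes the result, here is a minimal sketch that applies the pattern above. The names ScriptBlocks and Extract are made up for this example, and the Trim() call is an addition for tidier output rather than part of the suggestion.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class ScriptBlocks
{
    // \k<num> must repeat the digits captured by (?<num>\d+), so
    // <script1> can only be closed by </script1>.
    private const string Pattern =
        @"<script(?<num>\d+)>(?<contents>.*?)</script\k<num>>";

    // Hypothetical helper: returns the contents of every block whose
    // opening and closing tags carry the same number.
    public static List<string> Extract(string input)
    {
        var results = new List<string>();
        foreach (Match m in Regex.Matches(input, Pattern,
                     RegexOptions.Singleline | RegexOptions.IgnoreCase))
        {
            // Trim() only strips the newlines around the captured text.
            results.Add(m.Groups["contents"].Value.Trim());
        }
        return results;
    }
}

public static class BackreferenceDemo
{
    public static void Main()
    {
        string input =
            "<script1>\n***stuff A\n</script1>\n\n***more stuff\n\n" +
            "<script2>\n***stuff B\n</script2>";

        // Prints "***stuff A" and then "***stuff B".
        foreach (string contents in ScriptBlocks.Extract(input))
            Console.WriteLine(contents);
    }
}
```

With an input such as "<script1>a</script2>b<script2>c</script2>", this pattern skips the mismatched first pair and returns only "c", whereas a pattern that uses \d for both tags would also return "a".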


Brian Davis
http://www.knowdotnet.com



 
Thanks Chris, it works like a charm.

//(?<contents>.*?)
One thing I don't understand: why is the second question mark there?
My understanding is that the syntax for naming a group is (?<name_here>expression_here).

I tried removing the second question mark and the expression stopped working.

Thanks again for your help,

Du

 
Hi Brian, thanks for helping out!!!

Regards,

Du

 

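Quantifiers like + and * are "greedy": they match as many characters as
they can. A question mark placed directly after a quantifier makes it
non-greedy, so it matches the minimum number of characters required for a
successful match. In (?<contents>.*?), the first question mark is part of
the named-group syntax, and the second one makes .* non-greedy.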
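
As a rough illustration of the difference, the sketch below runs the same pattern with and without the second question mark. The input is shortened to a single line, and the names GreedyVersusLazy and FirstContents are made up for this example.

```csharp
using System;
using System.Text.RegularExpressions;

public static class GreedyVersusLazy
{
    // Hypothetical helper: returns the text captured by the "contents"
    // group of the first match, or an empty string if nothing matches.
    public static string FirstContents(string input, string pattern)
    {
        Match m = Regex.Match(input, pattern, RegexOptions.Singleline);
        return m.Success ? m.Groups["contents"].Value : string.Empty;
    }
}

public static class GreedyDemo
{
    public static void Main()
    {
        string input = "<script1>A</script1> more <script2>B</script2>";

        // Greedy: .* consumes the rest of the input, then backtracks
        // until it finds the LAST closing tag.
        // Prints: A</script1> more <script2>B
        Console.WriteLine(GreedyVersusLazy.FirstContents(
            input, @"<script\d>(?<contents>.*)</script\d>"));

        // Non-greedy: .*? grows one character at a time and stops
        // at the FIRST closing tag.
        // Prints: A
        Console.WriteLine(GreedyVersusLazy.FirstContents(
            input, @"<script\d>(?<contents>.*?)</script\d>"));
    }
}
```

This is why removing the second question mark turns the two expected matches into one long match that spans both blocks.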

A utility like Expresso
(http://www12.brinkster.com/ultrapico/Expresso.htm) can help in
understanding how greedy and non-greedy quantifiers behave.

Hope this helps.

Chris.
 