Regular Expression and Multiple Group Captures

Amy L.

I am having a hard time figuring out why this regular expression does not
have multiple captures for the group. When checking the regular expression
in a testing tool like "Expresso" it seems to work fine.

Input (All on one line - watch for wordwrap):
Student Results [weight=103]: SMITH=PASS JONES=WARN WRIGHT=WARN JOHNSON=WARN

Regular Expression:
(?<studentname>([\w-!\[\].,;:?/!@#$%^&*()<>{}|\~`'"=+-]+))=\w+(?:\s+|$)

Expected Output: A Group with multiple captures of: "SMITH", "JONES",
"WRIGHT".

Code I was using:

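// Requires: using System; using System.Text.RegularExpressions;
Regex myRegexTest = new Regex(
@"(?<studentname>([\w-!\[\].,;:?/!@#$%^&*()<>{}|\~`'""=+-]+))=\w+(?:\s+|$)",
RegexOptions.IgnoreCase | RegexOptions.Compiled ) ;
// Match returns only the first match; sText holds the input line shown above.
Match m = myRegexTest.Match( sText.ToString() ) ;
// Groups[0] is the whole match, Groups[1] the unnamed inner group, and
// Groups[2] the named group "studentname" (named groups are numbered last).
Console.WriteLine( "Groups Count: " + m.Groups.Count ) ;
Console.WriteLine( "Groups Capture 0: " + m.Groups[0].Captures.Count ) ;
Console.WriteLine( "Groups Capture 1: " + m.Groups[1].Captures.Count ) ;
Console.WriteLine( "Groups Capture 2: " + m.Groups[2].Captures.Count ) ;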

When I look at the output I get 3 groups, each with one capture. When I look
at what's captured I always end up with just "SMITH" and never the other
names.

Any help would be greatly appreciated.
Amy.
 
Amy said:
I am having a hard time figuring out why this regular expression does not
have multiple captures for the group. When checking the regular expression
in a testing tool like "Expresso" it seems to work fine.

Input (All on one line - watch for wordwrap):
Student Results [weight=103]: SMITH=PASS JONES=WARN WRIGHT=WARN JOHNSON=WARN

Regular Expression:
(?<studentname>([\w-!\[\].,;:?/!@#$%^&*()<>{}|\~`'"=+-]+))=\w+(?:\s+|$)

At a quick glance, the problem is that the "studentname" group doesn't
have a quantifier behind it. You can't get multiple captures without a
repeating quantifier (*, +, {n,m}) behind the group or behind a group
that encloses it; with ? you get at most one capture.

What you do get, currently, and what a regular expression tool might
show you, are multiple matches. The complete expression is matched more
than once to the input string and each of these matches has its own
"studentname" group, that's what you are probably seeing.

Now, two choices: Either you just evaluate the various matches in your
code (use the Matches method instead of Match to retrieve them all) or
you rewrite the expression to include a quantified group so that you'll
actually get multiple captures. In a simple case, like this:

Student\sResults.*?\:\s*(?<assignment>(?<studentname>[\w-!\[\].,;:?/!@#$%^&*()<>{}|\~`'"=+-]+)=\w+(?:\s+|$))*

This should give you two named groups "assignment" and "studentname",
each of which has multiple captures. Hope this helps!
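
For comparison, here is a minimal sketch of both choices side by side. It is not the code from this thread: the character class is shortened to [\w-]+ purely for readability (an assumption; the original class allows far more punctuation), and the class and variable names are made up for illustration. With the sample input, both loops should list SMITH, JONES, WRIGHT and JOHNSON.

```csharp
using System;
using System.Text.RegularExpressions;

class CaptureDemo
{
    const string Input =
        "Student Results [weight=103]: SMITH=PASS JONES=WARN WRIGHT=WARN JOHNSON=WARN";

    static void Main()
    {
        // Choice 1: the group is not quantified, so each match holds exactly
        // one capture. Matches() returns every match; iterate over those.
        // Note: [\w-]+ is a simplified stand-in for the thread's character class.
        Regex perMatch = new Regex(@"(?<studentname>[\w-]+)=\w+(?:\s+|$)");
        foreach (Match m in perMatch.Matches(Input))
        {
            Console.WriteLine("match:   " + m.Groups["studentname"].Value);
        }

        // Choice 2: "studentname" sits inside the quantified "assignment" group,
        // so a single match records one capture per repetition.
        Regex oneMatch = new Regex(
            @"Student\sResults.*?:\s*(?<assignment>(?<studentname>[\w-]+)=\w+(?:\s+|$))*");
        Match single = oneMatch.Match(Input);
        foreach (Capture c in single.Groups["studentname"].Captures)
        {
            Console.WriteLine("capture: " + c.Value);
        }
    }
}
```

Note that Group.Value only returns the last capture of a group, so in the second case the Captures collection is the place to look.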


Oliver Sturm
 
Thank you so much. After testing your regular expression in the tool, I
see the difference between multiple matches and multiple captures.

Do you have an opinion on what is more efficient - iterating through
multiple matches or iterating through multiple captures under one group?

Amy.


 
Amy said:
Thank you so much. After testing your regular expression in the tool, I
see the difference between multiple matches and multiple captures.

Do you have an opinion on what is more efficient - iterating through
multiple matches or iterating through multiple captures under one group?

I'm willing to have an opinion, but I can't really think of one :-)

Generally I would think that finding multiple captures may involve less
overhead in the regular expression engine, because it's an intrinsic
part of the algorithm, while finding multiple matches involves running
the expression against the input multiple times. But then this depends
on the implementation details and quality of the engine, and even in the
case where multiple captures are found, additional runs are made anyway
to look for additional matches, even if none are found.

I'd say that a carefully implemented engine shouldn't show much of a
difference between the two, but I wouldn't be surprised if many engines
did actually show quite a difference, depending on the pattern, the
input and probably other parameters. Might be interesting to do some
tests here with the .NET implementation ...
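
A rough way to run such a test might look like the sketch below. Everything in it is an assumption made for illustration: the simplified [\w-]+ character class, the iteration count, and the class name are not from the thread, and a single Stopwatch loop is only a crude measurement, so the numbers would need checking against real data.

```csharp
using System;
using System.Diagnostics;
using System.Text.RegularExpressions;

class CaptureTiming
{
    static void Main()
    {
        const string input =
            "Student Results [weight=103]: SMITH=PASS JONES=WARN WRIGHT=WARN JOHNSON=WARN";
        const int iterations = 100000; // arbitrary; adjust to taste

        // Simplified stand-ins for the two patterns discussed in the thread.
        Regex perMatch = new Regex(
            @"(?<studentname>[\w-]+)=\w+(?:\s+|$)", RegexOptions.Compiled);
        Regex oneMatch = new Regex(
            @"Student\sResults.*?:\s*(?<assignment>(?<studentname>[\w-]+)=\w+(?:\s+|$))*",
            RegexOptions.Compiled);

        // Variant 1: iterate over multiple matches.
        int names = 0;
        Stopwatch sw = Stopwatch.StartNew();
        for (int i = 0; i < iterations; i++)
        {
            foreach (Match m in perMatch.Matches(input))
            {
                names += m.Groups["studentname"].Captures.Count;
            }
        }
        sw.Stop();
        Console.WriteLine("Matches:  " + sw.ElapsedMilliseconds + " ms, " + names + " names");

        // Variant 2: one match, iterate over multiple captures.
        names = 0;
        sw.Restart();
        for (int i = 0; i < iterations; i++)
        {
            Match single = oneMatch.Match(input);
            names += single.Groups["studentname"].Captures.Count;
        }
        sw.Stop();
        Console.WriteLine("Captures: " + sw.ElapsedMilliseconds + " ms, " + names + " names");
    }
}
```

Both variants count the same names, so the totals printed next to the timings double as a sanity check that the two patterns agree on the input.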


Oliver Sturm
 
clever code. If I ever inherit it, I will chuck it and replace it with a
simple set of parsing expressions.

Your code has a really high "bus factor." That means that if you are ever
hit by a bus, your team is screwed.

Just a heads-up.
--
--- Nick Malik [Microsoft]
MCSD, CFPS, Certified Scrummaster
http://blogs.msdn.com/nickmalik

Disclaimer: Opinions expressed in this forum are my own, and not
representative of my employer.
I do not answer questions on behalf of my employer. I'm just a
programmer helping programmers.
 
Nick,

I would have to agree with you - the original code was implemented using a
simpler method of splitting the string and grabbing what we needed.
However, our dataset consists of multiple files that are easily over a gig
each and when you have to process roughly 30 at a time it takes a bit of
time. We looked to see if regular expression parsing was faster than what
we had implemented to begin with. Long story short it was not :)

Amy.

 