Regex question - alternation

  • Thread starter Thread starter Jon Shemitz
  • Start date Start date
J

Jon Shemitz

I have a regex that matches nested parens (it's a slightly rewritten
version of the one in Friedl's "Mastering Regular Expressions, 2e"):

#[IgnorePatternWhitespace]
\( # a literal (

(?: # non-capture group
\( (?<Stack>) # on nested (, push empty capture
| \) (?<-Stack>) # on nested ), pop empty capture
| [^()] # anything except ( or )
)* # any number of chars between parens

(?(Stack) # if stack not empty:
^ # then, match beginning of string (ie, fail)
| \) ) # else, match literal )

Now, this works just fine, even on empty parens "()". What I don't
understand is why I have to use /[^()]/ instead of "." in the clause
that matches anything except parens.

As I understand alternation, the left alternative is matched first. If
it matches, the right alternative is skipped. (This certainly seems to
explain the capture behavior when I match /that|th([a-z])t/ against
"that thought": on the first Match, the second Group fails to match;
while on the second Match, the second Group matches "ough".)

Why, then, doesn't /./ work instead of /[^()]/? Since a left or right
paren should have already been matched ....
 
I think you are right if the expression you are matching is well formed.
Then the parens should be matched first as they are left in the alternation.
However, if you have a text like "(()" which shouldn't be matched at all,
the regex engine would probably give the second "("'s to a "." if that was
in the alternation.

Hope this is in any way understandable...

Niki
 
I think you are right if the expression you are matching is well formed.
Then the parens should be matched first as they are left in the alternation.
However, if you have a text like "(()" which shouldn't be matched at all,
the regex engine would probably give the second "("'s to a "." if that was
in the alternation.

Hope this is in any way understandable...

I'm afraid not. The regex matches the last two characters of "(()" -
the 1st character matches the literal /\(/; the 2nd character nests;
the 3rd pops. The regex is not satisfied, so it tries again, starting
at the 2nd character. The literal /\(/ matches; the *-ed non-capture
group doesn't match; the Stack is empty, so the literal /\)/ matches.
#[IgnorePatternWhitespace]
\( # a literal (

(?: # non-capture group
\( (?<Stack>) # on nested (, push empty capture
| \) (?<-Stack>) # on nested ), pop empty capture
| [^()] # anything except ( or )
)* # any number of chars between parens

(?(Stack) # if stack not empty:
^ # then, match beginning of string (ie, fail)
| \) ) # else, match literal )
 
Jon Shemitz said:
...
I'm afraid not. The regex matches the last two characters of "(()" -
the 1st character matches the literal /\(/; the 2nd character nests;
the 3rd pops. The regex is not satisfied, so it tries again, starting
at the 2nd character.

Before stepping to the next character, it will first try to match the next
item in the alternation. If you put "." in the alternation, unmatched parens
will be matched by it.
The literal /\(/ matches; the *-ed non-capture
group doesn't match; the Stack is empty, so the literal /\)/ matches.
#[IgnorePatternWhitespace]
\( # a literal (

(?: # non-capture group
\( (?<Stack>) # on nested (, push empty capture
| \) (?<-Stack>) # on nested ), pop empty capture
| [^()] # anything except ( or )

If you replace this "[^()]" with ".", the regex will match "(()". (I've
tried it in Expresso)

Did I get your question wrong? Do you get different results?

Niki
 
Niki said:
Before stepping to the next character, it will first try to match the next
item in the alternation.

This seems to be at the root of my issue. I am **under the
impression** that /a|b/ acts much like

(ThisChar == 'a') || (ThisChar == 'b')

- if the left side matches, the right side is ignored. (In the op, I
cited | matches that **do** seem to work that way.) Obviously there's
something wrong with my understanding, but I don't really understand
what you're saying here: why would the engine try to match the right
hand side of an alternation when the left side already matched?
If you replace this "[^()]" with ".", the regex will match "(()". (I've
tried it in Expresso)
Did I get your question wrong? Do you get different results?

No, I get the same results. What I don't understand is **why**. I
appreciate your replies, but they're not (yet?) helping me.
 
I think I see your problem now. As far as I remember, Friedl explained
backtracking quite well in the book you mentioned (don't have it here right
now, probably smoewhere in the NFA-chapter). I'll try to sum it up:
Consider the regex /col(o|ou)r/, let it match "colour". Step by step:
/col/ matches "col".
Next is the alternation /(o|ou|)/: as you would expect, the /o/ which is
left, matches first. However, as the alternation hasn't been completely
evaluated, the regex engine stores backtracking information - if the whole
expression fails to match, it will try the other elements of the
alternation.
The next item after the alternation is /r/, however, the next character in
the string at this point is "ur", so the expression would fail. Now, the
engine back-tracks: it tries the next possibility - which is the /ou/ in the
alternation - match. After that, the final /r/ can also be found - match.
Something quite similar happens if you match for /c.*r/ - on the first
attempt, the /.*/ part will "eat up" *all* the characters to the end of the
string, and the /r/ after it can't be matched any more. So the engine will
back-track every single character until it can match the /r/.

Has this been of more help?

Niki

Jon Shemitz said:
Niki said:
Before stepping to the next character, it will first try to match the next
item in the alternation.

This seems to be at the root of my issue. I am **under the
impression** that /a|b/ acts much like

(ThisChar == 'a') || (ThisChar == 'b')

- if the left side matches, the right side is ignored. (In the op, I
cited | matches that **do** seem to work that way.) Obviously there's
something wrong with my understanding, but I don't really understand
what you're saying here: why would the engine try to match the right
hand side of an alternation when the left side already matched?
If you replace this "[^()]" with ".", the regex will match "(()". (I've
tried it in Expresso)
Did I get your question wrong? Do you get different results?

No, I get the same results. What I don't understand is **why**. I
appreciate your replies, but they're not (yet?) helping me.
 
Niki said:
Next is the alternation /(o|ou|)/: as you would expect, the /o/ which is
left, matches first. However, as the alternation hasn't been completely
evaluated, the regex engine stores backtracking information - if the whole
expression fails to match, it will try the other elements of the
alternation.

Bingo! Thanks!
 
Back
Top