Regular expressions question

  • Thread starter: Ioannis Vranos

Ioannis Vranos

What is the difference among the following? Please correct me if I am
wrong (this is not homework, I am just checking System::Regex these
days and have not figured out everything yet).


"a*": As far as I know it means nothing (=it matches even the empty
string ""), or a string consisting of character 'a' followed by 0 or
more characters(?).


"a+": It means a string (=up to the first whitespace as is the case of
all the regular expressions) consisting of character 'a' followed by 0
or more characters.

"a.*": Character 'a' followed by one and 0 or more characters.

"a.+": Character 'a' followed by one and 0 or more characters (same
effect as the above?).


"a.": Character 'a' followed by one character.
 
Ioannis said:
What is the difference among the following? Please correct me if I am
wrong (this is not homework, I am just checking System::Regex these
days and have not figured out everything yet).


"a*": As far as I know it means nothing (=it matches even the empty
string ""), or a string consisting of character 'a' followed by 0 or
more characters(?).

It means 0 or more "a". So it matches the empty string, "a", "aaaaa", but
not "bced".

Ioannis said:
"a+": It means a string (=up to the first whitespace as is the case of
all the regular expressions)

Why? A regex can contain whitespace!

Ioannis said:
consisting of character 'a' followed by 0
or more characters.

It means one or more instances of "a". So it matches "a", "aaaa", but
neither the empty string nor "vfef".

Ioannis said:
"a.*": Character 'a' followed by one and 0 or more characters.

"a" followed by 0 or more characters of any kind.

Ioannis said:
"a.+": Character 'a' followed by one and 0 or more characters (same
effect as the above?).

"a" followed by 1 or more characters of any kind.

Ioannis said:
"a.": Character 'a' followed by one character.

Yes, that's right.

Arnaud
MVP - VC
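
The five patterns discussed above can be compared side by side. The thread is about .NET's Regex class; the sketch below swaps in Python's re module purely for illustration, on the assumption that these basic quantifiers behave the same way in both engines. The sample strings and the helper name whole_string_matches are made up for this example. re.fullmatch requires the whole string to match, which is the reading Arnaud uses in his answers.

```python
import re

# Patterns from the question, each tested against whole strings with
# re.fullmatch (the entire string must match, not just a part of it).
PATTERNS = ["a*", "a+", "a.*", "a.+", "a."]

# Hypothetical sample inputs, chosen to separate the five patterns.
SAMPLES = ["", "a", "aaaa", "ab", "a b c", "bced"]


def whole_string_matches(pattern, samples=SAMPLES):
    """Return the samples that the pattern matches in their entirety."""
    return [s for s in samples if re.fullmatch(pattern, s)]


for pattern in PATTERNS:
    print(f"{pattern!r:7} fully matches {whole_string_matches(pattern)}")
```

Note that "a.*" accepts "a" on its own while "a.+" does not, so the two are not the same, and that "." happily matches the spaces in "a b c".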
 
Arnaud said:
It means 0 or more "a". So it matches the empty string, "a", "aaaaa", but
not "bced".


However, under VC++ 2005 Express February 2005 CTP we get the following for this code:


// This is the main project file for VC++ application project
// generated using an Application Wizard.

#include "stdafx.h"

using namespace System;

int main()
{
    using namespace System::Text::RegularExpressions;

    String ^s = "bcdefghij";

    Console::WriteLine(Regex::IsMatch(s, "a*"));
}


True
Press any key to continue . . .


Arnaud said:
Why? A regex can contain whitespace!


What I mean is that whitespace is considered a different character class
from alphabetic characters.
 
Ioannis said:
However under VC++ 2005 Express February 2005 CTP we get for the code:

int main()
{
    using namespace System::Text::RegularExpressions;

    String ^s = "bcdefghij";

    Console::WriteLine(Regex::IsMatch(s, "a*"));
}

Yes: I said that the regex matches an empty string, so here you match
the empty string at the beginning of "bcdefghij".
What you failed to see is that the IsMatch method tries to find a match
inside the given string; it doesn't check that the full string is
matched. Use the Regex.Match method to get the Match object: you'll
see that it matches an empty string (length=0) at index 0 of the
input string.

If you use Regex.Matches (to get all the matches), you'll see that in
fact it finds a 0-length match at each position of the input string, so
you get 10 matches!

Ioannis said:
What I mean is that whitespaces are considered another character class
from alphabetic characters.

Yes, but "." matches any character except a newline, so it matches
whitespace such as spaces and tabs.

Arnaud
MVP - VC
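
The difference between "is there a match somewhere" and "does the whole string match" can be checked directly. This is a sketch in Python's re module rather than .NET, on the assumption that re.search plays the role of Regex.IsMatch / Regex.Match and re.finditer the role of Regex.Matches; the variable names are made up for the example.

```python
import re

s = "bcdefghij"

# re.search looks for a match anywhere in the string, which is how
# IsMatch is described above.
first = re.search("a*", s)

# A match is found, but it is the empty string at index 0.
first_span = first.span()

# One zero-length match per position: 9 characters give 10 positions.
all_spans = [m.span() for m in re.finditer("a*", s)]

# Requiring the whole string to match fails, with or without explicit anchors.
whole = re.fullmatch("a*", s)
anchored = re.search("^a*$", s)

print("first match span:", first_span)
print("number of matches:", len(all_spans))
print("whole-string match:", whole)
print("anchored match:", anchored)
```

So the "True" printed by the program in the question comes from a zero-length match, not from the pattern consuming any of "bcdefghij".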
 
Arnaud said:
Yes: I said that the regex matches an empty string, so here you match
the empty string at the beginning of "bcdefghij".


I am not sure I understood this. There is no empty string in there.


Arnaud said:
What you failed to see is that the IsMatch method tries to find a match
inside the given string; it doesn't check that the full string is
matched.


If I wanted the entire string to be matched, shouldn't I use
Regex::IsMatch(s, "^a*$")?


Arnaud said:
Use the Regex.Match method to get the Match object: you'll
see that it matches an empty string (length=0) at index 0 of the
input string.

If you use Regex.Matches (to get all the matches), you'll see that in
fact it finds a 0-length match at each position of the input string, so
you get 10 matches!


So in essence it matches everything and is equivalent to
Regex::IsMatch(s, ".*")?


BTW why does Regex::IsMatch(s, "*") crash?


Unhandled Exception: System.ArgumentException: parsing "*" - Quantifier {x,y} following nothing.
   at System.Text.RegularExpressions.RegexParser.ScanRegex()
   at System.Text.RegularExpressions.RegexParser.Parse(String re, RegexOptions op)
   at System.Text.RegularExpressions.Regex..ctor(String pattern, RegexOptions options, Boolean useCache)
   at System.Text.RegularExpressions.Regex.IsMatch(String input, String pattern)
   at main() in c:\documents and settings\administrator\my documents\visual studio\projects\test\test\test.cpp:line 14
Press any key to continue . . .


Arnaud said:
Yes, but "." matches any character except a newline, so it matches
whitespace such as spaces and tabs.

Thanks, I did not know that.
 
Ioannis said:
I am not sure I understood this. There is no empty string in there.

The empty string is a substring of every string: an n-character string
contains it at n + 1 different positions (index 0 through index n).

Yes, IsMatch sees if any substring of the string matches the regex.

Ioannis said:
If I wanted the entire string to be matched, shouldn't I use
Regex::IsMatch(s, "^a*$")?

Yes.




Ioannis said:
So in essence it matches everything and is equivalent to
Regex::IsMatch(s, ".*")?

"a*"? For IsMatch, yes, they are equivalent, but as regexes they are
not. If you have the string:

"abaabb"

then, counting every substring each pattern is able to match (not the
non-overlapping matches that a left-to-right scan such as Regex::Matches
reports), ".*" will match:
"" 7x
"a" 3x
"ab" 2x
"aba" 1x
"abaa" 1x
etc.

whereas
"a*" will match:
"" 7x
"a" 3x
"aa" 1x

Matching isn't just a yes/no (unless you use IsMatch) - the regex
matches against some substring of the string.

Ioannis said:
BTW why does Regex::IsMatch(s, "*") crash?

"*" is not a valid Regex. 0 to many of what? Similarly "+" and "{0,4}"
are not valid.

(apologies for any misinformation, regexp is not a major area of
expertise for me)

Tom
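
To see what an engine actually reports for "abaabb", as opposed to every substring a pattern could match in principle, the matches can be listed. This is a sketch in Python's re module (3.7 or later assumed, since the handling of empty matches changed in that release); .NET's Regex::Matches is expected to behave comparably for these patterns but was not run here. The helper name matches is made up for the example.

```python
import re

s = "abaabb"


def matches(pattern, text):
    """List the text of each non-overlapping match, scanning left to right."""
    return [m.group() for m in re.finditer(pattern, text)]


# ".*" is greedy: it takes the whole string, then one empty match at the end.
dot_star = matches(".*", s)

# "a*" takes each run of 'a' and reports an empty match wherever it
# cannot consume anything.
a_star = matches("a*", s)

# As a plain yes/no question, both patterns succeed on any input.
both_succeed = bool(re.search(".*", s)) and bool(re.search("a*", s))

print(".*  ->", dot_star)
print("a*  ->", a_star)
print("both succeed:", both_succeed)
```

The two patterns agree on the yes/no answer but differ in which substrings they pick out, which is the point made above.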
 
Ioannis said:
I am not sure I understood this. There is no empty string in there.

Yes, there are many! There is an empty string at index 0, another at index
1, another at index 2, etc. This is true of any string.

Ioannis said:
If I wanted the entire string to be matched, shouldn't I use
Regex::IsMatch(s, "^a*$")?

Yes, but this is a rather useless regex (as is "a*"): a regex that matches
the empty string doesn't make much sense, unless you filter the matches
afterwards: say, keep only matches more than x characters long. But in that
case, you'd better write a regex that does this filtering directly.

Ioannis said:
So in essence it matches everything and is equivalent to
Regex::IsMatch(s, ".*")?

As Tom explained, it is a bit more complex. To experiment, I
suggest you display all the Matches from both regexes on a given input
string.

Ioannis said:
BTW why does Regex::IsMatch(s, "*") crash?


Unhandled Exception: System.ArgumentException: parsing "*" - Quantifier {x,y} following nothing.

The error description seems quite clear, no? "*" is a quantifier: it
specifies "0 to n instances of the token before it". There is nothing
before the quantifier in your regex, so it is an invalid regex.

Arnaud
MVP - VC
 
Yes, "*" is an illegal regular expression; it represents nothing. These
quantifiers, as Tom and Arnaud said, must follow something. * matches
0 to n occurrences of the preceding item (a character, a character group,
or a grouped string). A bare * is meaningless, literally "illegal". The
same goes for ?, + and {} when they do not follow anything.
"*a" is also wrong.
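
A quantifier with nothing in front of it is rejected when the pattern is compiled, before any input is examined. The sketch below shows this with Python's re module, which raises re.error where .NET throws the ArgumentException quoted earlier in the thread; the helper name is_valid_pattern is made up for the example.

```python
import re


def is_valid_pattern(pattern):
    """Return True if the pattern compiles, False if the engine rejects it."""
    try:
        re.compile(pattern)
    except re.error:
        return False
    return True


# Quantifiers with no preceding item are rejected; the same quantifiers
# following an item are accepted.
for pattern in ["*", "+", "?", "*a", "a*", "a+", ".*"]:
    print(f"{pattern!r:5} valid: {is_valid_pattern(pattern)}")
```

Wrapping the compile step like this is one way to report a bad pattern cleanly instead of letting the exception end the program.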
 