Regex problems

Lloyd Sheen

I just bought Mastering Regular Expressions since I have had no luck with any
docs available on the web. (MSDN is totally useless in this area.)

So I am using the Expresso app as a workbench to learn with. Now the
problem occurs at the very beginning of the learning process.

First it appears that there is no "standard" for regex (or MS has their own
standard as usual).

The first expression causes a problem I cannot explain. It is a regex to find
doubled words. The expression is ([a-zA-Z]*) +\1
and the searched text is "This is an example of a double word word for an
example".

The result has the two values you would expect (I am not checking the
beginning of the word) plus each space as a match. My evaluation of the
expression is as follows:

1. ([a-zA-Z]*) should match any number of alpha characters
2. " +" should match one or more spaces
3. \1 should match what was found in the first step

I have no idea how those spaces show as matches. If space is from a-z or
A-Z I would be very surprised.

HELP

Lloyd Sheen
 
The double word expression should be:

\b([a-zA-Z]+) +\1\b

The original expression has two flaws. First, it does not make sure that the
first "word" is actually a whole word. When you run the expression on the
test string, you will find that it claims "is" is a repeated word. "is" is
not a repeated word in the string, but "This is" contains "is is". This
problem is solved by placing "\b" in front of the group and after the
backreference. The "\b" matches only at a word boundary, so the group can no
longer start at the "is" inside "This".

Second, you got the spaces with the original expression because "*" was used
instead of "+". Using "*" means any number, including zero. So there was a
match at every space in the test string that was not already part of a
doubled-word match (0 alpha characters, followed by a space, followed by 0
alpha characters).
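
To see both flaws side by side, here is a small sketch that runs the original
pattern, a version with only the "*" changed to "+", and the corrected
pattern against the test sentence. It uses Python's re module rather than
.NET or Expresso, which is my assumption for illustration; as far as I know
the two engines treat these simple patterns the same way. The names TEXT,
ORIGINAL, PLUS_ONLY, FIXED and find_matches are made up for this example.

```python
import re

# Test sentence from the question (the closing quote is assumed).
TEXT = "This is an example of a double word word for an example"

ORIGINAL = r"([a-zA-Z]*) +\1"      # pattern from the question
PLUS_ONLY = r"([a-zA-Z]+) +\1"     # "*" changed to "+", still no word boundaries
FIXED = r"\b([a-zA-Z]+) +\1\b"     # corrected pattern from the reply


def find_matches(pattern, text=TEXT):
    """Return the full text of every non-overlapping match, left to right."""
    return [m.group(0) for m in re.finditer(pattern, text)]


if __name__ == "__main__":
    for name, pattern in [("original", ORIGINAL),
                          ("plus only", PLUS_ONLY),
                          ("fixed", FIXED)]:
        # repr() makes the single-space matches visible in the output
        print(f"{name:10} {find_matches(pattern)!r}")
```

With the original pattern the list contains "is is", "word word" and a
single-space entry for each remaining space. Changing only "*" to "+" removes
the spaces but keeps the false "is is" hit, and adding "\b" on both ends
leaves just "word word".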

Learning regular expressions is not an overnight thing, but I think that the
reward for learning them is worth the cost in time and frustration. You
definitely bought the right book for it. Also, feel free to check out the
articles I have written on using Regex in .NET at http://www.knowdotnet.com.


Brian Davis
http://www.knowdotnet.com
 
Thanks for the help. You are certainly correct that regex is not the
easiest thing to learn. The book is a great help because it uses real-world
examples, unlike most of the articles I have seen, whose examples are of
very little use.

Lloyd Sheen
