Regular Expression, to use or not to use...

  • Thread starter Thread starter Tom
  • Start date Start date
T

Tom

I have struggled with the issue of whether or not to use Regular
Expressions for a long time now, and after implementing many text
manipulating solutions both ways, I've found that writing specialized
code instead of an RE is almost always the better solution. Here is
why....

RE's are complex. Sure it is one line of code, but it is on hell of a
line. Some of my RE remind me of the obfuscated code contest winners,
where one line of cryptic characters generates the 12 days of
Christmas. After you have worked out the complex RE it is almost
impossible for anyone(including yourself) to figure out what it is
really doing. Less code in not necessarily simple code. Plus it is not
like writing this "one line" of code is saving you any time. I have
had really complex RE's that took me hours to write that I needed
external tools to help me with. Sure it was only one line of code in
the end, but I could have written a page of easy to follow code in
half the time. Which brings me to my next point.

RE's are not debug-able. If I have a page of well written code I (or
anyone else) can easily step through it. When I send my 150 char RE
into the RE engine it is a black box. I am just left to wonder why it
didn't work.

Now I am sure you RE masters are out the just calling me a Dumb***,
Just learn REs better and you won't have any problems. But the reality
is that I can assume that anybody looking at my code is proficient in
C++/C#, because it is a C++/C# Program, I can't assume that they are a
RE Guru.

It's difficult to do some things in RE. RE's work great for some
search but not so well for others. I usually find myself having to get
a bunch of results back then doing some other processing on them get
the text out, which mean special code anyways.
Finally I don't know what you guy are saying about RE ever being
faster ever. In my experience RE's are slow, very slow. Like on the
order of 10 times slower then straight forward string parsing code.
For me this is the final kicker. Originally I went through all the
trouble off using RE's in all my complex text paring code because I
thought it would faster. This could be a result of .NET engine,
regardless I was really disappointed by this causing me to rollback a
lot of my solutions.

With all this said I still use REs sometimes, but only for really
simple string operations where I can come up with the expression in a
couple of seconds I don't care about performance and don't feel like
writing 5-10 lines of code. To me it is kind of like a quick hack.

Anyways that's my 2 bits.

Thomas
 
Tom said:
I have struggled with the issue of whether or not to use Regular
Expressions for a long time now, and after implementing many text
manipulating solutions both ways, I've found that writing specialized
code instead of an RE is almost always the better solution. Here is
why....
RE's are complex.

Yesterday somebody asked a string matching question of the C# ng. I've
copied two of the answers here:
The C# version:

// The start Position
int Start = InputName.IndexOf(":")+1;
// The end position
int End = InputName.IndexOf(" T");
// The output
string OutputName = InputName.Substring(Start,End - Start);
// Clear off leading / trailing spaces
OutputName = OutputName.Trim();


The Regex version:
Match m = Regex.Match(inputString, "From: (.*) To:");
if (m.Success)
OutputName = m.Groups[1];

I may be a bad C# programmer, but if I read the former piece of source code
I wouldn't have a clue of what it could be doing. Although every line's
commented, and the code is rather simple.
The RE version on the other hand seems pretty self-explaining to me.
Plus it is not like writing this "one line" of code is saving you any
time.

It does.
Look at the above code snippets and think about how much time the poor
plain-C#-guy will spend debugging, trying to figure out why people like
"James T. Kirk" don't get their email, and why his program sometimes simply
crashes...
I have
had really complex RE's that took me hours to write that I needed
external tools to help me with. Sure it was only one line of code in
the end, but I could have written a page of easy to follow code in
half the time. Which brings me to my next point.

Interesting. Could you post such a RE, and the plain code?
RE's are not debug-able. If I have a page of well written code I (or
anyone else) can easily step through it.

Putting it in a RegEx testing program like Expresso and removing parts of it
usually does a similar job.
When I send my 150 char RE
into the RE engine it is a black box. I am just left to wonder why it
didn't work.

On the other hand, you usually can't copy C# code to some other program and
simply run it there, because it is coupled with other parts of the project.
...
It's difficult to do some things in RE. RE's work great for some
search but not so well for others.

That applies to pretty much any tool I can think of...
...
Finally I don't know what you guy are saying about RE ever being
faster ever. In my experience RE's are slow, very slow. Like on the
order of 10 times slower then straight forward string parsing code.

Depends on what you're doing, and how you're doing it. If you are searching
for a long pattern like "Thomas Jefferson" in a long string (> 100
characters), the RE is about 10 times faster than a culture-invariant
IndexOf.
For me this is the final kicker. Originally I went through all the
trouble off using RE's in all my complex text paring code because I
thought it would faster. This could be a result of .NET engine,
regardless I was really disappointed by this causing me to rollback a
lot of my solutions.

If it's in the 10% of the code that use 90% of the time, you should of
course choose the fastest code you can get.
With all this said I still use REs sometimes, but only for really
simple string operations where I can come up with the expression in a
couple of seconds I don't care about performance and don't feel like
writing 5-10 lines of code.

So what?
To me it is kind of like a quick hack.

A quick hack with full error handling, that's usually much more stable and
robust that those untested "5-10 lines of code"...

Niki
 
Niki Estner said:
Tom said:
I have struggled with the issue of whether or not to use Regular
Expressions for a long time now, and after implementing many text
manipulating solutions both ways, I've found that writing specialized
code instead of an RE is almost always the better solution. Here is
why....
RE's are complex.

Yesterday somebody asked a string matching question of the C# ng. I've
copied two of the answers here:
The C# version:

// The start Position
int Start = InputName.IndexOf(":")+1;
// The end position
int End = InputName.IndexOf(" T");
// The output
string OutputName = InputName.Substring(Start,End - Start);
// Clear off leading / trailing spaces
OutputName = OutputName.Trim();


The Regex version:
Match m = Regex.Match(inputString, "From: (.*) To:");
if (m.Success)
OutputName = m.Groups[1];

I may be a bad C# programmer, but if I read the former piece of source code
I wouldn't have a clue of what it could be doing. Although every line's
commented, and the code is rather simple.
The RE version on the other hand seems pretty self-explaining to me.
Plus it is not like writing this "one line" of code is saving you any
time.

It does.
Look at the above code snippets and think about how much time the poor
plain-C#-guy will spend debugging, trying to figure out why people like
"James T. Kirk" don't get their email, and why his program sometimes simply
crashes...

Okay there are various problems with this example.

First of all this is one of the extremely trival examples that I might
actually use a Regular expression for. Consider a RE like this

^(?:(?:(?:(?:1[6-9]|[2-9]\d)?(?:0[48]|[2468][048]|[13579][26
])|(?:(?:16|[2468][048]|[3579][26])00)))(\/|-|\.)(?:0?2\1(?:
29))$)|(?:(?:1[6-9]|[2-9]\d)?\d{2})(\/|-|\.)(?:(?:(?:0?[1357
8]|1[02])\2(?:31))|(?:(?:0?[1,3-9]|1[0-2])\2(29|30))|(?:(?:0
?[1-9])|(?:1[0-2]))\2(?:0?[1-9]|1\d|2[0-8]))$

And tell me what it does? Better yet it doesn't work 1% of the time
tell me why. This is an actual RE btw. You have to imagine the person
that made this didn't type it up and have it work in one shot, in a
couple of seconds. Sure you can use a 3rd party tool to decipher this
and costs extra money and that nobody else probably has or is familiar
with but why? So I have something that will search the string 10x
slower?

Next the C# code is incorrect. It should really be this.

// The start Position
int Start = InputName.IndexOf("From:")+"From:".Length;
// The end position
int End = InputName.IndexOf("To:",Start);
// The output
string OutputName = InputName.Substring(Start,End - Start);
// Clear off leading / trailing spaces
OutputName = OutputName.Trim();

What's hard to understand about these 3 lines of code? The methods are
clearly named and better yet if you really don't understand it step
through it and see. Which brings me back to my point at least if that
code wasn't working I could easly find out why.

Next your RE is broken too. Consider the following input.
From:John To:Mike
From: JohnTo:Mike
From:John To:Mike
From: To:Bob
From:To:Bob
From: John To:Mike
From:John To:Mike
From: John To:Mike

Your RE will only match 3 of these strings. Which ones? Sounds like
you need to do some debugging also? I was able to easily debug your C#
code.

for the sake of argument lets correct that too.

The Regex version:
Match m = Regex.Match(inputString, "From:(\w*)(.*)(\w*)To:");
if (m.Success)
OutputName = m.Groups[1];

Even this simple RE gets a little more complex and Cryptic.

Okay but there is another thing over looked here.

For the input string From:John To:Mike
The C# code will return the string

"John"

and the RE will return
"From:John To:"

Looks like I need to run my C# code anyway?! Now I have done 2 string
operations and the 2 lines has turned into about six. This is the
extra processing that I refer to with RE's. RE's usually require 2
steps a match then some extra processing on the match. Sure you can do
this with the whole prefix postfix syntax, but again this "simple" RE
is getting more and more complex.

Also neither of these implementation take into consideration really
poorly formated input. With the C# since you are deep in the proceess
you could throw errors like "These is no 'To' after 'For' on line 3",
instead of just missing Matches.

Let's Chalk this up to a bad example and move on.
Interesting. Could you post such a RE, and the plain code?

Initally I ran into this while building an RE to extract all the links
out of an html page. Because of NDAs and Copyrights I don't feel
comfortable posting the code. But this should be an easy one right? I
know my RE code to do it about a Page, mostly because of having the
post process results. Turns out to be about the same amount of code as
the C# string operations actually, except C# code runs about 10x
faster, because it doesn't have to do any auxilary copying or post
processing. It finds exatacally what it needs at extracts it and never
goes over a single byte in the string twice.

Putting it in a RegEx testing program like Expresso and removing parts of it
usually does a similar job.

Again now I need a 3rd party tool. I have a good debugging tool for C#
why not use that?

On the other hand, you usually can't copy C# code to some other program and
simply run it there, because it is coupled with other parts of the project.

It's true I can't copy in a program PERL or something but who cares.
As far as moving C# code around it's no problem why would a string
operation be coupled to anything. The RE code is C# code in the end
too. This is kind of a Mute point.
That applies to pretty much any tool I can think of...

Yea thats true, like I said I still use them occasionally, as a hack,
when I don't care about performance and it actually saves me some
code, and I am not really scared what happens when my "untested"
silently fails to match.
Depends on what you're doing, and how you're doing it. If you are searching
for a long pattern like "Thomas Jefferson" in a long string (> 100
characters), the RE is about 10 times faster than a culture-invariant
IndexOf.

Wrong, I tried that exact example. Please show me where you get this
data.

string input="ndvskjdfsnkvjfdsdsfjvkdfsnklvjndfsjkvldfsnjvndfsnjvklfdjslvdfsnklvdfskvdfs
Thomas Jefferson mvkdskl;dfskl;vmdfskvlmfdskldfsmkl;vmfdsmkldfskl;vm";

{
DateTime start=DateTime.Now;
Regex rex=new Regex("Thomas Jefferson");
for(int i=0;i<1000000;i++)
{
Match m=rex.Match(input);
}
DateTime end=DateTime.Now;

TimeSpan ts=end-start;
Console.WriteLine(ts.TotalMilliseconds.ToString());
}

{
DateTime start=DateTime.Now;
for(int i=0;i<1000000;i++)
{
int imp=input.IndexOf("Thomas Jefferson");
}
DateTime end=DateTime.Now;

TimeSpan ts=end-start;
Console.WriteLine(ts.TotalMilliseconds.ToString());
}

The RE took over 2,000ms where the indexof branch took only 300 ms.
This is about 5X slower. If you could show me some code where the RE
was actually FASTER I would love to see it. Or better yet how to fix
this example made I am just missing something.

But again the time doesn't ever usualy come from the Match itself but
what you have to do to the Match like in the From:To: example. The
extra time it take to copy an intermediate string and process that
overshadows all of this. A custom taylored algorythim will just never
do this.

Also this is an extremly simple re, no |'s or complex expressions.
Also most of the time in complex examples I don't use indexof but
rather walk the string char by char and mantain a state machine. Yes
sometimes this is more complex, but this complexity usually is
proportional to the complexity of the RE, and again I can debug it.

So again I'm looking at harder to write and debug and about 5x slower.
If it's in the 10% of the code that use 90% of the time, you should of
course choose the fastest code you can get.

Agreed, this is where usings RE's is just not an option.

So what?


A quick hack with full error handling, that's usually much more stable and
robust that those untested "5-10 lines of code"...

It's a Hack because it DOESN'T have full error handleing and is dog
slow. When it fails, sure it doens't crash, (neither does IndexOf)it
just doesn't return a Match. Which leaves the RE "untested" and in
need of debuging also. If I am concerned about my parsing code
crashing I can wrap it in a try catch block, and just let it pass
though, so it works like the RE failing, but it might actully be
helpfull for it to crash in a debugger so I can see what is going
wrong instead of just guessing.

Again maybe this is just Microsoft's implementation of RE's. If the
the RE's were really fast. This would start to make sense to me. I
wonder if anyone has information on this.


Tom
 
Tom said:
...
First of all this is one of the extremely trival examples that I might
actually use a Regular expression for. Consider a RE like this

^(?:(?:(?:(?:1[6-9]|[2-9]\d)?(?:0[48]|[2468][048]|[13579][26
])|(?:(?:16|[2468][048]|[3579][26])00)))(\/|-|\.)(?:0?2\1(?:
29))$)|(?:(?:1[6-9]|[2-9]\d)?\d{2})(\/|-|\.)(?:(?:(?:0?[1357
8]|1[02])\2(?:31))|(?:(?:0?[1,3-9]|1[0-2])\2(29|30))|(?:(?:0
?[1-9])|(?:1[0-2]))\2(?:0?[1-9]|1\d|2[0-8]))$

And tell me what it does? Better yet it doesn't work 1% of the time
tell me why. This is an actual RE btw. ...

I didn't have a closer look at it, probably it does some numeric range
check. (wild guessing). Doesn't look as if using regular expressions for
that problem was a smart decision.

If that's the kind of problem you think of, I'd agree with you. Conventional
string parsing would be the better choice here.
(Books about RE's usually contain samples as these just to show what it
*possible* with them.)
Sure you can use a 3rd party tool to decipher this
and costs extra money and that nobody else probably has or is familiar
with but why?

I use Expresso and Regulator.
Both are free and quite common.
Next the C# code is incorrect. It should really be this.

// The start Position
int Start = InputName.IndexOf("From:")+"From:".Length;
// The end position
int End = InputName.IndexOf("To:",Start);
// The output
string OutputName = InputName.Substring(Start,End - Start);
// Clear off leading / trailing spaces
OutputName = OutputName.Trim();

What's hard to understand about these 3 lines of code?

Counting them, apparently ;-)
The methods are
clearly named and better yet if you really don't understand it step
through it and see. Which brings me back to my point at least if that
code wasn't working I could easly find out why.

I can only guess you never spent hours debugging for an error that's not
reproducible? The code above will simply crash if the input is malformed.
Suppose the input was coming in on some kind of server application; It'll
run fine in all your test environments, while your customer will complain
that it constantly crashes twice a day...
Next your RE is broken too. Consider the following input.
From:John To:Mike
From: JohnTo:Mike
From:John To:Mike
From: To:Bob
From:To:Bob
From: John To:Mike
From:John To:Mike
From: John To:Mike

Your RE will only match 3 of these strings. Which ones? Sounds like
you need to do some debugging also? I was able to easily debug your C#
code.

Now *you* are guessing: You don't know what the strings are supposed to look
like. I wouldn't be surprised if the spaces were actually required.
...
Okay but there is another thing over looked here.

For the input string From:John To:Mike
The C# code will return the string

"John"

and the RE will return
"From:John To:"

Use grouping parantheses.
That's not an argument; It's ignorance.
...
Also neither of these implementation take into consideration really
poorly formated input. With the C# since you are deep in the proceess
you could throw errors like "These is no 'To' after 'For' on line 3",
instead of just missing Matches.

You could, which would make those 4 lines to a few hundred, if you really
considered every possible error.
However, if we're comparing apples with apples, the C# version simply
crashes if the input is malformed.
Let's Chalk this up to a bad example and move on.

It's a common everyday example...

Just to extend it a little further: suppose that string matching code would
have to be localized... Say, to a left-to-right reading language...
...[no counterexample]...
Putting it in a RegEx testing program like Expresso and removing parts of it
usually does a similar job.

Again now I need a 3rd party tool. I have a good debugging tool for C#
why not use that?

Do you use a bitmap editor?
A dialog editor?
A zip or other compression tool?
A tooth-brush???
Or do you do all these tasks with your debugger?
I don't see your point...

Different tools for different purposes.
...
It's true I can't copy in a program PERL or something but who cares.
As far as moving C# code around it's no problem why would a string
operation be coupled to anything. The RE code is C# code in the end
too. This is kind of a Mute point.

"Reusability"; ever heared the word before?

To give you a clearer example:

private With ParseWithStatement(){
Context withCtx = this.currentToken.Clone();
AST obj = null;
AST block = null;
this.blockType.Add(BlockType.Block);
try{
GetNextToken();
if (JSToken.LeftParen != this.currentToken.token)
ReportError(JSError.NoLeftParen);
GetNextToken();

this.noSkipTokenSet.Add(NoSkipTokenSet.s_BlockConditionNoSkipTokenSet);
try{
obj = ParseExpression();
....

Do you think you could simply copy that kind of code somewhere and debug it,
or reuse it? (BTW: This is real string parsing code)
Wrong, I tried that exact example. Please show me where you get this
data.

public static void Main()
{
string searchString = "Thomas Jefferson";
string bigString;
StringBuilder builder = new StringBuilder();

const int stringLength = 100;
const int repeatCount = 100000;

Random rnd = new Random(1234);

for (int i=0; i<stringLength; i++)
builder.Append((char)(rnd.Next()%10+'0'));

bigString = builder.ToString();

CompareInfo info = CultureInfo.InvariantCulture.CompareInfo;

{
Console.WriteLine("Testing String.IndexOf:");
long start_time = DateTime.Now.Ticks;
for (int i=0; i<repeatCount; i++)
{
int index = info.IndexOf(bigString, searchString);
}
long end_time = DateTime.Now.Ticks;
Console.WriteLine(new TimeSpan(end_time - start_time));
}

{
Console.WriteLine("Testing Regex.Match:");
Regex regex = new Regex(searchString);
long start_time = DateTime.Now.Ticks;
for (int i=0; i<repeatCount; i++)
{
Match m = regex.Match(bigString);
}
long end_time = DateTime.Now.Ticks;
Console.WriteLine(new TimeSpan(end_time - start_time));
}

Console.ReadLine();
}

You may play with the constants: RE's use the Boyer-Moore algorithm which is
about O(N/M), while ordinary a String.IndexOf is about O(N). The longer the
strings in question are, the faster the regex gets compared to IndexOf.
But again the time doesn't ever usualy come from the Match itself but
what you have to do to the Match like in the From:To: example. The
extra time it take to copy an intermediate string and process that
overshadows all of this. A custom taylored algorythim will just never
do this.

We've had that.
Look up "grouping paranthesis".
Also this is an extremly simple re, no |'s or complex expressions.

Right. Simple ones are fast ones.
What did you expect?
Also most of the time in complex examples I don't use indexof but
rather walk the string char by char and mantain a state machine. Yes
sometimes this is more complex, but this complexity usually is
proportional to the complexity of the RE, and again I can debug it.

You're comparing apples with oranges here.

There are many cases where using a state-machine or a recursive parser is
far superior to a RE, noone doubts that. But sometimes RE's are simply the
better solution.

Niki
 
Back
Top