Regex search: advanced search range settings?

Guest

I'm using .NET 2.0.

I need (for performance reasons) to restrict Regex searches to a certain
portion of a large string. The Regex.Match functions allow me to input the
beginning and ending position of the search. However, what I need is to find
whether there is a Regex match that begins no later than a certain character
position.

For a trivial example, consider the string:
abcde

My regular expression is "cd", and my search range is characters 0-2. The
Regex.Match functions will fail on this search ("cd" is not in "abc"), but I
need it to find any matches that *begin* within the range, and "cd" does
begin on or before character 2.
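
The behavior described above can be reproduced with a small sketch. The class and method names (RangeDemo, MatchesWithin) are made up for illustration; only Regex.Match(string, int, int) comes from the framework.

```csharp
using System;
using System.Text.RegularExpressions;

public static class RangeDemo
{
    // Hypothetical helper: true only if the pattern matches entirely
    // inside input[start .. start + length).
    public static bool MatchesWithin(string input, string pattern, int start, int length)
    {
        Regex regex = new Regex(pattern);
        // The instance overload Match(string, int, int) sees only the given slice,
        // so a match that starts inside the slice but ends outside it is not found.
        return regex.Match(input, start, length).Success;
    }
}
```

Here MatchesWithin("abcde", "cd", 0, 3) searches only "abc" and fails, while a length of 4 succeeds because "abcd" contains the whole match.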

I can't simply lengthen the allowed range (in this case searching 0-3
instead of 0-2), since my actual regular expressions match strings of
arbitrary length.

Any suggestions?
 
[...]
I can't simply lengthen the allowed range (in this case searching 0-3
instead of 0-2), since my actual regular expressions match strings of
arbitrary length.

I don't understand this statement. Not only can you lengthen the allowed
range, you must. I don't see any way for Regex to find characters that
you hide from it.

Would this work?

string strExpression;
string strSearch;
int ichStart, cchLength;

// initialize above variables

Regex regex = new Regex(strExpression);

// Use the instance method: the static Regex.Match has no
// (string, int, int) overload.
return regex.Match(strSearch, ichStart,
    Math.Min(strSearch.Length - ichStart,
        cchLength + strExpression.Length - 1));

Essentially, extend the search length by one less than the number of
characters in your expression, but then constrain it to ensure that the
actual length passed to Match() doesn't exceed the length of the string to
be searched.

In your example, the variables are:

strExpression: "cd"
strSearch: "abcde"
ichStart: 0
cchLength: 3

This results in a call to regex.Match("abcde", 0, 4), which will find the
string you're looking for.
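
A self-contained version of the idea, as a sketch only. It assumes the expression is a non-empty literal, so that its text length equals the length of every match; the names ExtendedRange and MatchStartingInRange are hypothetical.

```csharp
using System;
using System.Text.RegularExpressions;

public static class ExtendedRange
{
    // Hypothetical helper: finds a literal that begins inside
    // strSearch[ichStart .. ichStart + cchLength), even if it ends past that range.
    // Assumption: strLiteral is non-empty literal text, not a general pattern.
    public static Match MatchStartingInRange(string strSearch, string strLiteral,
                                             int ichStart, int cchLength)
    {
        Regex regex = new Regex(Regex.Escape(strLiteral));
        // Extend the range by (literal length - 1), clamped to the end of the string.
        int cchExtended = Math.Min(strSearch.Length - ichStart,
                                   cchLength + strLiteral.Length - 1);
        return regex.Match(strSearch, ichStart, cchExtended);
    }
}
```

With "abcde", "cd", 0 and 3 the extended length is 4 and the match is found at index 2. With "de" the extended slice is still "abcd", so a match that begins at index 3 is correctly rejected. This only works because a literal has a known length.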

Pete
 
OK, my example was too trivial to illustrate my point. Consider the
following string:

<html><head></head><script type="text/javascript"></script>(2MB of HTML text
here)</html>

My regular expression might be something like this:
<\s*script\s*(type=['"]text/javascript['"])?\s*>

For performance reasons (this and hundreds of similar regexes need to be run
in a few milliseconds), I can't search all 2MB of text for this regular
expression. Based on other information available to me from my algorithm, I
am completely uninterested in script tags that *begin* after character 20.
But I can't just restrict my search to characters 0-20, since the Regex class
only matches strings that lie completely within the given range.

However, because of my strict performance requirements, I can't lengthen the
Regex's search domain to the entire 2MB string. Since my regular expression
could match a string that is 10 characters long or 1000 characters long or
100,000 characters long, it's impossible for me to determine the amount to
lengthen the Regex's search range.

This was not an issue when I was using boost::regex, as that library allows
you to search for matches that extend past the end of a given range. I've
ported most of my code to C#, and I had to remove this very important
optimization due to what I see as a limitation in .NET's Regex class.

So my question is, how can I instruct the Regex class to search within a
given range, but allow the match to extend beyond the end of the given range
if necessary?

Or do I get to write .NET bindings for boost::regex? :-p
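
One possible way to get this from the .NET Regex class, offered as a sketch rather than something measured against the 2MB case: prefix the pattern with the \G anchor and call Match(input, startat) once for each allowed start position. \G pins the match to startat, so only the allowed positions are tried, while the match itself is free to run past the range. The names AnchoredSearch and MatchStartingNoLaterThan are hypothetical, and wrapping the pattern in (?:...) assumes it does not depend on inline comments or options that the wrapper would break.

```csharp
using System;
using System.Text.RegularExpressions;

public static class AnchoredSearch
{
    // Hypothetical helper: returns the first match that begins at an index
    // no later than lastStart, or Match.Empty if there is none.
    public static Match MatchStartingNoLaterThan(string input, string pattern, int lastStart)
    {
        // \G requires the match to begin exactly at the startat position.
        Regex anchored = new Regex(@"\G(?:" + pattern + ")");
        int limit = Math.Min(lastStart, input.Length);
        for (int i = 0; i <= limit; i++)
        {
            // Unlike Match(string, int, int), this overload lets the match
            // extend all the way to the end of the input.
            Match m = anchored.Match(input, i);
            if (m.Success)
            {
                return m;
            }
        }
        return Match.Empty;
    }
}
```

For example, MatchStartingNoLaterThan("abcde", "c.*", 2) returns "cde", which begins at index 2 and extends past it. In real use the anchored Regex would be built once and reused rather than constructed on every call.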

 
What you need to do is identify the characters that form a sequence
uniquely identifying the beginning and end of the pattern you're looking
for. Then you can use String.IndexOf to find whether the "script" begins
before character 20. If it does, use String.IndexOf again to find the point
where the sequence ends, and run your Regular Expression on the substring
identified.
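
A rough sketch of that approach for the script tag example. It assumes '<' and '>' are the delimiting characters, which breaks if a '>' appears inside an attribute value; the names LiteralPrefilter and MatchTagStartingNoLaterThan are hypothetical.

```csharp
using System;
using System.Text.RegularExpressions;

public static class LiteralPrefilter
{
    // Hypothetical helper: looks for '<' at an index no later than lastStart,
    // finds the next '>', and runs the regex only on that slice.
    public static Match MatchTagStartingNoLaterThan(string input, Regex regex, int lastStart)
    {
        // Number of leading characters in which a match is allowed to begin.
        int count = Math.Min(lastStart + 1, input.Length);
        int start = (count > 0) ? input.IndexOf('<', 0, count) : -1;
        while (start >= 0)
        {
            int end = input.IndexOf('>', start + 1);
            if (end < 0)
            {
                break; // no closing delimiter anywhere after this point
            }
            // Match.Index is reported relative to the whole input string.
            Match m = regex.Match(input, start, end - start + 1);
            if (m.Success && m.Index == start)
            {
                return m;
            }
            int next = start + 1;
            if (next >= count)
            {
                break;
            }
            start = input.IndexOf('<', next, count - next);
        }
        return Match.Empty;
    }
}
```

The IndexOf calls never look for a start position beyond lastStart, so the cost of a miss is bounded by the range plus one scan for the closing '>'.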

Another option, depending upon your actual requirements, would be to use
unsafe C# code to create a pointer to the beginning of the string and,
rather than using a managed Regular Expression, simply iterate over the
characters in the string. Pointer access in an unsafe block skips the
bounds checks, so it can be faster than the equivalent indexed loop.

--
HTH,

Kevin Spencer
Microsoft MVP


 