Regular Expression limits?

  • Thread starter Thread starter gilad
  • Start date Start date
G

gilad

I'm working on an application in C# that will perform regular expression
matching against a small string. Usually regular expressions are used
such that the text being searched is large while the regular expression
itself is relatively small. What I'm doing is exactly the opposite. The
searched string will generally be small, while the Regex is going to be
large (in terms of a regular expression pattern anyway).

The reason the Regex will be large is because it is taking a large
number of disparate patterns and stringing them together using OR, e.g.

"pattern1|pattern2|pattern3..."

The string being searched could potentially match any of these patterns,
so it must be checked against them. What happens if I have a string of
1000 patterns OR'd together as a single Regex? Is performance hindered,
does Regex have limits that won't allow this?

Thanks, gilad
 
gilad said:
I'm working on an application in C# that will perform regular expression
matching against a small string. Usually regular expressions are used
such that the text being searched is large while the regular expression
itself is relatively small. What I'm doing is exactly the opposite. The
searched string will generally be small, while the Regex is going to be
large (in terms of a regular expression pattern anyway).

The reason the Regex will be large is because it is taking a large
number of disparate patterns and stringing them together using OR, e.g.

"pattern1|pattern2|pattern3..."

The string being searched could potentially match any of these patterns,
so it must be checked against them. What happens if I have a string of
1000 patterns OR'd together as a single Regex? Is performance hindered,
does Regex have limits that won't allow this?

Well, the easiest way to find out would be to just test it.

However, you should probably also test what happens if you just have
1000 separate patterns, and just test each of them in turn. That may be
faster or it may be slower, but in my view it would almost certainly be
easier to understand...
 
The reason the Regex will be large is because it is taking a large
number of disparate patterns and stringing them together using OR, e.g.

"pattern1|pattern2|pattern3..."

The string being searched could potentially match any of these patterns,
so it must be checked against them. What happens if I have a string of
1000 patterns OR'd together as a single Regex? Is performance hindered,
does Regex have limits that won't allow this?

I've not seen anything about limits (unlike the Linux POSIX regex API,
which has a 64K regex table limit) - try it and see.

As for performance, the regex compiler will generally do a great job
implementing the pattern you give it, but I don't believe it will do
much optimization. For example,

"^(pattern1|pattern2|pattern3)$"

may be essentially equivalent to

"^pattern1$|^pattern2$|^pattern3$"

(with RegexOptions.ExplicitCapture turned on) but will, so far as I
know, result in tow distinct regexes. I mention this because if you
have to match, say, "abe", "abet", or "abraham", you may get better
performance by sorting and building a tree

"^ab((e($|t$))|raham$)"

than by matching for

"^(abe|abet|abraham)$"

Again, try it and see.
 
Thanks for the feedback. I was hoping there was something like

Jon said:
I've not seen anything about limits (unlike the Linux POSIX regex API,
which has a 64K regex table limit) - try it and see.

that would simply tell me. I guess I'll just have to test.
However, you should probably also test what happens if you just have
1000 separate patterns, and just test each of them in turn. That may be
faster or it may be slower, but in my view it would almost certainly be
easier to understand...

I agree that it might be more clear to represent the patterns
individually. However, it also seems (from a logical point of view) that
instead of having to compute each of the 1000 pattern strings as a regex
individually, it would be faster to compute them all at once. Ultimately
I don't think the regex will ever be very complex--it will mostly
consist of simple patterns strung together with the OR operator. It
might just be large. Anyway, I'm off to test my assumptions...

Thanks again for the comments. gilad
 
Back
Top