limits on regex?

George Neuner

Hi all,

Does anyone have any hard data on the size / complexity limits of the
.NET regex implementation - either as interpreted or as compiled to an
assembly?

I have an application that scans structured documents for user-specified
words and phrases (including looking for common misspellings and/or
mispunctuations of words in the user's pattern) which I am planning to
port to .NET. The current app generates a separate pattern to be applied
to each document section and then searches over a database of documents,
picking out those that match.

Currently pattern complexity is limited only by available memory as
the regex implementation uses interpreted DFAs. I have seen some
users construct tremendously complicated patterns which have many
hundreds of states in the resulting DFA. Although this is far from
typical usage, the current implementation allows it and I would like
to preserve as much of the current functionality as possible in any
new version.

Preferably I would like to compile each generated regex to an assembly
and dynamically load it for use (I know this requires auxiliary
AppDomains to unload). Basically I'd like to get a feel for how complex
a single regex can be before I do a lot of coding with them and then
have to figure out why it doesn't work.
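
For reference, the mechanics I have in mind look roughly like this (a
minimal sketch - the pattern, class and namespace names are placeholders,
not what my generator actually emits):

using System;
using System.Reflection;
using System.Text.RegularExpressions;

class CompileDemo
{
    static void Main()
    {
        // Placeholder pattern / names standing in for what the generator
        // would emit for one document section.
        RegexCompilationInfo info = new RegexCompilationInfo(
            @"\b(?:the|teh|hte)\b",        // hypothetical pattern
            RegexOptions.IgnoreCase,
            "SectionPattern",              // generated class name
            "MyApp.Generated",             // namespace
            true);                         // public type

        // Writes GeneratedPatterns.dll to the current directory.
        AssemblyName asmName = new AssemblyName("GeneratedPatterns");
        Regex.CompileToAssembly(new RegexCompilationInfo[] { info }, asmName);

        // Load it back and instantiate the compiled Regex subclass. To be
        // unloadable, this part would run in an auxiliary AppDomain.
        Assembly asm = Assembly.LoadFrom("GeneratedPatterns.dll");
        Regex rx = (Regex)Activator.CreateInstance(
            asm.GetType("MyApp.Generated.SectionPattern"));

        Console.WriteLine(rx.IsMatch("Teh quick brown fox"));
    }
}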

Thanks,
George
 
Hi, George

Due to the nature of the regular expression language you won't be able to
get definitive information on this. My experience has been that the
resulting behavior of a regex is sometimes very dependent on the input
data, especially where backtracking and recursion are used in the
underlying implementation. The same regex can work on one string and fail
on another - and by fail I don't necessarily mean an exception or an
invalid result, but more like 100% CPU for minutes if not hours. Also,
depending on the complexity of the expression, performance may degrade
significantly for longer input strings. Hard data in such cases is
absolutely inconclusive, because it changes depending on the input, not
on the complexity of the regex.
I have found that regexes work best when the expressions do not require
extensive backtracking, are not very ambiguous (not many alternative
paths in the same expression - ideally none) and are targeting
deterministic input. The more complex the regex, the more unpredictable
the resulting behavior.
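
To illustrate (a contrived example, not from your application): a nested
quantifier forces the backtracking engine to try an exponential number of
paths on input that almost matches.

using System;
using System.Diagnostics;
using System.Text.RegularExpressions;

class BacktrackDemo
{
    static void Main()
    {
        // ^(a+)+$ matches instantly on all-'a' input, but on a near-miss
        // the engine tries every partition of the 'a' run before failing.
        Regex rx = new Regex("^(a+)+$");

        for (int n = 10; n <= 25; n += 5)
        {
            string input = new string('a', n) + "!";  // '!' forces failure
            Stopwatch sw = Stopwatch.StartNew();
            rx.IsMatch(input);
            sw.Stop();
            // Each extra 5 characters multiplies the time roughly 32-fold.
            Console.WriteLine("n = {0,2}: {1} ms", n, sw.ElapsedMilliseconds);
        }
    }
}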

I would also suggest reconsidering the design. From my point of view it
doesn't make much sense to generate a separate assembly for the regexes.
It is sufficient to load the regex expression strings as the need arises
and recreate the Regex objects as required. In your case, however, I
would do the pattern analysis when storing / changing documents in the
DB and keep the resulting pattern description together with the document
data. If the volume of documents to search grows significantly, regexing
over all of them will become problematic.
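
Something like the following is what I mean by recreating them as
required (just a sketch - the cache is an ordinary dictionary keyed by
the pattern string):

using System.Collections.Generic;
using System.Text.RegularExpressions;

// Sketch: build each Regex from its pattern string on first use and
// reuse the instance, instead of persisting compiled assemblies.
class PatternCache
{
    private readonly Dictionary<string, Regex> cache =
        new Dictionary<string, Regex>();

    public Regex Get(string pattern)
    {
        Regex rx;
        if (!cache.TryGetValue(pattern, out rx))
        {
            // RegexOptions.Compiled emits IL in-process: slower to
            // construct, faster to match, and no assembly on disk to
            // manage (though the generated IL cannot be unloaded on
            // its own).
            rx = new Regex(pattern, RegexOptions.Compiled);
            cache[pattern] = rx;
        }
        return rx;
    }
}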

HTH
Alex
 
Hi Alex,

> Due to the nature of the regular expression language you won't be able
> to get definitive information on this.

That's unfortunate ... I was hoping someone could offer some insight
based on actual experience of using complicated patterns.

> My experience has been that the resulting behavior of a regex is
> sometimes very dependent on the input data, especially where
> backtracking and recursion are used in the underlying implementation.

Agreed. The current lib uses DFAs with no backtracking. I don't know
about recursion, but it's a graph implementation so I guess it could
reuse subexpressions. It is slow to construct the FAs and can be
memory-intensive, but the FAs are then reused over many documents, so
construction time is rarely an issue.

> The same regex can work on one string and fail on another - and by fail
> I don't necessarily mean an exception or an invalid result, but more
> like 100% CPU for minutes if not hours. Also, depending on the
> complexity of the expression, performance may degrade significantly for
> longer input strings. Hard data in such cases is absolutely
> inconclusive, because it changes depending on the input, not on the
> complexity of the regex. I have found that regexes work best when the
> expressions do not require extensive backtracking, are not very
> ambiguous (not many alternative paths in the same expression - ideally
> none) and are targeting deterministic input. The more complex the
> regex, the more unpredictable the resulting behavior.

I don't like the sound of that. FA behavior should be predictable - one
state transition per input character for a DFA ... a bit more complicated
for an NFA due to cloning, but still predictable from the pattern
(assuming one knows the implementation).
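
For example (a toy illustration, nothing to do with the actual library
internals): a table-driven DFA does exactly one transition per input
character, so matching time is linear in the input no matter how complex
the pattern that produced the table was.

using System;

class DfaDemo
{
    static void Main()
    {
        // Toy DFA for the language a*b over the alphabet {a, b}.
        // States: 0 = start, 1 = accept, 2 = dead.
        int[,] next = {
            { 0, 1 },   // state 0: 'a' stays, 'b' accepts
            { 2, 2 },   // state 1: any further input is a dead end
            { 2, 2 }    // state 2: dead state absorbs everything
        };

        string input = "aaab";
        int state = 0;
        foreach (char c in input)
            state = next[state, c == 'a' ? 0 : 1];  // one step per char

        Console.WriteLine(state == 1);  // True: "aaab" matches a*b
    }
}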

I have read that .NET's implementation is based on Perl's - which I
haven't used, but also haven't heard many complaints about. If .NET's
implementation is as unpredictable as you indicate, I would say it is
unusable for serious work.

> I would also suggest reconsidering the design. From my point of view it
> doesn't make much sense to generate a separate assembly for the
> regexes. It is sufficient to load the regex expression strings as the
> need arises and recreate the Regex objects as required. In your case,
> however, I would do the pattern analysis when storing / changing
> documents in the DB and keep the resulting pattern description together
> with the document data. If the volume of documents to search grows
> significantly, regexing over all of them will become problematic.

That's a good idea, but unfortunately I don't have any control over the
document database. The documents are provided "as is" and need to be
searched / filtered in complicated ways.

Thanks for the input.
George
 