Memory leak in RegEx?

  • Thread starter: Anthony Shorrock

Anthony Shorrock

I am looping over records in a table using a datareader. I am parsing the
organisation name in each record using RegEx. What I am finding is that the
memory reported by task manager goes up continually when I call my
ParseOrganisation method. Without the call, the memory remains constant.
My problem is that when I run this across millions of records, the memory
consumed will go up substantially.

Does anyone know of a memory leak in dotnet's RegEx class?

I would be grateful for any assistance.

Cheers



This is my parse organisation method...

public void ParseOrganisation(string organisation)
{
    Match m = null;

    string expression =
        System.Configuration.ConfigurationSettings.AppSettings["orgRegEx"];

    // First set the Regex options based on the check boxes
    RegexOptions TheOptions = RegexOptions.None;
    TheOptions |= RegexOptions.IgnoreCase;
    TheOptions |= RegexOptions.IgnorePatternWhitespace;
    TheOptions |= RegexOptions.Multiline;

    OrgMatcher = new Regex(expression, TheOptions);

    // Org Processing
    m = OrgMatcher.Match(Scrub(organisation));

    if (m.Success)
    {
        this.org = m.Result("${org}");
    }
}
 
Regexes do use memory, but it's nothing too dramatic most of the time. The
Regex may be getting stuck trying to find a match, going over the same input
again and again. Can you post the regex and an example of one of the finds?

Does it do it on every record or just some?
 
Anthony said:
I am looping over records in a table using a datareader. I am parsing the
organisation name in each record using RegEx. What I am finding is that the
memory reported by task manager goes up continually when I call my
ParseOrganisation method. Without the call, the memory remains constant.
My problem is that when I run this across millions of records, the memory
consumed will go up substantially.

Does anyone know of a memory leak in dotnet's RegEx class?

If a Regex is created with the RegexOptions.Compiled option set, then
the documented behavior is that the generated IL for the Regex will
never be collected. However, your example doesn't set that bit.

Even so, I would not be shocked to find that Regexes that are not
compiled to IL (they are compiled to some undocumented interpreted form)
might have similar behavior under certain conditions.

I would try a small change: set up your Regex once in a static
member and use that Regex repeatedly, instead of creating and discarding
what is essentially the same Regex on each call to the
ParseOrganisation() method.

Something like (warning - untested):

using System.Text.RegularExpressions;

public class YourClass // I don't know what your class name is...
{
    // Built once, when the class is first used, then shared by every call.
    private static Regex OrgMatcher = null;

    static YourClass()
    {
        string expression =
            System.Configuration.ConfigurationSettings.AppSettings["orgRegEx"];

        OrgMatcher = new Regex(expression,
            RegexOptions.IgnoreCase |
            RegexOptions.IgnorePatternWhitespace |
            RegexOptions.Multiline);
    }
}


Now you can drop all the stuff in ParseOrganisation that initializes
OrgMatcher, rerun the test and see if the memory problem is fixed.

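As a bonus, you could then compile the Regex to IL by adding
RegexOptions.Compiled, and gain the benefit of a regex operation that gets
JIT'ed to native code. With a single static instance, the IL that is never
collected becomes a one-time cost.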
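
Putting the two suggestions above together, the slimmed-down method might look roughly like the sketch below. The class name, the Org property and the pattern are assumptions made for illustration only: the real expression lives in the "orgRegEx" app setting, which was not posted, and the Scrub() call is left out because its code was not shown either.

```csharp
using System;
using System.Text.RegularExpressions;

public class OrganisationParser // hypothetical class name
{
    // Hypothetical pattern, for illustration only. The real one comes from
    // the "orgRegEx" app setting and is not shown in this thread.
    private const string Expression = @"^\s* (?<org> .+? ) \s*$";

    // Created once per process and reused by every call.
    // RegexOptions.Compiled is optional: it trades a one-time compilation
    // cost for faster matching afterwards.
    private static readonly Regex OrgMatcher = new Regex(
        Expression,
        RegexOptions.IgnoreCase |
        RegexOptions.IgnorePatternWhitespace |
        RegexOptions.Multiline |
        RegexOptions.Compiled);

    private string org;

    public string Org { get { return org; } }

    public void ParseOrganisation(string organisation)
    {
        // No Regex is constructed here; the shared instance is reused.
        Match m = OrgMatcher.Match(organisation);

        if (m.Success)
        {
            this.org = m.Result("${org}");
        }
    }
}
```

Because the Regex is only constructed once, the per-record work is reduced to the Match call itself, which makes it easier to tell whether the remaining memory growth comes from the Regex or from somewhere else.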

 
Thanks a lot for your help.

I also noticed that ad hoc calls to Regex.IsMatch were eating at my memory,
which was never being recovered.

Converting every regex I use into a static member, and incurring that
one-time-only memory hit, has solved my memory leak.
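
For anyone reading later, the conversion described above can be sketched roughly as follows. The helper class, method name and pattern here are hypothetical and not taken from the original code; the point is only that each ad hoc Regex.IsMatch(input, pattern) call site is replaced by a call on one shared instance.

```csharp
using System;
using System.Text.RegularExpressions;

public static class RecordChecks // hypothetical helper class
{
    // Hypothetical pattern, for illustration only.
    private static readonly Regex HasDigits = new Regex(@"\d+");

    // Before: Regex.IsMatch(value, @"\d+") written out at each call site.
    // After: every caller goes through the single instance above.
    public static bool ContainsDigits(string value)
    {
        return HasDigits.IsMatch(value);
    }
}
```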

Cheers again.
 
Welcome to the wonderful world of garbage collectors. Every time you call
your function, new objects are created (using some memory), and that memory
will not be freed until there's a garbage collection. If you call this a few
million times, you're going to have a problem. You can either keep those
objects in a higher scope (in the calling function) so you don't recreate
them for every record, or you'd have to switch to a language where you have
control over memory allocation, such as [unmanaged] C++.

Jerry
 
Pardon me, but I recall C# being a garbage-collected language, so what
do you mean by leak?
 