Memory management extremely poor in C# when manipulating strings

Guest

I have a fairly simple C# program that just needs to open up a fixed width
file, convert each record to tab delimited and append a field to the end of
it.

The input files are between 300 MB and 600 MB. I've tried every memory
conservation trick I know in my conversion program, and a bunch I picked up
from reading some of the MSDN C# blogs, but still my program ends up using
hundreds and hundreds of megs of RAM. It is also taking excessively long to
process the files (between 10 and 25 minutes). Also, with each successive
file I process in the same program, performance goes way down, so that by the
3rd file the program comes to a complete halt and never completes.

I ended up rewriting the process in Perl, which takes only a couple of minutes
and never really gets above a 40 MB footprint.

What gives?

I'm noticing this very poor memory handling in all my programs that need to
do any kind of intensive string processing.

I have a 2nd program that just implements the LZW decompression
algorithm (pretty much copied straight out of the manuals). It works great on
files less than 100 KB, but if I try to run it on a file that's just 4.5 MB
compressed, it runs up to a 200+ MB footprint and then starts throwing
OutOfMemoryExceptions.

I was wondering if somebody could look at what I've got down and see if I'm
missing something important. I'm an old-school C programmer, so I may be
doing something that is bad.

Would appreciate any help anybody can give.

Regards,

Seg
 
Jon Skeet [C# MVP]

Segfahlt said:
I have a fairly simple C# program that just needs to open up a fixed width
file, convert each record to tab delimited and append a field to the end of
it.

The input files are between 300M and 600M. I've tried every memory
conservation trick I know in my conversion program, and a bunch I picked up
from reading some of the MSDN C# blogs, but still my program ends up using
hundreds and hundreds of megs of ram. It is also taking excessively long to
process the files. (between 10 and 25 minutes). Also, with each successive
file I process in the same program, performance goes way down, so that by the
3rd file, the program comes to a complete halt and never completes.

I ended up rewriting the process in perl which takes only a couple minutes
and never really gets above a 40 M footprint.

What gives?

It's very hard to say without seeing any of your code. It sounds like
you don't actually need to load the whole file into memory at any time,
so the memory usage should be relatively small (aside from the overhead
for the framework itself).
I'm noticing this very poor memory handling in all my programs that need to
do any kind of intensive string processing.

I have a 2nd program that just implements the LZW decompression
algorithm(pretty much copied straight out of the manuals.) It works great on
files less than 100K, but if I try to run it on a file that's just 4.5M
compressed, it runs up to 200+ Megs footprint and then starts throwing Out of
Memory exceptions.

I was wondering if somebody could look at what I've got down and see if I'm
missing something important? I'm an old school C programmer, so I may be
doing something that is bad.

Would appreciate any help anybody can give.

Could you post a short but complete program which demonstrates the
problem?

See http://www.pobox.com/~skeet/csharp/complete.html for details of
what I mean by that.
 
Willy Denoyette [MVP]

Segfahlt said:
<snip>

It's really hard to answer such a broad question without a clear description
of the algorithm used or without seeing any code, so I'll have to guess:
1. You read the whole input file into memory.
2. You store each modified record in an array of strings or in a string
collection, and only write them to the output file once you're done with the
input file.
3. 1 + 2.
.....
Either way, you seem to be holding too many strings in memory before writing
them to the output file.

Willy.
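
For illustration, a minimal sketch of the streaming approach Willy is hinting at, where each line is transformed and written immediately and nothing is accumulated (the file names and appended field are placeholders, not from the original program):

using System;
using System.IO;

class StreamingConvert
{
    static void Main()
    {
        // Hypothetical file names; the real program gets them from FileInfo objects.
        using (StreamReader reader = new StreamReader("input.txt"))
        using (StreamWriter writer = new StreamWriter("output.txt", false))
        {
            string line;
            while ((line = reader.ReadLine()) != null)
            {
                // Transform one record at a time and write it straight out;
                // nothing from previous lines is kept in memory.
                string[] fields = line.Split(',');
                writer.WriteLine(string.Join("\t", fields) + "\tEXTRA_FIELD");
            }
        }
    }
}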
 
rossum

Segfahlt said:
<snip>

A thought: every time you change a "String" a whole new copy is made;
all C# Strings are immutable. If you are going to make a lot of
changes to chunks of text then use a "StringBuilder" instead. If
needed then just convert the final StringBuilder to a String at the
end of the manipulation.
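
For illustration, a minimal sketch of the difference rossum describes (the loop count is arbitrary):

using System;
using System.Text;

class StringBuilderDemo
{
    static void Main()
    {
        // Each += allocates a brand-new string and copies the old contents,
        // so a long loop does quadratic work and churns the garbage collector.
        string slow = "";
        for (int i = 0; i < 10000; i++)
        {
            slow += "x";
        }

        // StringBuilder appends into an internal buffer instead.
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < 10000; i++)
        {
            sb.Append("x");
        }
        string fast = sb.ToString(); // convert to a String once, at the end

        Console.WriteLine(slow.Length == fast.Length); // True
    }
}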

rossum


The ultimate truth is that there is no ultimate truth
 
Guest

This is for Jon's & Willy's replies.

1.) No, I'm not loading the whole file into memory. Originally I tried reading
it in 50 MB chunks, and I slowly whittled that down to 1 MB chunks with no
relief. Now I'm reading it in line by line.

2.) I'm not sure if I can post a "short but complete" version that
demonstrates what I'm seeing. I can post a trimmed-down version, but it
encompasses some other data structures that I think are affecting the memory
management (Hashtables). I guess I could post the basic code with some stats
on the hash tables.

The entire algorithm is only about 300 lines (with comments), so I'm going to
go ahead and post it here (minus some of the superfluous stuff).

Please let me know if it would be more beneficial to post an actual working
program. I would think the idea here is not to identify bugs, but to identify
where memory is not getting released.

-----------BEGIN CODE SNIPPET------------------
private void ProcessInputFile(FileInfo f) {
    /* NOTE: the UPC object has been previously instantiated & populated
     * prior to this call. */
    string rejectFilePath = null;
    string outputFilePath = null;

    int RecordsConverted = 0;
    int RejectedRecords = 0;
    int foundBySKU = 0;
    int foundByToy = 0;
    int foundByLocalSKU = 0;
    int foundByLocalToy = 0;
    string inLine = null;
    Hashtable lSKU_UPC_HASH = new Hashtable();
    Hashtable lTOY_UPC_HASH = new Hashtable();

    StreamReader IF = new StreamReader(f.FullName);

    // This is how we get the DATE_END value so we can name our output file
    // accordingly. First line is junk. 2nd line is what we want.
    IF.ReadLine();
    inLine = IF.ReadLine();
    inLine = Regex.Replace(inLine, @"\s+","");
    string[] fieldvalues = Regex.Split(inLine, ",");
    string date = fieldvalues[4];
    outputFilePath = this.process_path + "\\" + date + "_output_" + f.Name;
    IF.Close();

    // Now start in on the actual processing.
    rejectFilePath = this.reject_file_path + "\\" + date + "_reject_" + f.Name;
    StreamWriter OFR = new StreamWriter(rejectFilePath, false);
    StreamWriter OF = new StreamWriter(outputFilePath, false);

    IF = new StreamReader(f.FullName);

    // Header information. Need to append "UPC".
    inLine = IF.ReadLine();

    OFR.WriteLine(inLine);
    OFR.Flush();

    inLine = Regex.Replace(inLine, @" +, +","\t");
    inLine += "\tUPC";

    OF.WriteLine(inLine);
    OF.Flush();

    string prevSKU = null;
    string prevToy = null;
    string curSKU = null;
    string curToy = null;
    string sUPC = null;

    while((inLine = IF.ReadLine()) != null) {
        RecordsConverted++;
        StringBuilder buf = new StringBuilder(1024);

        // Split the record into fields.
        string[] fields = Regex.Split(Regex.Replace(inLine,@" +",""), @",");
        curSKU = fields[2];
        curToy = fields[6];

        /*
         * The following bit of code is a somewhat vain attempt at some performance
         * improvements for speed. What we have here is a lookup against two hashes.
         * The first hash, the SKU hash, is only about 2200 items long. The Toy hash,
         * though, is over 100,000. The files are organized mostly sorted by SKU.
         * So first, I'll store our SKU and TOY # value in temp variables, look up
         * the values for them, then continue on to the next loop. On the next loop,
         * if my SKU or TOY are the same, we know we'll get the same UPC from it,
         * so we'll just use the stored UPC from the last loop. If we find a new Toy
         * or SKU, then we'll look up that new value and store it in the two
         * dynamically built hashes lSKU_UPC_HASH or lTOY_UPC_HASH. These guys will
         * be much smaller than the full hashes for the same type. Since they are
         * smaller they'll be faster. Finally, if we can't find our UPC based on the
         * previous value or values which we've looked up in our smaller local
         * hashes, we'll go to the global hashes to find our UPC. Once we get it,
         * we'll store the TOY and SKU in our local hashes for use next time.
         */
        if(! (curSKU == prevSKU || curToy == prevToy)) {
            if(lSKU_UPC_HASH.ContainsKey(curSKU)) {
                foundByLocalSKU++;
                sUPC = lSKU_UPC_HASH[curSKU].ToString();
            } else if(lTOY_UPC_HASH.ContainsKey(curToy)) {
                sUPC = lTOY_UPC_HASH[curToy].ToString();
                foundByLocalToy++;
            } else {
                // The SKUs have a . behind them in the text file, so we need to strip it out.
                string sSKU = Regex.Replace(curSKU,@"\.","");
                sUPC = UPC.GetUPCBySKUShared(sSKU); // just a hash lookup by sSKU
                if(sUPC != null) {
                    lSKU_UPC_HASH.Add(curSKU, sUPC);
                    foundBySKU++;
                } else {
                    string sToy = curToy.Length == 4 ? " " + curToy : curToy;
                    sUPC = UPC.GetUPCShared(sToy); // just a hash lookup by sToy
                    if(sUPC != null) {
                        lTOY_UPC_HASH.Add(curToy, sUPC);
                        foundByToy++;
                    }
                }
            }
            prevSKU = curSKU;
            prevToy = curToy;

            // If we can't find a UPC, we need to reject the record. Do this by writing
            // the record to the reject file, bump up our reject record counter and continue.
            if(sUPC == null || sUPC.Length < 1) {
                RejectedRecords++;
                OFR.WriteLine(inLine);
                continue;
            }
        } // if(! (curSKU == prevSKU || curToy == prevToy))

        OF.WriteLine(string.Join("\t", fields) + "\t" + sUPC);
    } // while IF.ReadLine

    OF.Close();
    OFR.Close();
    IF.Close();
}
-------------END CODE SNIPPET---------------
 
Helge Jensen

1. Try running a memory profiler, for example CLR Profiler, which is available
for download somewhere on Microsoft's homepage. That will tell you what's
keeping the objects alive.

2. Try running a performance profiler, for example NProf, which is available on
the net. That will probably show you that your optimizations don't
really help :)

One thing you will probably observe (using the profiler -- of course :)
is that the Regex.* calls are expensive (at least in MS .NET). You should
probably compile the regular expressions used in the inner calls, like

Regex.Split(Regex.Replace(inLine,@" +",""), @",");

by constructing a Regex with RegexOptions.Compiled and reusing it. You can
also use string.Split instead of Regex.Split.
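
For illustration, a sketch of that reuse, applied to the pattern from the posted code (the sample input in Main is made up):

using System;
using System.Text.RegularExpressions;

class RegexReuse
{
    // Build the Regex once (optionally compiled to IL with RegexOptions.Compiled)
    // and reuse it for every line, instead of calling static Regex.Replace in the loop.
    static readonly Regex Spaces = new Regex(@" +", RegexOptions.Compiled);

    static string[] SplitRecord(string line)
    {
        // A fixed one-character delimiter doesn't need Regex.Split at all.
        return Spaces.Replace(line, "").Split(',');
    }

    static void Main()
    {
        string[] fields = SplitRecord("  ABC ,  123. ,  XYZ ");
        Console.WriteLine(string.Join("|", fields)); // ABC|123.|XYZ
    }
}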
2.) I'm not sure if I can post a "short but complete" version that
demonstrates what I'm seeing. I can post a trimmed down version, but it

The program you posted can be "shaved" a lot more and probably still
exhibit the same behaviour. The more you shave it, the more likely it is
that someone will spend time helping you -- or you might find the
problem yourself ;)
Please let me know if it would be more beneficial to post an actual working
program. I would think the idea here is not to identify bugs, but identify
where memory is not getting released.

Start by removing all your optimizations. Then remove lines that don't
affect the actual behaviour, for example the counters, the writing to
files, ...
-----------BEGIN CODE SNIPPET------------------
private void ProcessInputFile(FileInfo f) { ....
while((inLine = IF.ReadLine()) != null) {
....
StringBuilder buf = new StringBuilder(1024);

Not used, doesn't the compiler warn you about that?
string[] fields = Regex.Split(Regex.Replace(inLine,@" +",""), @",");

Use string.*
/*
* The following bit of code is a somewhat vain attempt at some performance
* improvements for speed. What we have here is a lookup against two hashes.
* The first hash, the SKU hash, is only about

Hash size should not influence the performance of a hashtable lookup.
Hashtable lookup (theoretically, and *very* close to that in reality)
is O(1).
* mostly sorted by SKU. So first, I'll store our SKU and TOY # value in temp variables.
* Look up the values for them, then continue on to the next loop. On the next loop, if my
* SKU or TOY are the same, we know we'll get the same UPC from it, so we'll just use the

A profiler will show you how much speed you gain by this; if it is
anything at all it should be *very* little, since hash lookups are
screamingly fast.
* stored UPC from the last loop. If we find a new Toy or SKU, then we'll look up that new value
* and store it the two dynamically built hashes lSKU_UPC_HASH or lTOY_UPC_HASH. These guys
* will be much smaller than the full hashes for the same type. Since they are smaller they'll be faster.

No. Are you *copying* the locally relevant parts of the other hashes into
local hashes? Don't do that, it will definitely not make your program run
faster.

....
sUPC = lSKU_UPC_HASH[curSKU].ToString();

....
Isn't sUPC supposed to be a string? Then use a cast.
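
For illustration, a self-contained sketch of the cast Helge suggests (the key and UPC value here are made up):

using System;
using System.Collections;

class CastDemo
{
    static void Main()
    {
        // Hypothetical key and UPC value, just to show the lookup.
        Hashtable lSKU_UPC_HASH = new Hashtable();
        lSKU_UPC_HASH["1234."] = "0001234567890";

        // The Hashtable indexer returns object; since the values are strings,
        // a cast expresses the intent directly and avoids the ToString() call.
        string sUPC = (string)lSKU_UPC_HASH["1234."];
        Console.WriteLine(sUPC);
    }
}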
....
string sSKU = Regex.Replace(curSKU,@"\.","");

Use string.Replace or string.Substring here instead.
OF.Close();
OFR.Close();
IF.Close();

Use the "using" idiom instead.
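
For illustration, a sketch of the using idiom applied to the three streams in the posted method (identifiers taken from the snippet; the loop body is elided):

// using guarantees that each stream is closed/disposed even if an
// exception is thrown partway through the file.
using (StreamReader IF = new StreamReader(f.FullName))
using (StreamWriter OFR = new StreamWriter(rejectFilePath, false))
using (StreamWriter OF = new StreamWriter(outputFilePath, false))
{
    // ... read, transform and write the lines here ...
} // IF, OFR and OF are all closed here automatically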
 
Jon Skeet [C# MVP]

Segfahlt said:
This is for Jon's & Willy's replies.

<snip>

Just a reply to say that Helge's reply is spot on - try to remove those
Regexes where you can (you may well not need to be doing everything
that the regex does, and so String.Replace might be fine for the
replacement, and String.Split will almost certainly be fine for the
splitting).
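
For illustration, the plain-string equivalents Jon mentions, matched to the two regexes in the posted loop (this assumes the padding really is ordinary spaces only, as the @" +" pattern implies):

// Fragment based on the posted loop. The pattern @" +" only ever matches space
// characters, so removing every single space gives the same result:
string cleaned = inLine.Replace(" ", "");

// Splitting on a fixed one-character delimiter doesn't need Regex.Split:
string[] fields = cleaned.Split(',');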

If that doesn't help a lot, posting an actual working program (along
with some way of generating a sample file) would be very useful indeed.
The way I see it, there's a third option as well as identifying bugs
and identifying where memory is not getting released - identifying
places where performance could be improved without it necessarily being
a bug to start with.
 
