"Parallel.For GC problems" and a solution.

Robert

Quick summary:
1) running one class serially, all is well
2) running same thing in parallel, Gen 2 bytes go way up, and LOH usage goes way up.
3) the classes share ZERO state. The only thing they share is a callback to the GUI
reporting the number of records processed, and records fixed. The classes do a bunch
of tallying, verifying no dupes, etc. Should parallelize very well..
4) I am using the TPL, and was using Parallel.For to spawn off an instance for each
file.
5) This causes the GC to get very confused:
a) We can not coalesce mem regions, since 4-8 threads are always in use.
b) as one thread dies, some collections occur, but the other threads keep allocating.
c) never "rests" to give the GC time to coalesce everything back to a clean point.
d) this just gives ever rising memory counters.
e) Running "Performance Explorer" inside VStudio shows a bunch of Ints, int[], etc
in Gen 2, and LOH. I think Dictionary( of dictionary(of small array of ints))
with the dictionaries holding arrays of keys, and values is the problem.
f) all these dictionaries are released in the normal way. Tried EVERYTHING to explicitly deallocate them..

Solution:
When calling Parallel.For, do not pass it one large array of things to process.
Currently I batch them into chunks of Processors * ThreadsPerProcessor items.
Run Parallel.For on one chunk. Run GC. Repeat as necessary. This idles the CPU periodically,
giving a spiky-looking CPU graph, but it runs faster than serial, with no memory problems.

Summary:
With rest breaks the GC behaves normally. With no breaks, memory goes crazy.


This took about a day and a half to figure out.

I suspect this would also happen with the stock standard ThreadPool
as the GC is the same. Lighter threads, without so much alloc/dealloc
would probably not have this problem.
Each of my threads is using 10-60 megs. 8 of them would need half a gig.
This is about 1/3 of the max memory for a 32 bit process. When running the
bad way, recs/sec would drop off steadily until OOM.

Moral of the story:
When running in parallel, make sure you take a breather now and then..
 
Comment from an old bloke:

I love your last comment, "take a breather now and then", completely spot
on!

Threading is a double edged sword. While it is an enormously useful tool, if
it is not thought out well, issues such as memory usage and process
contention can be a real PITA.

My own disaster was to try and get a process to generate an invoice, persist
that invoice and optionally email or print the invoice. The staff were
delighted until the printing/emailing step hung once in a while.

So now, the staff are back to a quick coffee break when printing/emailing
invoices (approx 1k jobs). Still does the whole thing in under 5 minutes.

Cheers
 
Robert,

That is a lot of words. Do you also have some code? You say you tried EVERYTHING;
can you give us an idea of what EVERYTHING means to you?

Cor
 
Cor Ligthert said:
Robert,

That is a lot of words. Do you also have some code? You say you tried EVERYTHING;
can you give us an idea of what EVERYTHING means to you?

How would that help anyone? My worker code was not the problem, nor did any changes to it lead to a solution..

To reiterate:
Serial stock standard For loop - flat memory graph.. demonstrates my worker code is not the problem...
"Parallel.For GC problems" - steady increase in Gen 2 memory and LOH, with no shared state. EXACT same worker code!
Memory graph goes from 50mb to 1800mb in about 5 minutes. nice steep upward slope.. Then OOM crash.
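
For anyone wanting to reproduce the comparison without a profiler, here is a minimal sketch that logs the managed heap size and the Gen 2 collection count. HeapReport is a hypothetical helper name, not part of my code, and GC.GetTotalMemory is only an estimate of the managed heap, so the numbers will not match the performance counters exactly.

```vb
Imports System

Module HeapReportSketch

    ' Hypothetical helper: formats the estimated managed heap size and the
    ' number of Gen 2 collections that have happened so far in this process.
    Public Function HeapReport(label As String) As String
        ' False = do not force a collection before measuring.
        Dim managedMb As Double = GC.GetTotalMemory(False) / (1024.0 * 1024.0)
        Return String.Format("{0}: {1:F1} MB managed, {2} gen 2 collections",
                             label, managedMb, GC.CollectionCount(2))
    End Function

    Sub Main()
        Console.WriteLine(HeapReport("before"))
        ' ... run the serial loop or the Parallel.For loop here ...
        Console.WriteLine(HeapReport("after"))
    End Sub

End Module
```

Calling it every few seconds from a timer, or once per batch, should be enough to see whether the graph is flat or climbing.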

Queuing all the workers in one bulk call was the problem.

The ONE line of code that needed fixing: Parallel.For(0, mFiles.Count, AddressOf ParallelProcessFile)
(The upper bound of Parallel.For is exclusive, so it is Count, not Count - 1.)

To get a flat memory graph:
1) Batch allocate workers - Parallel.For(0, SubsetOfFiles.Count, AddressOf ParallelProcessFile)
2) wait for batch completion (Parallel.For returns only when the whole batch is done)
3) call GC so it can coalesce memory regions
4) Goto 1

creating the subset is left as an exercise for the reader..
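
For readers who would rather skip the exercise, a minimal sketch of the batching loop might look like the following. It is not my production code: ProcessInBatches, ProcessFile, the placeholder file names and the threadsPerProcessor value of 2 are all assumptions made for illustration, and the worker here takes a file path rather than an index.

```vb
Imports System
Imports System.Collections.Generic
Imports System.Threading.Tasks

Module BatchedParallelSketch

    ' Hypothetical stand-in for the real per-file worker.
    Private Sub ProcessFile(path As String)
        ' ... per-file tallying and duplicate checks would go here ...
    End Sub

    ' Runs the worker over all files, one batch at a time, and forces a
    ' collection between batches. Returns the number of batches that ran.
    Public Function ProcessInBatches(files As IList(Of String),
                                     batchSize As Integer,
                                     worker As Action(Of String)) As Integer
        If batchSize < 1 Then Throw New ArgumentOutOfRangeException("batchSize")
        Dim batches As Integer = 0
        Dim first As Integer = 0
        While first < files.Count
            ' Parallel.For takes an inclusive start and an exclusive end.
            Dim limit As Integer = Math.Min(first + batchSize, files.Count)
            Parallel.For(first, limit, Sub(i As Integer) worker(files(i)))
            ' Parallel.For has returned, so every worker in this batch is done
            ' and nothing is allocating while the collection runs.
            GC.Collect()
            batches += 1
            first = limit
        End While
        Return batches
    End Function

    Sub Main()
        Dim files As New List(Of String) From {"a.dat", "b.dat", "c.dat"} ' placeholder names
        Dim threadsPerProcessor As Integer = 2 ' assumed value; tune for the workload
        Dim batchSize As Integer = Environment.ProcessorCount * threadsPerProcessor
        ProcessInBatches(files, batchSize, AddressOf ProcessFile)
    End Sub

End Module
```

The batch size is the knob: smaller batches mean more rest breaks and a flatter memory graph, at the cost of more idle CPU time between batches.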

Moral of the story:
Calling GC after one worker ends does not help, because the other threads are still consuming memory.
Let ALL the workers rest: the GC then has access to ALL memory without interference, and it will do a thorough clean up.
 