Tom Anderson
Hello!
We're working on performance-tuning an application. I won't go into the
details, but basically, there's a stack of tasks: tasks are popped off the
stack, executed, and any resulting new work units (of which there are
generally hundreds) are pushed back onto the stack. Effectively, the app is
making a depth-first traversal of a tree, in which the nodes are tasks. We
limit the depth to which we explore the tree, and also the total number of
tasks done, so that we're not doing this forever. Along the way, results
are generated, and these are accumulated in a list; the results have a
'goodness' metric, and we actually only keep track of the 20 best so far.
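In outline, the single-threaded loop looks like this - a minimal sketch, in
which WorkItem, Result, AddIfGoodEnough and the limits are made-up stand-ins
for our real types, not the actual code:

using System.Collections.Generic;

// Hypothetical stand-ins for the real task and result types.
class WorkItem
{
    public int Depth;
    public IEnumerable<Result> Execute() { yield break; }    // the domain work
    public IEnumerable<WorkItem> Children() { yield break; } // hundreds, at Depth + 1
}

class Result { public double Goodness; }

class Runner
{
    const int MaxDepth = 10, MaxTasks = 1000000, Keep = 20;

    static void Main()
    {
        var stack = new Stack<WorkItem>();
        stack.Push(new WorkItem());
        var best = new List<Result>();   // the 20 best results so far
        int done = 0;

        while (stack.Count > 0 && done++ < MaxTasks)   // bound the total work
        {
            WorkItem item = stack.Pop();

            foreach (Result r in item.Execute())
                AddIfGoodEnough(best, r);

            if (item.Depth < MaxDepth)                 // bound the depth
                foreach (WorkItem child in item.Children())
                    stack.Push(child);
        }
    }

    // Traverse the kept results looking for one to kick out.
    static void AddIfGoodEnough(List<Result> best, Result r)
    {
        if (best.Count < Keep) { best.Add(r); return; }
        int worst = 0;
        for (int i = 1; i < best.Count; i++)
            if (best[i].Goodness < best[worst].Goodness) worst = i;
        if (r.Goodness > best[worst].Goodness)
        {
            best.RemoveAt(worst);                      // the remove...
            best.Add(r);                               // ...and the add
        }
    }
}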
The tasks and results are quite small, and the whole app runs in <100 MB
of memory. We have plenty more than that, so there's no paging (I see <10
disk transfers per second in perfmon during a run). I don't know about the
hit rate on the CPU caches - I don't know how to measure that. At any rate,
the app seems to be CPU-bound: it consistently uses >95% of the
CPU while running.
When it's single-threaded, that is. But one thing we've done is to make
the app multithreaded: the decomposition of the work into tasks made this
very simple, in that we just wrapped locks around the stack's push and pop
operations, and the result list's add operation, and then fired several
threads into the main loop. We've verified that it's producing the same
results as when it was single-threaded.
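Concretely, each worker runs something like this - again a sketch, reusing
the made-up names from above, with the stack and the best-list now shared
fields rather than locals:

// Shared state, guarded by locks (illustrative, not our real code).
static Stack<WorkItem> stack = new Stack<WorkItem>();
static List<Result> best = new List<Result>();
static object stackLock = new object();
static object resultLock = new object();

static void Worker()
{
    while (true)
    {
        WorkItem item;
        lock (stackLock)                    // the wrapped pop
        {
            if (stack.Count == 0)
                return;                     // the real code also handles the
                                            // window where the stack is empty
                                            // but another thread is still
                                            // producing children
            item = stack.Pop();
        }

        foreach (Result r in item.Execute())
            lock (resultLock)               // the wrapped add
                AddIfGoodEnough(best, r);

        if (item.Depth < MaxDepth)
            foreach (WorkItem child in item.Children())
                lock (stackLock)            // the wrapped push
                    stack.Push(child);
    }
}

// ...with several threads fired into it (using System.Threading):
// for (int i = 0; i < 3; i++) new Thread(Worker).Start();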
We've got a speedup, but not quite as much as we'd hoped. When we run with
three threads on an otherwise unloaded four-core machine, we get all three
cores running at about 85% capacity - varying from 70% to 90%. The time
taken to finish the job shrinks as you'd expect from that: the run is
almost, but not quite, three times faster than the single-threaded case.
We've tried setting processor affinity on the threads, so they aren't
moving from core to core unnecessarily, and it makes no difference.
So, why are the processors idle 15% of the time?
If a processor's idle, it's because there are no threads to run. We have
three threads; if one can't run, it's because it's blocked. Threads block
on three things: I/O, acquisition of locks held by other threads, and
sleeps (am I missing anything?). The app isn't doing any I/O (as I say
above, a handful of disk transfers a second, and it doesn't touch the
network, the screen, or anything else), and as far as I know, there are no
sleep calls.
But there are locks that get acquired on shared objects - the task stack
and the result list. There could also be locks in the VM or the core
libraries, about which I know nothing. If there were contention on those
locks, it would lead to blocking. Perfmon reports about 20 contentions a
second, which doesn't seem like a lot. The activities protected by those
locks are all fairly simple - pushing or popping a task on a stack
implemented as a linked list, or traversing a list of 20 results looking for
one we can kick out, and doing a remove and an add if we find one. So,
even if there were contention over those locks, I wouldn't expect the
waiting thread to have to wait very long - just the few microseconds it
takes to do those operations. 20 contentions times a few microseconds is on
the order of 100 microseconds of waiting per second, about 0.01% of a
core's time - nowhere near the 150 ms per second that 15% idle represents,
IYSWIM!
Nevertheless, we tried to reduce contention: we went from pushing child
tasks onto the stack one by one to batching them up and pushing a whole
family of siblings onto the stack at once, under a single lock
acquisition. We split the result list into one per thread, with a merge
into a global list at the end of the job, so there's no contended locking
there at all during the run. We got the measured contentions down to 3 per
second, but there was no change in CPU use or throughput.
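The batching and the per-thread lists look roughly like this (sketch again;
AddIfGoodEnough is the kick-out-the-worst routine from earlier):

// One lock acquisition per family of siblings, instead of one per child.
static void PushAll(List<WorkItem> children)
{
    lock (stackLock)
        foreach (WorkItem c in children)
            stack.Push(c);
}

// One result list per thread, so no locking at all during the run...
[ThreadStatic] static List<Result> localBest;

// ...and a single locked merge into the global list at the end of the job.
static void MergeResults()
{
    lock (resultLock)
        foreach (Result r in localBest)
            AddIfGoodEnough(best, r);
}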
We tried increasing the number of threads to greater than the number of
cores, so that if one thread gets blocked, another can do some useful work
with the processor. We did then get higher CPU use, but not much more
throughput, presumably because of the overhead from having more threads.
We'd really like to reduce the overhead in the three-thread case - as much
for intellectual satisfaction as for delivering business value to the
client!
So, any thoughts? Is there an intrinsic overhead in the VM when using
multiple threads?
One thing we haven't looked into is garbage collection: the task objects
are small, but we allocate a lot of them, and the nature of a stack is
that some will be very short-lived, and so die cheaply in the young
generation (gen 0, in CLR terms), but some
will live for almost the whole duration of the run. The long-lived tasks
will need to be swept and moved by the GC. Is the CLR's GC able to do that
in parallel, or does it need to stop mutator threads when it does that? If
so, could that be an explanation for the time my worker threads spend
blocked? How would I find out?
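The one lead I have so far: perfmon apparently has a '% Time in GC' counter
under '.NET CLR Memory', which should show whether collection accounts for
anything like 15% of the time; and I gather the server flavour of the GC,
which collects with a thread per core rather than on a single thread, can be
requested in App.config:

<configuration>
  <runtime>
    <gcServer enabled="true"/>
  </runtime>
</configuration>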
Thanks in advance,
tom