Tom Anderson
Hello!
We're working on performance-tuning an application. I won't go into the
details, but basically, there's a stack of tasks: tasks are popped off the
stack, executed, and any resulting new work units (of which there are
generally hundreds) are pushed back onto the stack. Effectively, the app is
making a depth-first traversal of a tree, in which the nodes are tasks. We
limit the depth to which we explore the tree, and also the total number of
tasks done, so that we're not doing this forever. Along the way, results
are generated, and these are accumulated in a list; the results have a
'goodness' metric, and we actually only keep track of the 20 best so far.
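In outline, the single-threaded loop looks like this - a minimal sketch, in
which WorkItem, Result, AddIfGoodEnough and the limits are made-up stand-ins
for our real types, not the actual code:

using System.Collections.Generic;

// Hypothetical stand-ins for the real task and result types.
class WorkItem
{
    public int Depth;
    public IEnumerable<Result> Execute() { yield break; }    // the domain work
    public IEnumerable<WorkItem> Children() { yield break; } // hundreds, at Depth + 1
}

class Result { public double Goodness; }

class Runner
{
    const int MaxDepth = 10, MaxTasks = 1000000, Keep = 20;

    static void Main()
    {
        var stack = new Stack<WorkItem>();
        stack.Push(new WorkItem());
        var best = new List<Result>();   // the 20 best results so far
        int done = 0;

        while (stack.Count > 0 && done++ < MaxTasks)   // bound the total work
        {
            WorkItem item = stack.Pop();

            foreach (Result r in item.Execute())
                AddIfGoodEnough(best, r);

            if (item.Depth < MaxDepth)                 // bound the depth
                foreach (WorkItem child in item.Children())
                    stack.Push(child);
        }
    }

    // Traverse the kept results looking for one to kick out.
    static void AddIfGoodEnough(List<Result> best, Result r)
    {
        if (best.Count < Keep) { best.Add(r); return; }
        int worst = 0;
        for (int i = 1; i < best.Count; i++)
            if (best[i].Goodness < best[worst].Goodness) worst = i;
        if (r.Goodness > best[worst].Goodness)
        {
            best.RemoveAt(worst);                      // the remove...
            best.Add(r);                               // ...and the add
        }
    }
}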
The tasks and results are quite small, and the whole app runs in <100 MB
of memory. We have plenty more than that, so there's no paging (I see <10
disk transfers per second in perfmon during a run). I don't know about the
hit rate on the CPU caches - I don't know how to measure that. At any rate,
the app seems to be CPU-bound: it consistently uses >95% of the
CPU while running.
When it's single-threaded, that is. But one thing we've done is to make
the app multithreaded: the decomposition of the work into tasks made this
very simple, in that we just wrapped locks around the stack's push and pop
operations, and the result list's add operation, and then fired several
threads into the main loop. We've verified that it's producing the same
results as when it was single-threaded.
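Concretely, each worker runs something like this - again a sketch, reusing
the made-up names from above, with the stack and the best-list now shared
fields rather than locals:

// Shared state, guarded by locks (illustrative, not our real code).
static Stack<WorkItem> stack = new Stack<WorkItem>();
static List<Result> best = new List<Result>();
static object stackLock = new object();
static object resultLock = new object();

static void Worker()
{
    while (true)
    {
        WorkItem item;
        lock (stackLock)                    // the wrapped pop
        {
            if (stack.Count == 0)
                return;                     // the real code also handles the
                                            // window where the stack is empty
                                            // but another thread is still
                                            // producing children
            item = stack.Pop();
        }

        foreach (Result r in item.Execute())
            lock (resultLock)               // the wrapped add
                AddIfGoodEnough(best, r);

        if (item.Depth < MaxDepth)
            foreach (WorkItem child in item.Children())
                lock (stackLock)            // the wrapped push
                    stack.Push(child);
    }
}

// ...with several threads fired into it (using System.Threading):
// for (int i = 0; i < 3; i++) new Thread(Worker).Start();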
We've got a speedup, but not quite as much as we'd hoped. When we run with
three threads on an otherwise unloaded four-core machine, we get all three
cores running at about 85% capacity - varying from 70% to 90%. The time
taken to finish the job shrinks as you'd expect from that: the run is
almost, but not quite, three times faster than the single-threaded case.
We've tried setting processor affinity on the threads, so they aren't
moving from core to core unnecessarily, and it makes no difference.
So, why are the processors idle 15% of the time?
If a processor's idle, it's because there are no threads to run. We have
three threads; if one can't run, it's because it's blocked. Threads block
on three things: I/O, acquisition of locks held by other threads, and
sleeps (am I missing anything?). The app isn't doing any I/O (as I say
above, a handful of disk transfers a second, and it doesn't touch the
network, the screen, or anything else), and as far as I know, there are no
sleep calls.
But there are locks that get acquired on shared objects - the task stack
and the result list. There could also be locks in the VM or the core
libraries, about which I know nothing. If there were contention on those
locks, it would lead to blocking. Perfmon reports about 20 contentions a
second, which doesn't seem like a lot. The activities protected by those
locks are all fairly simple - pushing or popping a task on a stack
implemented as a linked list, or traversing a list of 20 results looking for
one we can kick out, and doing a remove and an add if we find one. So,
even if there were contention over those locks, I wouldn't expect the
waiting thread to have to wait very long - just the few microseconds it
takes to do those operations. 20 contentions times a few microseconds is on
the order of 100 microseconds of waiting per second, about 0.01% of a
core's time - nowhere near the 150 ms per second that 15% idle represents,
IYSWIM!
Nevertheless, we tried to reduce contention: we went from pushing child
tasks onto the stack one by one to batching them up and pushing a whole
family of siblings onto the stack at once, under a single lock
acquisition. We split the result list into one per thread, with a merge
into a global list at the end of the job, so there's no contended locking
there at all during the run. We got the measured contentions down to 3 per
second, but there was no change in CPU use or throughput.
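The batching and the per-thread lists look roughly like this (sketch again;
AddIfGoodEnough is the kick-out-the-worst routine from earlier):

// One lock acquisition per family of siblings, instead of one per child.
static void PushAll(List<WorkItem> children)
{
    lock (stackLock)
        foreach (WorkItem c in children)
            stack.Push(c);
}

// One result list per thread, so no locking at all during the run...
[ThreadStatic] static List<Result> localBest;

// ...and a single locked merge into the global list at the end of the job.
static void MergeResults()
{
    lock (resultLock)
        foreach (Result r in localBest)
            AddIfGoodEnough(best, r);
}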
We tried increasing the number of threads to greater than the number of
cores, so that if one thread gets blocked, another can do some useful work
with the processor. We did then get higher CPU use, but not much more
throughput, presumably because of the overhead from having more threads.
We'd really like to reduce the overhead in the three-thread case - as much
for intellectual satisfaction as for delivering business value to the
client!
So, any thoughts? Is there an intrinsic overhead in the VM when using
multiple threads?
One thing we haven't looked into is garbage collection: the task objects
are small, but we allocate a lot of them, and the nature of a stack is
that some will be very short-lived, and so die cheaply in the young
generation (gen 0, in CLR terms), but some
will live for almost the whole duration of the run. The long-lived tasks
will need to be swept and moved by the GC. Is the CLR's GC able to do that
in parallel, or does it need to stop mutator threads when it does that? If
so, could that be an explanation for the time my worker threads spend
blocked? How would I find out?
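The one lead I have so far: perfmon apparently has a '% Time in GC' counter
under '.NET CLR Memory', which should show whether collection accounts for
anything like 15% of the time; and I gather the server flavour of the GC,
which collects with a thread per core rather than on a single thread, can be
requested in App.config:

<configuration>
  <runtime>
    <gcServer enabled="true"/>
  </runtime>
</configuration>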
Thanks in advance,
tom