[Features a threadpool should have]
| 1 - Threads are dynamically added and removed from the queue depending on
| load.
What defines "load" here? Is it queue size or CPU load? CPU-load logic may
compete against the min/max thread parameters. So what is the algorithm here?
I think it would be CPU load more than anything else. I'm not sold, however,
on that being the correct metric.
The behavior that I explicitly want to avoid is dumping 100 items into a
Queue and having the thread pool immediately spin up 100 threads. Likewise,
if 100 threads are spun up, I wouldn't want to see them all terminated
immediately should there be no items in the queue.
The system thread pool seems to have a timeout of some sort before it spawns
new threads. That seems to work fairly well.
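Something like this is what I'm picturing - a sketch only (the class, field,
and constant names are all made up, and this is not the actual system
ThreadPool algorithm): a thread is added only if work has actually been
waiting for a while, and a thread exits only after it has sat idle for a
while.

using System;
using System.Collections.Generic;
using System.Threading;

// Sketch of a delay-based growth policy. The point is the two timeouts:
// a queued item has to wait GrowDelay before a new thread is spawned, and
// a thread has to sit idle for IdleTimeout before it is allowed to exit.
class ThrottledPool
{
    readonly object _sync = new object();
    readonly Queue<object> _queue = new Queue<object>();
    int _threadCount;
    DateTime _lastDequeueUtc = DateTime.UtcNow;
    static readonly TimeSpan GrowDelay = TimeSpan.FromMilliseconds(500);
    static readonly TimeSpan IdleTimeout = TimeSpan.FromSeconds(30);
    const int MinThreads = 2;
    const int MaxThreads = 25;

    // Called whenever an item is enqueued: grow only if work has been waiting.
    void MaybeGrow()
    {
        lock (_sync)
        {
            bool starved = _queue.Count > 0 &&
                           DateTime.UtcNow - _lastDequeueUtc > GrowDelay;
            if (starved && _threadCount < MaxThreads)
            {
                _threadCount++;
                Thread t = new Thread(WorkerLoop);
                t.IsBackground = true;
                t.Start();
            }
        }
    }

    // Called by an idle worker: shrink only after a full IdleTimeout,
    // and never below MinThreads, so an empty queue doesn't kill the pool.
    bool ShouldExit(DateTime idleSinceUtc)
    {
        lock (_sync)
        {
            return _queue.Count == 0 &&
                   _threadCount > MinThreads &&
                   DateTime.UtcNow - idleSinceUtc > IdleTimeout;
        }
    }

    // Dequeue/process loop omitted here; it updates _lastDequeueUtc on
    // every dequeue and calls ShouldExit when the queue is empty.
    void WorkerLoop() { }
}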
| 2 - Threads have some degree of processor affinity in order to minimize
| cache issues
[Thread Affinity]
IMO, the system will do a better job than we can do here. The successful
usage cases of changing processor affinity are rare. However, I agree there
may be special cases - but I wonder if that will be abused if available.
What are your thoughts here?
I tend to agree with you here, and as a result I have never myself written
code that has processor affinity. Windows has always done a good job of
things.
However, I've been running a lot of tests lately on this machine:
http://www.coversant.net/bigiron.png
That particular machine has 16 physical Itanium2 processors, and some of
its behavior has been unexpected. We consistently see certain processors
load significantly higher than others, and it's tough to tell quite why.
Down deep in the network stack there are some issues with interrupts and
processor affinity (which are solved by the new Chimney features in the
Scalable Networking Pack), so that may explain some of it. On the other
hand, there seems to be room for improvement. This is more of a feeling
than a fact, I'll readily admit. It's also compounded by the lack of
profiling tools that run on Itanium (there are a few, but they're not
nearly as useful for .Net as the Compuware stuff is).
I would really like to have a way to figure out which processor's cache
(especially on these Itanium2 machines with their HUGE caches) my data
structure is already in, have a thread on that processor wake up, process
the item from the cache, and write the results to main memory. I don't
think I'm going to see this any time in the future, but it's on my wish
list. I have that feeling (I know, I know) that this would give a fairly
significant boost to performance as the load increases.
| 3 - The scheduler is smart about assigning work items to threads based on
| processor load.
What do you mean? tia
Many of the threadpools that I have seen use this algorithm (sketched in
code below the list):
1 - thread wakes up
2 - thread checks for shutdown
3 - thread checks for a work item
4 - thread processes the work item
5 - thread goes back to sleep
6 - goto 1
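In code, that loop comes out roughly like this (a sketch with invented
names; the event is assumed to be set by whoever enqueues work or requests
shutdown, which I've left out):

using System.Collections.Generic;
using System.Threading;

// Sketch of the classic worker loop above.
class Worker
{
    readonly object _sync = new object();
    readonly Queue<WaitCallback> _queue = new Queue<WaitCallback>();
    readonly AutoResetEvent _workReady = new AutoResetEvent(false);
    volatile bool _shutdown;   // set (and _workReady signaled) on shutdown

    void WorkerLoop()
    {
        while (true)
        {
            _workReady.WaitOne();              // 1 - thread wakes up
            if (_shutdown) return;             // 2 - check for shutdown

            WaitCallback item = null;
            lock (_sync)                       // 3 - check for a work item
            {
                if (_queue.Count > 0)
                    item = _queue.Dequeue();
            }

            if (item != null) item(null);      // 4 - process the work item
            // 5/6 - loop around and go back to sleep on the event
        }
    }
}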
Some use variations on this, and fancy means of putting threads to sleep -
I've seen a few algorithms:
1 - Keep all threads blocked in a Monitor. Whenever a work item comes in, do
a "Pulse" (or, god help us, a "PulseAll" with a "first one wins" algorithm).
Sketched further below.
2 - Use Events and Waits to test for shutdown, items in the queue, etc.
3 - Items are posted to a Queue; a control thread wakes up and hands the
work item to a particular worker thread.
I would really like to see an algorithm that has, in the "busy" case, as few
context switches as possible. When I post an item to a queue, there should
be exactly 1 context switch needed to process my item.
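Here is variation 1 from the list above, sketched with Monitor.Wait/Pulse
(illustrative names only, not a real implementation). Each Enqueue pulses at
most one sleeping worker, which in the busy case is about as close to one
wakeup - and one context switch - per posted item as this style gets.

using System.Collections.Generic;
using System.Threading;

// Producer/consumer on a single monitor: each Enqueue pulses exactly one
// waiter, so in the busy case one posted item wakes one sleeping thread.
class PulseQueue
{
    readonly object _sync = new object();
    readonly Queue<WaitCallback> _queue = new Queue<WaitCallback>();
    bool _shutdown;

    public void Enqueue(WaitCallback item)
    {
        lock (_sync)
        {
            _queue.Enqueue(item);
            Monitor.Pulse(_sync);          // wake at most one sleeping worker
        }
    }

    public void WorkerLoop()
    {
        while (true)
        {
            WaitCallback item;
            lock (_sync)
            {
                while (_queue.Count == 0 && !_shutdown)
                    Monitor.Wait(_sync);   // releases the lock while asleep
                if (_shutdown && _queue.Count == 0) return;
                item = _queue.Dequeue();
            }
            item(null);                    // run the work outside the lock
        }
    }

    public void Shutdown()
    {
        lock (_sync) { _shutdown = true; Monitor.PulseAll(_sync); }
    }
}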
| 5 - Intelligent about locking on the work item queue
At a minimum, you need to lock on add and remove. Are you thinking of
something specific, or have you just seen poor locking in the TPs you
reviewed?
I am thinking of some specific stuff. That 16-processor Itanium box has
significantly influenced the way I think about locking and contention.
Typically a queue has two locks - one on Enqueue, one on Dequeue. This is
how I've been coding it for years. However, after seeing some performance
that really puzzled me, I spent some time looking into the contention in our
system. While doing that, I came across this:
http://makeashorterlink.com/?Q4132185D
I played with this test for a while, and ran it on single-, dual-, quad-,
and 16-processor boxes, and got some shocking results. After looking into
this more and more, I became quite alarmed at just how much contention we
were seeing. It was.... shocking. I have all the numbers recorded, and will
be doing a blog entry on my findings here in the not-too-distant future.
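For concreteness, the two-lock arrangement I'm talking about is roughly
this (a sketch of the pattern, not our actual code):

// The classic two-lock queue: one lock guards the head (Dequeue), one
// guards the tail (Enqueue), so producers and consumers don't block each
// other - but producers still contend with producers, and consumers with
// consumers.
class TwoLockQueue<T>
{
    class Node { public T Value; public Node Next; }

    readonly object _headLock = new object();
    readonly object _tailLock = new object();
    Node _head, _tail;

    public TwoLockQueue() { _head = _tail = new Node(); }   // dummy node

    public void Enqueue(T item)
    {
        Node node = new Node();
        node.Value = item;
        lock (_tailLock)
        {
            _tail.Next = node;
            _tail = node;
        }
    }

    public bool TryDequeue(out T item)
    {
        lock (_headLock)
        {
            Node first = _head.Next;
            if (first == null) { item = default(T); return false; }
            item = first.Value;
            _head = first;       // first node becomes the new dummy
            return true;
        }
    }
}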
Anyway, after much wailing and gnashing of teeth, I stumbled across
thread-safe, lock-free data structures:
http://www.boyet.com/Articles/LockfreeQueue.html and have been educating
myself.
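The shape of the thing is roughly this - a CAS-based queue in the spirit of
the Michael & Scott algorithm (my own sketch, not the code from that
article, and completely untuned):

using System.Threading;

// A lock-free unbounded queue: no locks, just Interlocked.CompareExchange
// on the head, tail, and next links. (In real code the _head/_tail/Next
// reads would also want volatile treatment so the retry loops can't spin
// on a cached value.)
class LockFreeQueue<T>
{
    class Node { public T Value; public Node Next; }

    Node _head;
    Node _tail;

    public LockFreeQueue() { _head = _tail = new Node(); }   // dummy node

    public void Enqueue(T item)
    {
        Node node = new Node();
        node.Value = item;
        while (true)
        {
            Node tail = _tail;
            Node next = tail.Next;
            if (tail != _tail) continue;               // stale snapshot, retry
            if (next == null)
            {
                // Link the new node after the current tail...
                if (Interlocked.CompareExchange(ref tail.Next, node, null) == null)
                {
                    // ...then try to swing the tail (ok if another thread wins).
                    Interlocked.CompareExchange(ref _tail, node, tail);
                    return;
                }
            }
            else
            {
                // Tail is lagging behind; help move it forward and retry.
                Interlocked.CompareExchange(ref _tail, next, tail);
            }
        }
    }

    public bool TryDequeue(out T item)
    {
        while (true)
        {
            Node head = _head;
            Node tail = _tail;
            Node next = head.Next;
            if (head != _head) continue;               // stale snapshot, retry
            if (head == tail)
            {
                if (next == null) { item = default(T); return false; }  // empty
                Interlocked.CompareExchange(ref _tail, next, tail);     // help
            }
            else
            {
                item = next.Value;
                if (Interlocked.CompareExchange(ref _head, next, head) == head)
                    return true;
            }
        }
    }
}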
I added these into the test suites, and got more interesting results. On a
single- or dual-processor machine they're significantly slower. On a 16-way
machine, they're waaaaaay faster.
The end result is that on different processor architectures, and with
different numbers of processors, the locking mechanisms need to change. The
difference in some cases was a full 4 orders of magnitude (in one very
dramatic case, what took 85 milliseconds on a single-processor x86 machine
took 200 milliseconds on a dual-processor machine, 2 minutes on a
quad-processor machine, and more than 30 minutes on a 16-way machine).
Granted, these cases were contrived and consisted of almost nothing but
thread contention, but they illustrated the point and let me pick the right
data structures.
My hypothetical thread pool would take this into account and have a variety
of locking mechanisms. It would even be especially nice if, in some cases, a
work item could be handed directly to a thread with no lock required at all.
In the "lots of threads awake" case, it seems as if there should be a way to
do this.
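Something along these lines is what I'm imagining for the direct hand-off -
parked workers each own a slot, a producer claims one with a single
CompareExchange and signals that thread directly, and it falls back to the
locked queue only when nobody is parked. Very much a sketch; all of the
names are invented:

using System.Threading;

// Lock-free hand-off: each parked worker owns a slot; a producer claims an
// idle worker with one CompareExchange and signals it directly.
class HandOffSlot
{
    public int Idle;                              // 1 = parked and claimable
    public object Item;
    public AutoResetEvent Wake = new AutoResetEvent(false);
}

static class HandOff
{
    // Producer side: try to give the item straight to a parked worker.
    public static bool TryHandOff(HandOffSlot[] workers, object item)
    {
        foreach (HandOffSlot w in workers)
        {
            // Claim the worker by flipping Idle from 1 to 0 atomically.
            if (Interlocked.CompareExchange(ref w.Idle, 0, 1) == 1)
            {
                w.Item = item;        // we own the slot until the worker wakes
                w.Wake.Set();         // exactly one thread wakes for this item
                return true;
            }
        }
        return false;                 // nobody parked: fall back to the queue
    }

    // Worker side: park on our own slot and wait to be handed an item.
    public static object Park(HandOffSlot mySlot)
    {
        Interlocked.Exchange(ref mySlot.Idle, 1);   // advertise as idle
        mySlot.Wake.WaitOne();                      // sleep until claimed
        object item = mySlot.Item;
        mySlot.Item = null;
        return item;
    }
}

The scan over the workers is O(number of threads), but the point is that the
hot path never touches a shared lock.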
Now, I'll be the first to admit that locking and unlocking the work item
queue shouldn't be too big a hit, especially when compared to parsing some
XML, string manipulation, etc. - but these locks are hit at least twice per
operation (one in, one out), so at 10k operations per second that's 20,000+
lock acquisitions per second, and that adds up quickly.