I think a lot of people are aware that an Opteron system has fewer
bandwidth restrictions with a lot of processors, but that Woodcrests
don't have as good a memory controller and fall behind Opterons after
4 cores or so. I'm asking how severe this is. Heavy number crunching of
huge data sets in RAM is a bandwidth-intensive operation. So, I'm
asking how badly Woodcrests are impacted above 4 cores, for example, 8
cores vs 4 cores, on bandwidth performance. I didn't think this was
that vague. Is there anything else I can tell you that will make the
question less difficult to answer?
Your question is difficult to answer because you'd first need to know
(at least approximately) the ratio of FLOPs to memory accesses in your
program, and the pattern of those accesses. It all boils down to that.
If your program can keep the CPU busy for long stretches without
needing the memory bus, it will definitely benefit from more
CPUs/cores. If, on the other hand, it misses the cache very frequently
and has to load from or store to main RAM, you will have contention on
the memory bus and your performance per CPU will degrade.
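
To make the "it all boils down to the ratio" point concrete, here is a
rough roofline-style sketch. It assumes all cores share a single pool of
memory bandwidth and split it evenly, which is a simplification. The
function name and every number in it are made up for illustration; they
are not measurements of any Woodcrest or Opteron system.

```python
def per_core_gflops(peak_gflops, shared_bw_gbs, flops_per_byte, cores):
    """Roofline-style upper bound on per-core throughput (hypothetical model).

    peak_gflops     peak compute rate of one core (GFLOP/s)
    shared_bw_gbs   memory bandwidth shared by all cores (GB/s)
    flops_per_byte  FLOPs the program performs per byte moved to/from RAM
    cores           number of cores competing for that bandwidth
    """
    # Assumption: the shared bandwidth is divided evenly among the cores.
    bw_per_core = shared_bw_gbs / cores
    # A core cannot run faster than its peak, nor faster than memory feeds it.
    return min(peak_gflops, flops_per_byte * bw_per_core)


if __name__ == "__main__":
    # Made-up numbers, for illustration only.
    for intensity in (0.1, 1.0, 10.0):
        for cores in (4, 8):
            bound = per_core_gflops(10.0, 20.0, intensity, cores)
            print(f"{intensity:5.1f} flop/byte, {cores} cores: "
                  f"<= {bound:.2f} GFLOP/s per core")
```

In this model a program with many FLOPs per byte stays at the compute
peak whether it runs on 4 or 8 cores, while one with few FLOPs per byte
sees its per-core bound halve when the core count doubles. On a system
where bandwidth grows with the number of sockets, shared_bw_gbs would
grow too, so the penalty would be smaller.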
You ask "how badly" your app will degrade. The way to model and
predict that is to use the hardware performance counters (OProfile
under Linux, cputrack on Solaris, etc.), which give you an idea of the
rate of instructions relative to everything else (loads/stores to RAM,
retired FLOPs, cache misses, TLB misses, etc.). But of course the best
way is to measure your program on the real thing.
I wanted to post this even if it's a bit late on the thread because
right now I have exactly this kind of problem.
We're trying to figure out if a dual-Quadcore (Xeon) will be better
(cost/benefit wise) than a 4-way Opteron dualcore, for *our* program.
SPEC CPU2006 can give you some pretty good insights on this: go to
the advanced query option and list all available results, but filter
by "number of total cores" equal to 8. Go straight to the int_rate and
fp_rate figures, and you'll be able to see how 4-way dual-core
Opterons compare to dual quad-core Xeons. At least on the SPEC CPU2006
suite, that is: its programs have quite big working sets, although
they may not be as RAM-bottlenecked as your particular program.
As you say, Opterons do definitely have a much better memory system.
But then a 4-way mobo is WAY more expensive than a dual-socket one...
And btw, if you want to benchmark just memory bandwidth, STREAM
(http://www.cs.virginia.edu/stream/) is the way to go.
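
If you just want a quick sanity check before building STREAM, something
like the sketch below gives a very rough single-threaded copy figure.
This is not STREAM and is no substitute for it: the function name, the
default buffer size and the repeat count are my own choices, and it only
times one bulk copy between two buffers. The buffers need to be much
larger than the caches for the number to mean anything.

```python
import time


def copy_bandwidth_gbs(n_bytes=64 * 1024 * 1024, repeats=5):
    """Rough copy bandwidth in GB/s (hypothetical helper, not STREAM)."""
    src = bytearray(n_bytes)
    dst = bytearray(n_bytes)
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        dst[:] = src  # bulk copy of the whole buffer
        elapsed = time.perf_counter() - start
        best = min(best, elapsed)
    # Guard against a zero reading from the timer on very small buffers.
    best = max(best, 1e-9)
    # Count bytes read plus bytes written, as STREAM does for its Copy kernel.
    return 2 * n_bytes / best / 1e9


if __name__ == "__main__":
    print(f"copy: {copy_bandwidth_gbs():.2f} GB/s (best of 5)")
```

To see contention rather than a single-core number, you would run one
copy of this per core at the same time and compare the per-process
figures at 4 and 8 processes.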
Cheers,
JL