Tony Hill said:
> It does, but the difference is small, usually less than 10% and often
> much closer to 0%.
No, it's not. The Opteron makes for the best 4-cpu SMP system out
there according to the SPECrate2000 cpu benchmark, but to get that
best result you need to pin the individual processes to cpus and
memory using a utility. Without it, the performance is no longer the
best. So people really care about that last bit of performance.
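The utility isn't named above; on Linux the usual tools are numactl
(cpu and memory binding) and taskset (cpu only), and the same binding
can be done from inside a program. Here is a minimal sketch of the
programmatic version, assuming Linux with glibc and libnuma; it's an
illustration of the mechanism, not the exact procedure used for the
numbers below:

  /* Pin the calling process to one cpu and ask for node-local memory.
   * Sketch only: assumes Linux, glibc's sched_setaffinity(), and
   * libnuma (link with -lnuma). */
  #define _GNU_SOURCE
  #include <sched.h>
  #include <stdio.h>
  #include <stdlib.h>
  #include <numa.h>

  int main(int argc, char **argv)
  {
      int cpu = (argc > 1) ? atoi(argv[1]) : 0;  /* cpu to bind to */
      cpu_set_t mask;

      CPU_ZERO(&mask);
      CPU_SET(cpu, &mask);
      if (sched_setaffinity(0, sizeof(mask), &mask) != 0) {
          perror("sched_setaffinity");
          return 1;
      }
      if (numa_available() != -1)
          numa_set_localalloc();   /* allocate on the node we run on */

      /* ... run or exec the real workload here ... */
      printf("bound to cpu %d\n", cpu);
      return 0;
  }

From the command line, something like
numactl --physcpubind=N --localalloc ./benchmark
does the same job. The "node interleave" mentioned below is presumably
the BIOS memory-interleaving setting, which spreads every page across
all the nodes and so defeats exactly this kind of locality.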
Now I don't have a direct comparison for that, but here's a
comparison of some benchmarks from a recent competitive bid. "Slow" is
a system without the processor binding and with "node interleave"
turned on. "Fast" is with processor binding and node interleave off,
which lets the processor binding have the best benefit. Note that it's
only a trivial amount of work to get this improvement for a serial
code, so this is a common situation, although these benchmarks are, of
course, particular to this scientific-computing customer. In these
results, the comparison is scaling for 4 processes on a 4 cpu machine.
4.0 would be a perfect score.
              fast   slow   difference
benchmark 1   3.71   3.03     +22 %
benchmark 2   3.76   3.29     +14 %
benchmark 3   3.78   3.26     +16 %
benchmark 4   3.79   3.45     +10 %
benchmark 5   3.92   3.89      +1 %
benchmark 6   3.88   3.71      +5 %
These benchmarks were run with the best Opteron compiler, so it was
very good to see this additional scaling improvement on top of that.
And it's bigger than "usually less than 10%".
> When well over 90% of your memory access is coming
> from cache anyway and (assuming a totally random distribution in a
> strictly UMA setup) 50% of your memory access is going to be local,
> most of the performance difference is lost in the noise.
Handwaving is a bad way to evaluate effects like this.
> I've said it before and I'll say it again: Hardware is cheap,
> software is expensive. It would be a true disservice to your
> customers to tell them to spend thousands upon thousands of dollars
> changing all their software for the small improvement in performance
> equal to a few hundred dollars of hardware costs.
Customers know what 10% or 20% more performance means, as do vendors
who are doing competitive bidding. The fact that I care a lot about
this should give you a clue. And in some cases, such as serial codes,
the benefits are easy to achieve. It took only a moderate amount of
work in our OpenMP compiler and runtime to get these benefits for some
parallel programs, too. Well worth it to our customers.
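For reference, a common way to get this kind of locality for OpenMP
codes (not necessarily what our runtime does) is to bind the threads
to cpus and rely on first-touch page placement: initialize the data
with the same loop distribution the computation will use, so each
page lands on the node of the thread that works on it. A rough
source-level sketch of that idea, with a made-up array and loop:

  /* First-touch NUMA placement sketch for OpenMP (illustrative only).
   * With threads bound to cpus, each page of a[] is first touched,
   * and therefore allocated, on the node of the thread that later
   * computes on it, because both loops use the same static schedule.
   * Compile with your compiler's OpenMP flag. */
  #include <stdio.h>
  #include <stdlib.h>

  #define N (1 << 22)

  int main(void)
  {
      double *a = malloc(N * sizeof *a);
      double sum = 0.0;
      long i;

      if (a == NULL)
          return 1;

      /* Initialization loop: same static distribution as below. */
      #pragma omp parallel for schedule(static)
      for (i = 0; i < N; i++)
          a[i] = (double)i;

      /* Compute loop: each thread now reads mostly node-local pages. */
      #pragma omp parallel for schedule(static) reduction(+:sum)
      for (i = 0; i < N; i++)
          sum += a[i] * a[i];

      printf("sum = %g\n", sum);
      free(a);
      return 0;
  }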
-- greg
speaking for myself, not PathScale