David Kanter
krw said: I believe the IBM 360/195 and 370/95 both had SMT. It was called
"dual I-stream". Two "threads" were executed (I believe)
simultaneously.
Keith, if you mean the RS64-IV, that was not SMT; it was switch-on-event
multithreading, i.e. one thread would continue to execute until
a stall condition occurred, and then it would switch threads.
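To make the distinction concrete, here is a toy sketch of switch-on-event scheduling in Python. It is not a model of the RS64-IV: the function name, the event lists, and the "stall" marker are all made up for illustration, and it assumes a stalled thread is ready again by the time control comes back to it.

```python
def switch_on_event(threads):
    """Run one thread at a time; switch only when it hits a 'stall'.

    threads is a list of event lists (hypothetical format): any string
    other than "stall" counts as an executed instruction.
    """
    positions = [0] * len(threads)
    current = 0
    trace = []
    while any(pos < len(t) for pos, t in zip(positions, threads)):
        # Skip over a thread that has already run to completion.
        if positions[current] >= len(threads[current]):
            current = (current + 1) % len(threads)
            continue
        event = threads[current][positions[current]]
        positions[current] += 1
        if event == "stall":
            # The stall is the switch event: hand the core to the next thread.
            current = (current + 1) % len(threads)
        else:
            trace.append((current, event))
    return trace


if __name__ == "__main__":
    demo = [["a1", "a2", "stall", "a3"], ["b1", "stall", "b2"]]
    for thread_id, op in switch_on_event(demo):
        print(thread_id, op)
```

Only one thread issues at any moment here, which is the point: under SMT, instructions from both threads could issue in the same cycle.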
[snip]
I don't believe memory latency is the issue at all, at least not
directly. Even SMT can't solve a 500 instruction "hole". SMT is
supposed to cover pipe flushes caused by branches (and
mispredicts). The other thread can still execute (utilize
execution units) while the first flushes and refills.
500 instructions might be feasible. Assume that 1/3 of instructions
generate memory requests, and the on-die caches have a hit rate of
99.5% --> 1/200 memory refs miss in cache --> 1/600 instructions cause
an off-die memory reference. Even if you assume 2/5 of instructions
are mem refs, you still end up with 1/500.
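The arithmetic above is easy to check. A minimal sketch, using exact fractions; the function name is mine, and the inputs are just the assumed figures from the paragraph, not measurements:

```python
from fractions import Fraction


def instructions_per_offdie_miss(mem_ref_fraction, cache_hit_rate):
    """Average number of instructions between off-die memory references."""
    miss_rate = 1 - cache_hit_rate
    return 1 / (mem_ref_fraction * miss_rate)


hit_rate = Fraction(995, 1000)  # 99.5% on-die hit rate (assumed above)
one_third = instructions_per_offdie_miss(Fraction(1, 3), hit_rate)
two_fifths = instructions_per_offdie_miss(Fraction(2, 5), hit_rate)
print(one_third, two_fifths)  # 600 500
```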
Sure, but they're inevitable.
Yes, but they should be minimized.
I'm not buying it. As I said, the primary problem that SMT is
trying to solve is pipe bubbles caused by branch mispredicts.
I agree that it's a benefit, but I have a hard time seeing that as a
bigger motivator than cache misses. If you look at any realistic CPI
breakdown, memory is always the biggest component, by a long shot.
Here's a paper on the subject, identifying branch misprediction as a
minor problem for OLTP workloads:
http://www.cs.cmu.edu/~damon2006/pdf/saylor06oltp.pdf
Here is an analysis of the POWER5:
http://www-128.ibm.com/developerworks/power/library/pa-cpipower2/?ca=dgr-lnxwCPIP2
In each case, the branch prediction penalty is minuscule compared to
cache misses.
X86, perhaps. When you're register poor and there is a branch every
five instructions this isn't surprising.
Actually x86 has the highest IPC chip for SPECint:
Chip - SPECint2000 score
P5+ 2.3GHz - 1820
Woodcrest 3GHz - 3089
Opteron 3GHz - 1942
P4 Xeon 3.8GHz - 1854
Itanium 1.6GHz - 1590
Converting that into SPECint/GHz you get:
P5+ - 791
Woodcrest - 1029
Opteron - 647
P4 Xeon - 487
Itanium - 993
SPECint/GHz will be proportional to IPC for these processors. So
actually x86 has the highest IPC processor, followed by IPF, then PPC.
Note that this comparison is only using server processors, while
desktop processors are slightly faster due to lower memory latency.
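For anyone who wants to redo the division, a short sketch using only the scores and clock speeds listed above (the dictionary layout and variable names are mine; results are truncated to whole numbers to match the figures quoted):

```python
# chip: (SPECint2000 score, clock in GHz), as listed in the post
scores = {
    "P5+": (1820, 2.3),
    "Woodcrest": (3089, 3.0),
    "Opteron": (1942, 3.0),
    "P4 Xeon": (1854, 3.8),
    "Itanium": (1590, 1.6),
}

# SPECint per GHz, used here as a rough stand-in for IPC.
per_ghz = {chip: score / ghz for chip, (score, ghz) in scores.items()}

# Highest SPECint/GHz first.
ranking = sorted(per_ghz, key=per_ghz.get, reverse=True)
for chip in ranking:
    print(f"{chip:10s} {int(per_ghz[chip])}")
```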
DK