Yousuf said:
Probably even before that, when it was still just a theory being
bandied about. I guess EV8 was the first product they talked about
having it, that I can recall.
Indeed. The EV8 was designed to be wide, but I've heard rumors that it
was planned to be 8-wide before they settled on 4-way SMT.
Yeah, ancestral SMT, about as relevant as dinosaurs.
It's not simultaneous, i.e. it's just time slice MT.
The retirement capabilities of the P4 *were* a bottleneck for P4, in
addition to memory latency. The retirement capabilities of all other
architectures were much higher than P4's, x86 or not.
Um...can you provide evidence to back those statements up? K7/K8 and
P3 can only retire 3 micro operations/cycle.
I have yet to see any conclusive proof that the P4 was retirement
limited. Can you cite any serious studies which show retirement as a
bottleneck?
I have done some performance analysis for a broad spectrum of
benchmarks on the P4, and I see very little which indicates retirement
is an issue. In fact, if anything, I see evidence that the bottleneck
lies with other elements of the design.
Yeah, and then they invented caches and all of a sudden memory latency
is not that much of an issue, and other parts of the processor do become
an issue.
You do realize that caches had been around long before the MTA was
designed, right? In fact, not only were caches around, but caches had
been integrated into CPUs. The Tera and Cray folks really didn't
believe in caches, because for some applications they are useless.
You also realize that even with caches, multithreading is required to
tolerate cache miss latency, which substantially contributes to CPI.
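To put rough numbers on how miss latency feeds into CPI, here is a minimal sketch of the usual textbook stall model. The function name and every figure in it are hypothetical and not measurements of the P4 or any other chip, and it assumes misses are not overlapped at all:

```python
def effective_cpi(base_cpi, misses_per_instr, miss_penalty_cycles):
    """CPI once stall cycles from cache misses are added in.

    Assumes every miss stalls the pipeline for the full penalty,
    i.e. no overlap between misses (a pessimistic simplification).
    """
    return base_cpi + misses_per_instr * miss_penalty_cycles


if __name__ == "__main__":
    # Hypothetical inputs, chosen only for illustration.
    base = 0.5        # CPI with a perfect memory system
    misses = 0.01     # misses to memory per instruction
    penalty = 200     # cycles per miss
    cpi = effective_cpi(base, misses, penalty)
    stall_share = (cpi - base) / cpi
    print(f"effective CPI: {cpi:.2f}, share due to misses: {stall_share:.0%}")
```

With inputs like these the stall term dominates the base CPI, which is the sense in which miss latency "substantially contributes".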
Yeah, or it could just take instruction streams from independent
processes. That's what Hyperthreading was doing most of the time.
That's not exactly the processor's fault, that problem lies with the
OS...
Who cares about memory bottlenecks?
I dunno, why don't you ask someone who designs MPUs for a living?
They'd probably tell you almost everyone.
Stop diverting the subject, David.
You're the only one bringing up memory bottlenecks as the reason for
SMT.
It is one reason; there are others. However, the biggest benefit of
multithreading is to alleviate memory bottlenecks. Look at the
performance gain that SoEMT provides on Northstar versus SMT on the
POWER5. SoEMT provides most of the benefits of SMT...and all it does
is switch on memory stalls.
Really? Let me quote for you what the creators of SMT said:
"The objective of SMT is to substantially increase processor
utilization in the face of both long memory latencies and limited
available parallelism per thread."
Gosh it sounds to me like the folks who devised SMT thought that long
memory latencies were an issue that was important to address.
Ironically enough, the folks at UW were working closely with DEC, which
designed the EV7, which had both an integrated memory controller and
truly glueless and scalable MP.
That may have been the case in ancient
times, but now we have caches and inboard memory controllers. These
days, SMT is used to increase IPC, which means increasing the
instruction retirement rate.
Yousuf, I think we are talking past each other. You're saying that SMT
should increase retirement rate, which is true. It is trivial that any
change in architecture which improves performance, while leaving path
length and frequency unchanged, must improve IPC, which in turn must
improve the retirement rate.
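The relation behind that statement is the standard one: execution time is path length divided by IPC times frequency. A minimal sketch, with a hypothetical function name and made-up numbers:

```python
def execution_time(path_length, ipc, frequency_hz):
    """Seconds to retire path_length instructions.

    time = instructions / (instructions per cycle * cycles per second)
    """
    return path_length / (ipc * frequency_hz)


if __name__ == "__main__":
    # Hypothetical values: 1e9 instructions on a 3 GHz part.
    before = execution_time(1e9, ipc=1.0, frequency_hz=3e9)
    after = execution_time(1e9, ipc=1.25, frequency_hz=3e9)
    # Same path length and frequency, so the only way time dropped
    # is that more instructions retired per cycle.
    print(f"speedup: {before / after:.2f}x")
```

Holding the other two terms fixed, a shorter run time can only come from higher IPC, and sustained IPC is by definition the average retirement rate.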
I'm saying that SMT improves performance (i.e. improves IPC) because it
enables you to extract more memory parallelism and overlap many cache
accesses, i.e. it alleviates memory bottlenecks.
However, your assertion that the P4 is retirement bound is simply
wrong. AFAIK, all server or desktop MPUs, ignoring Niagara, achieve
less than half their peak retirement rate. IOW, there is no way that
retirement could be the bottleneck with 2-way multithreading.
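The arithmetic is simple enough to sketch. The numbers below are hypothetical: a peak of 3 per cycle (the figure quoted earlier for K7/K8 and P3) and a made-up sustained rate below half of that. The sketch also assumes, generously, that each thread keeps its full single-thread rate when sharing the core.

```python
def combined_demand(per_thread_rate, threads):
    """Upper bound on retirement demand per cycle.

    Assumes each thread sustains its single-thread rate, which is
    optimistic because the threads share execution resources.
    """
    return per_thread_rate * threads


def is_retirement_bound(peak_rate, per_thread_rate, threads):
    """True if the combined demand reaches the peak retirement rate."""
    return combined_demand(per_thread_rate, threads) >= peak_rate


if __name__ == "__main__":
    peak = 3.0        # retired operations per cycle, assumed peak
    sustained = 1.2   # hypothetical, below half of peak
    print(is_retirement_bound(peak, sustained, threads=2))
```

As long as a single thread stays under half of peak, two threads together stay under peak, so retirement has headroom even in this best case.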
DK