65nm news from Intel

  • Thread starter: Yousuf Khan
In comp.sys.ibm.pc.hardware.chips Peter Boyle said:
What is the evidence to back up this claim?

Logic. When else can SMT really do net increased work?
If you want to test, run some pointer-chasers.
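
(For anyone who wants to try that, a minimal pointer-chaser sketch in C
follows; the array size is just illustrative. Every load depends on the
previous one, so the loop spends most of its time stalled on memory -
exactly the case where SMT has idle issue slots for the other thread to
fill.)

  #include <stdio.h>
  #include <stdlib.h>
  #include <time.h>

  #define N (16u * 1024u * 1024u)        /* ~128 MB of size_t on a 64-bit box */

  int main(void)
  {
      size_t *next = malloc((size_t)N * sizeof *next);
      size_t i, p;
      clock_t t0;

      if (!next) return 1;

      /* Build one big random cycle (Sattolo shuffle of the identity). */
      for (i = 0; i < N; i++) next[i] = i;
      srand(1);
      for (i = N - 1; i > 0; i--) {
          size_t j = (size_t)rand() % i;   /* j < i keeps it a single cycle */
          size_t t = next[i]; next[i] = next[j]; next[j] = t;
      }

      /* The chase: every load has to wait for the previous one. */
      t0 = clock();
      for (p = 0, i = 0; i < N; i++) p = next[p];
      printf("%.1f ns per dependent load (final p=%zu)\n",
             (double)(clock() - t0) / CLOCKS_PER_SEC * 1e9 / N, p);

      free(next);
      return 0;
  }
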
I would however claim that functional units are almost free,

This is getting more and more true as caches grow, but only
from an areal perspective. A multiplier still sucks back a
huge amount of power and tosses it as heat.
CMP will also help the former [bandwidth].

Nope, not without a second memory bus and all those pins.

-- Robert
 
|>
|> > Code type matters. SMT is best for continuing work during
|> > the ~300 clock memory fetch latency.
|>
|> What is the evidence to back up this claim?
|>
|> Not theories, but _evidence_ of bigger speed up compared to,
|> for example, switch on event multi-threading, or CMP with simpler
|> and smaller processors, but not sharing L1 cache.
|>
|> Note that I'm not claiming evidence the other way, but as far as
|> I can tell the jury is out on the best organisation for concurrency
|> on chip.

I should be happy to see even a theoretical analysis - I wasn't
impressed by Eggers's omission of a comparable CMP for comparison
purposes.


Regards,
Nick Maclaren.
 
|>
|> I think that Nick is muddled on this one. If the base implementation is
|> already OoO then there will normally be many more physical registers than
|> architected ones. To go two-way SMT may not involve adding any physical
|> registers, but rather involve changes to renaming. "dual port" every
|> execution unit doesn't make much sense to me. Access to execution units from
|> either virtual processor is essentially free - they are after all virtual
|> processors, not real. What is required is that every bit of *architected*
|> processor state be renamed or duplicated, perhaps that's what Nick is
|> getting at?

You haven't allowed for the problem of access. A simple (CMP)
duplication doesn't increase the connectivity, and can be done
more-or-less by replicating a single core; SMT does, and may need
the linkages redesigning. This might be a fairly simple task for
2-way SMT, though there have been reports that it isn't even for
that, but consider it for 8-way.

|> > You have to mangle any performance counters and many privileged
|> > registers fairly horribly, because their meanings and constraints
|> > change. Similarly, you have to add logic for CPU state change
|> > synchronisation, because some changes must affect only the current
|> > thread and some must affect both. And you have to handle the case
|> > of the two threads attempting incompatible operations simultaneously.
|>
|> What operations are incompatible? SMT as implemented in the Pentium 4, say,
|> allows either virtual processor to do what it likes. One can transition from
|> user to kernel and back while the other services interrupts or exceptions or
|> whatever. The only coordination needed for proper operation is what is
|> needed for two processors - of course the performance may suffer though.

ABSOLUTELY NOT!

Look at the performance counters, think of floating-point modes
(in SMT, they may need to change for each operation), think of
quiescing the other CPU (needed for single to dual thread switching),
think of interrupts (machine check needs one logic, and underflow
another). In ALL cases, on two CPUs, each can operate independently,
but SMT threads can't.

|> yes, lots of speculation. The difference here is that two CMP processors take
|> about twice the silicon of one, while with SMT you have the option to use
|> 1.5 cores' worth of silicon. Perhaps once >dual cores is cheap and easy SMT
|> will die because it's more effort than it's worth, but my bet is that chips
|> will go both routes with SMT and CMP. Just one more little problem for the
|> OS developers to deal with :)

I am referring to the fair comparison between a 2-way SMT and a
dual-core CMP using the same amount of silicon, power etc. THAT
is what should have been compared - but I can find no evidence
that it was (though it probably was).


Regards,
Nick Maclaren.
 
Nick Maclaren said:
|>
|> I think that Nick is muddled on this one. If the base implementation is
|> already OoO then there will normally be many more physical registers than
|> architected ones. To go two-way SMT may not involve adding any physical
|> registers, but rather involve changes to renaming. "dual port" every
|> execution unit doesn't make much sense to me. Access to execution units from
|> either virtual processor is essentially free - they are after all virtual
|> processors, not real. What is required is that every bit of *architected*
|> processor state be renamed or duplicated, perhaps that's what Nick is
|> getting at?

You haven't allowed for the problem of access. A simple (CMP)
duplication doesn't increase the connectivity, and can be done
more-or-less by replicating a single core; SMT does, and may need
the linkages redesigning. This might be a fairly simple task for
2-way SMT, though there have been reports that it isn't even for
that, but consider it for 8-way.

I think we must be talking at cross purposes because to me an 8-way SMT is
very little different from a 2-way. Bigger register files for architected
state and a few more bits into the renamer. I don't know what you mean by
linkages in this context. Linkages between what and what?
|> > You have to mangle any performance counters and many privileged
|> > registers fairly horribly, because their meanings and constraints
|> > change. Similarly, you have to add logic for CPU state change
|> > synchronisation, because some changes must affect only the current
|> > thread and some must affect both. And you have to handle the case
|> > of the two threads attempting incompatible operations simultaneously.
|>
|> What operations are incompatible? SMT as implemented in the Pentium 4, say,
|> allows either virtual processor to do what it likes. One can transition from
|> user to kernel and back while the other services interrupts or exceptions or
|> whatever. The only coordination needed for proper operation is what is
|> needed for two processors - of course the performance may suffer though.

ABSOLUTELY NOT!

Look at the performance counters, think of floating-point modes
(in SMT, they may need to change for each operation), think of
quiescing the other CPU (needed for single to dual thread switching),
think of interrupts (machine check needs one logic, and underflow
another). In ALL cases, on two CPUs, each can operate independently,
but SMT threads can't.

I don't see this at all. I'm not saying these things are trivial, I'm saying
that most of it has to be done for a single threaded OoO CPU too.
|> yes, lots of speculation. The difference here is that two CMP processors take
|> about twice the silicon of one, while with SMT you have the option to use
|> 1.5 cores' worth of silicon. Perhaps once >dual cores is cheap and easy SMT
|> will die because it's more effort than it's worth, but my bet is that chips
|> will go both routes with SMT and CMP. Just one more little problem for the
|> OS developers to deal with :)

I am referring to the fair comparison between a 2-way SMT and a
dual-core CMP using the same amount of silicon, power etc. THAT
is what should have been compared - but I can find no evidence
that it was (though it probably was).

While I'm looking at the cost of making a single-threaded OoO CPU into a
multithreaded one. That probably explains much of the disparity above. If I
had enough silicon for two OoO CPUs I'd probably take the extra hit (5%-30%,
or whatever) to add SMT to each core. If the game is how to get the max
performance (by some measure) from a given area of silicon, then I'd have to
know how big it is - if it's just too small for two separate cores...
Regards,
Nick Maclaren.

Peter
 
|> yes, lots of speculation. The difference here is that two CMP processors take
|> about twice the silicon of one, while with SMT you have the option to use
|> 1.5 cores' worth of silicon. Perhaps once >dual cores is cheap and easy SMT
|> will die because it's more effort than it's worth, but my bet is that chips
|> will go both routes with SMT and CMP. Just one more little problem for the
|> OS developers to deal with :)

I am referring to the fair comparison between a 2-way SMT and a
dual-core CMP using the same amount of silicon, power etc. THAT
is what should have been compared - but I can find no evidence
that it was (though it probably was).


Regards,
Nick Maclaren.

And for those who don't know, here is a more detailed explanation. Most
papers on the matter seem to assume the same number of execution units,
BUT in reality a large part of the core area has roughly N² areal
complexity, so for the same area (cache excluded) you could in REALITY
get about 1.4x or 1.33x the execution resources, with a 6% SMT overhead,
when comparing TWO processors against a SINGLE one. And doubling the
cache typically gives about a 10% increase overall. So that's not a
disadvantage for the CMP variant that DOESN'T share caches; on the
contrary, it reduces cache conflicts. And those 1.3x execution resources
don't mean 1.3x single-threaded performance, as MOST of the time a
single thread can only use a small portion of the resources. So a 2-core
CMP vs SMT would be, for instance, two 6-way cores VS one 8-way SMT
core, for similar area... And having separate caches wouldn't hurt too
much, especially if each CPU could get quick access to the other CPU's
L2 - with its OWN set of L2 tags, without updating the other CPU's L2$
LRU state, and with a shared victim cache... Now you have about twice
the cache bandwidth and less cache latency, and can avoid strange cache
conflicts between threads. Besides, a shared L2 I$ helps the I$ hit
rate...

Yes, there needs to be a balance between having more cores and having
more powerful cores, but current papers on the matter penalize CMP
because they don't take into account any of the design trade-offs that
give a CMP machine MORE execution resources in total, less cache
contention, and lower latency on a hit.
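
A back-of-the-envelope sketch of that arithmetic, assuming (as claimed
above) that execution/scheduling area grows roughly as the square of the
issue width and that 2-way SMT costs about 6% extra area - the numbers
come from the post, not from any particular design:

  #include <math.h>
  #include <stdio.h>

  int main(void)
  {
      double smt_width = 8.0;                           /* one wide SMT core */
      double smt_area  = 1.06 * smt_width * smt_width;  /* plus ~6% overhead */

      /* Spend the same execution area on two narrower cores:
         2 * w^2 = smt_area  =>  w = sqrt(smt_area / 2)  */
      double cmp_width = sqrt(smt_area / 2.0);

      printf("per-core width     : %.1f-wide\n", cmp_width);
      printf("total CMP resources: %.2fx the SMT core's\n",
             2.0 * cmp_width / smt_width);
      return 0;
  }

That comes out at roughly a 5.8-wide core and about 1.46x total
resources - in the same 1.3x-1.5x region as the figures above, before
counting the cache.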

Jouni Osmala
 
|> >
|> > You haven't allowed for the problem of access. A simple (CMP)
|> > duplication doesn't increase the connectivity, and can be done
|> > more-or-less by replicating a single core; SMT does, and may need
|> > the linkages redesigning. This might be a fairly simple task for
|> > 2-way SMT, though there have been reports that it isn't even for
|> > that, but consider it for 8-way.
|>
|> I think we must be talking at cross purposes because to me an 8-way SMT is
|> very little different from a 2-way. Bigger register files for architected
|> state and a few more bits into the renamer. I don't know what you mean by
|> linkages in this context. Linkages between what and what?

Between the register file and the execution units, and between
execution units. The point is the days when 'wiring' was cheap
are no more - at least according to every source I have heard!

|> > Look at the performance counters, think of floating-point modes
|> > (in SMT, they may need to change for each operation), think of
|> > quiescing the other CPU (needed for single to dual thread switching),
|> > think of interrupts (machine check needs one logic, and underflow
|> > another). In ALL cases, on two CPUs, each can operate independently,
|> > but SMT threads can't.
|>
|> I don't see this at all. I'm not saying these things are trivial, I'm saying
|> that most of it has to be done for a single threaded OoO CPU too.

No, they don't. Take performance counters. In an OoO CPU, you have
a single process and single core, so you accumulate the counter and,
at context switch, update the process state. With SMT, you have
multiple processes sharing one core - where does the time taken
(or the events occurring) in a shared execution unit get assigned? The
Pentium 4 kludges this horribly.
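
To spell out the single-core case (a hypothetical sketch, with invented
names - not any particular OS or CPU): the hardware counter free-runs
and the OS charges the delta to whichever process was running, which
only works because exactly one process owns the core between switches.

  #include <stdint.h>
  #include <stdio.h>

  /* Stand-in for an RDPMC-style counter read (invented for this sketch):
     a free-running event count. */
  static uint64_t hw_events;
  static uint64_t read_hw_counter(void) { return hw_events; }

  struct proc { const char *name; uint64_t events; };

  static uint64_t last_snapshot;

  /* On a context switch, charge everything since the previous switch to
     the outgoing process.  With one process per core that is unambiguous;
     under SMT, events raised in a *shared* execution unit during the
     interval partly belong to the other thread, and there is no clean way
     to split them - hence the kludges. */
  static void account_and_switch(struct proc *outgoing)
  {
      uint64_t now = read_hw_counter();
      outgoing->events += now - last_snapshot;
      last_snapshot = now;
  }

  int main(void)
  {
      struct proc a = { "A", 0 }, b = { "B", 0 };

      hw_events += 300; account_and_switch(&a);  /* A ran, caused 300 events */
      hw_events += 120; account_and_switch(&b);  /* then B ran, caused 120   */

      printf("%s: %llu  %s: %llu\n",
             a.name, (unsigned long long)a.events,
             b.name, (unsigned long long)b.events);
      return 0;
  }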

Consider mode switching. In an OoO CPU, a typical mode switch is
a synchronisation point, and is reset on a context switch. With
SMT, a mode must be per-thread (which was said by hardware people
to be impossible a decade ago).

Consider interrupt handling. Underflow etc. had better be handled
within its thread, because the other might be non-interruptible
(and think scalability). But you had BETTER not handle all machine
checks like that (such as ones that disable an execution unit, in
a high-RAS design), as the execution units are in common.

Consider quiescing the other CPU to switch between single and dual
thread mode, to handle a machine check or whatever. You had BETTER
ensure that both CPUs don't do it at once ....


Regards,
Nick Maclaren.
 
Robert Redelmeier said:
Logic. When else can SMT really do net increased work?
If you want to test, run some pointer-chasers.


This is getting more and more true as caches grow, but only
from an areal perspective. A multiplier still sucks back a
huge amount of power and tosses it as heat.

Which is an argument for SMT over CMP. With SMT, you can "share" one
multiplier between the two threads (assuming they are not both heavy users
of multiply - which is true for lots of server-type workloads), whereas a CMP
would require two multipliers, with all the power and heat issues that
implies.
 
snip
I am referring to the fair comparison between a 2-way SMT and a
dual-core CMP using the same amount of silicon, power etc. THAT
is what should have been compared - but I can find no evidence
that it was (though it probably was).

Probably because it can't be done. I think virtually everyone here believes
that the extra silicon area for a two way SMP is much less than 100% of the
die area of the core. Thus a two way SMP will use less die area, power,
etc. than a two way CMP and the comparison that you specify can't be done.
Let me repeat, I am not an SMP bigot. It seems to me that it is a useful
tool, along with others, including CMP, in the designer's toolbox. As
someone else has said, I expect the future to be combinations of both, along
with multiple chips per PCB and multiple PCBs per system.
 
first{dot} said:
I think that Nick is muddled on this one. If the base implementation is
already OoO then there will normally be many more physical registers than
architected ones. To go two-way SMT may not involve adding any physical
registers, but rather involve changes to renaming. "dual port" every
execution unit doesn't make much sense to me. Access to execution units from
either virtual processor is essentially free - they are after all virtual
processors, not real. What is required is that every bit of *architected*
processor state be renamed or duplicated, perhaps that's what Nick is
getting at?

Sure, simply tag the register names with the thread ID and let the
renaming take care of sorting out the threads' architected resources.
Pretty much everything in a modern processor has to be renamed anyway.
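
A rough sketch of that idea, with invented register counts and names
rather than any shipping design:

  #include <stdio.h>

  enum { NTHREADS = 2, NARCH = 32, NPHYS = 128 };   /* invented sizes */

  static int rename_map[NTHREADS][NARCH];   /* (thread, arch reg) -> phys reg */
  static int free_list[NPHYS];
  static int free_top;

  static void rename_init(void)
  {
      int p = 0, t, r;
      /* Give each thread's architected registers an initial physical home... */
      for (t = 0; t < NTHREADS; t++)
          for (r = 0; r < NARCH; r++)
              rename_map[t][r] = p++;
      /* ...and put the rest of the shared pool on the free list. */
      while (p < NPHYS)
          free_list[free_top++] = p++;
  }

  /* A destination gets a fresh physical register from the shared pool (a
     real renamer would stall when free_top hits zero); the old mapping is
     returned so it can be freed when the instruction retires. */
  static int rename_dest(int tid, int arch)
  {
      int old = rename_map[tid][arch];
      rename_map[tid][arch] = free_list[--free_top];
      return old;
  }

  /* Source operands just index the per-thread map - no extra register-file
     ports beyond what a single-threaded OoO renamer already needs. */
  static int rename_src(int tid, int arch) { return rename_map[tid][arch]; }

  int main(void)
  {
      rename_init();
      rename_dest(0, 5);                 /* thread 0 writes its r5      */
      rename_dest(1, 5);                 /* thread 1 writes its own r5  */
      printf("t0 r5 -> p%d, t1 r5 -> p%d\n", rename_src(0, 5), rename_src(1, 5));
      return 0;
  }

The point being that only the map is per-thread; the physical register
file and the free list stay shared.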
 
Probably because it can't be done. I think virtually everyone here believes
that the extra silicon area for a two way SMP is much less than 100% of the
die area of the core. Thus a two way SMP will use less die area, power,
etc. than a two way CMP and the comparison that you specify can't be done.
Let me repeat, I am not an SMP bigot. It seems to me that it is a useful
tool, along with others, including CMP, in the designer's toolbox. As
someone else has said, I expect the future to be combinations of both, along
with multiple chips per PCB and multiple PCBs per system.

In the above, you mean SMT, I assume.

It's been possible for at least 5 years, probably 10. Yes, the cores
of a CMP system would necessarily be simpler, but it becomes possible
as soon as the transistor count of the latest and greatest model in
the range exceeds double that of the simplest. Well, roughly, and
allowing for the difference between code and data transistors.


Regards,
Nick Maclaren.
 
keith said:
Well, I was talking about single-chip SMP. Even at that it was rather
obvious (I believe I argued with Fleger over this). What else to do with
infinite transistor budgets after caches?

Uh, Fleger here. ;-)

We have long had desktop SMP available. Question: what legacy
software runs faster on two cores (whether on one or two chips) than
on one? Answer: none.

Desktop SMP has always been for the person who wants to check his
email while a compile is in progress. SMP == workstation. Useless
for the vast majority of PC users, although there is a heavy
sprinkling of workstation users in this NG. Keith, for instance.

One of the stories I got about dual-core (*initially* dual) cpus is
that they were to solve the heat problem. So we just had IDF where
200 watt heat-sinks were on display for dual-core CPUs. What??

I agree that almost every server will run faster with multicore CPUs.
I strongly disagree that multicores will benefit Joe MS Office or Joe
IE. As Keith points out, I had this opinion 5 years ago. I see no
reason to change.
 
Felger Carbon said:
We have long had desktop SMP available. Question: what legacy
software runs faster on two cores (whether on one or two chips) than
on one? Answer: none.

Photoshop, and probably most other professional audio/video/graphics
programs, especially for Apple.

Oh, and make. I bought a SMP Linux computer years ago for the specific
purpose of running my compiler testsuite in half the time.
Parallelization was done with make.
 
Kees van Reeuwijk said:
Photoshop, and probably most other professional audio/video/graphics
programs, especially for Apple.

Bzzt! This is an IBM.PC NG.
Oh, and make. I bought a SMP Linux computer years ago for the specific
purpose of running my compiler testsuite in half the time.
Parallelization was done with make.

"Make" is run on workstations. It is not a legacy application for
personal computers.
 
In comp.sys.ibm.pc.hardware.chips Felger Carbon said:
We have long had desktop SMP available. Question: what
legacy software runs faster on two cores (whether on one
or two chips) than on one? Answer: none.

`make -j2 bzlilo` to build a Linux kernel runs about 1.9x faster.
Desktop SMP has always been for the person who wants to
check his email while a compile is in progress.

Checking/writing email/news uses so few cycles that I don't
notice any change in compile speed.
I agree that almost every server will run faster with
multicore CPUs.

I doubt even this. It depends on the nature of the workload.
If the thing is network or disk bound, multi won't help.

-- Robert
 
Felger Carbon said:
Bzzt! This is an IBM.PC NG.

You are cross-posting to comp.arch. Please pay attention. In addition,
you might want to spend less time explaining how the average PC user
running only Word or Internet Explorer won't benefit; I don't see
anyone arguing against that.

-- greg
 
Sorry, I missed that upthread.


A very good point. SMT is a fairly simple thing.
Orthogonal to other efforts to improve performance.


True enough. You run out of orthogonalities :)


Decent? What do you classify as decent? I see'em around $200,
and surely you don't shy away from fixing painted jumpers?
I figure the dual premium is around $200 now.

Every board I looked at from Asus, Tyan, or any of the others on my short
list. ...and I don't over-clock either. ;-)
Oh, I see you're still running the K6-3. No reason to stop.

Yep, as the WinBlows system. My wife hasn't gotten used to Linux yet. I
have to beat on her to give up IE, in favor of FireFox. Actually there is
a Win boot partition on this system, but it's never been used for anything
other than bringup/test.
 
Which is an argument for SMT over CMP. With SMT, you can "share" one
multiplier between the two threads (assuming they are not both heavy users
of multiply - which is true for lots of server-type workloads), whereas a CMP
would require two multipliers, with all the power and heat issues that
implies.

It's not that much of a gain, since you haven't doubled the decode/dispatch
width. Heat/power of unused execution units can be largely mitigated with
clock gating and other power-saving techniques. The pipeline is deep; use
that knowledge.
 
Uh, Fleger here. ;-)

We have long had desktop SMP available. Question: what legacy
software runs faster on two cores (whether on one or two chips) than
on one? Answer: none.

Available, but expen$ive. Legacy isn't everything. People will think of
new ways of using computers. Many things I do today will put a drag on a
computer and ruin the "Windows Experience". ;-) Bursty
interactive performance isn't good for productivity. I seem to keep my
laptop at work rather busy with multiple tasks. Hell, I can keep several
servers busy if I'm in the right mood and the coffee is hot enough. ;-)
Desktop SMP has always been for the person who wants to check his email
while a compile is in progress. SMP == workstation. Useless for the
vast majority of PC users, although there is a heavy sprinkling of
workstation users in this NG. Keith, for instance.

Ok, I'll let you define "workstation" such that it includes
a P3-850 running on a battery (not often). Is it my turn to define the
terms tomorrow? ;-)
One of the stories I got about dual-core (*initially* dual) cpus is that
they were to solve the heat problem. So we just had IDF where 200 watt
heat-sinks were on display for dual-core CPUs. What??

Umm, did you catch the link here earlier today, comparing the 90nm A64,
130nm A64, and 90nm P4? A P4 at >230W! Yeow! I passed that one around
the office. ;-)
I agree that almost every server will run faster with multicore CPUs. I
strongly disagree that multicores will benefit Joe MS Office or Joe IE.
As Keith points out, I had this opinion 5 years ago. I see no reason to
change.

It'll happen because there's nothing left to do with the other 500M
transistors. Build it and they will come. ;-)
 
Bzzt! This is an IBM.PC NG.

Ah said:
"Make" is run on workstations. It is not a legacy application for
personal computers.

Hell, my PS2/50Z was a workstation?! ...even when assembling 8051 code?
A new definition of workstation is born; runs MAKE.

How about surfing two (ten) web sites at once?
 
|> > Stefan Monnier wrote:
|>
|> >>> > Your second CPU will be mostly idle, of course, but so is the first CPU
|> >>> > anyway ;-)
|> >
|> > Yeah, but that's not bad.
|> > 2nd CPUs are cheap these days.
|>
|> You may think the second is "cheap", but I don't. The second CPU and the
|> board that goes with it are certainly *not* "cheap".

What board?

The cost difference is far more marketing than production. Dual
CPU boards are sold as 'servers' and as 'performance workstations',
both at a premium. They could equally well be sold with the same
margin as the 'economy' boards.

The development costs (board/chipset/BIOS) have to be recaptured across
fewer units sold, so will cost more. Look at the prices of boards with
on-board SCSI, for another example. OTOH, it doesn't cost all *that* much
more to throw another core on a chip.
 