The end of Netburst in 2006


YKhan

X-bit labs - Hardware news - Intel Confirms New CPU Architecture to
Launch in Late 2006.
http://www.xbitlabs.com/news/cpu/display/20050512111032.html

One interesting thing they mentioned is that they're going to attempt
to retain Hyperthreading even with the new less-pipelined core.
Hyperthreading is easy on a deeply pipelined core like the Pentium 4,
which has a lot of idle slots in its pipeline to fit two threads. In a
shallowly pipelined architecture, with fewer idle slots, fitting a
second thread in there would probably end up making one thread or the
other, or both, slower. The only way around it is to actually do proper
Simultaneous MultiThreading (SMT), and install more execution units for
each thread. The difference between SMT and Hyperthreading is like the
difference between a Concorde and a jumbo jet -- they both achieve the
same thing, but go about it in different ways. SMT is also much more
difficult to design than Hyperthreading, and even more difficult than
multicore.

It would be interesting to know if they're just going to try to graft
simple HT onto the new core without any additional execution units, as
a cheap marketing stunt, despite the fact that it might slow down
applications badly. Or if they're going to do true SMT and just call it
HT to keep people from being confused.

Yousuf Khan
 
YKhan said:
X-bit labs - Hardware news - Intel Confirms New CPU Architecture to
Launch in Late 2006.
http://www.xbitlabs.com/news/cpu/display/20050512111032.html

One interesting thing they mentioned is that they're going to attempt
to retain Hyperthreading even with the new less-pipelined core.
Hyperthreading is easy on a deeply pipelined core like the Pentium 4,
which has a lot of idle slots in its pipeline to fit two threads. In a
shallowly pipelined architecture, with fewer idle slots, fitting a
second thread in there would probably end up making one thread or the
other, or both, slower. The only way around it is to actually do proper
Simultaneous MultiThreading (SMT), and install more execution units for
each thread. The difference between SMT and Hyperthreading is like the
difference between a Concorde and a jumbo jet -- they both achieve the
same thing, but go about it in different ways. SMT is also much more
difficult to design than Hyperthreading, and even more difficult than
multicore.

It would be interesting to know if they're just going to try to graft
simple HT onto the new core without any additional execution units, as
a cheap marketing stunt, despite the fact that it might slow down
applications badly. Or if they're going to do true SMT and just call it
HT to keep people from being confused.

I think what you are calling SMT is really multicore. The whole benefit
of HT is that it uses idle execution units with the addition of minimal
complexity, and by the time you add a lot of execution units it becomes
simpler to have individual cores with shared cache. Feel free to clarify
if you're not looking for that level of added Xunits.

What you said about pipeline length is correct, but there may be ways
around it. Consider as an example some sort of system where there are
several pipelines, one per thread, and an execution unit traffic control
which offers all available execution units to one thread, gets zero or
more micro-ops started and then offers any remaining units to another
thread. Clearly this could slow a thread at some point in the future,
but would allow better use of all Xunits, and probably more work done by
the CPU overall. No matter how you add CPU Xunits, they compete for
cache and eventually total memory bandwidth.
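
Just to make the idea concrete, here is a rough software sketch of the
greedy arbitration described above, written in C. The unit count and the
per-thread micro-op demands are made-up numbers, and real hardware would do
this in dispatch logic rather than software; it's purely illustrative.

#include <stdio.h>

#define NUM_UNITS   6      /* execution units available this cycle (made-up) */
#define NUM_THREADS 2

/* micro-ops each thread could start this cycle without blocking (made-up) */
static int ready_uops[NUM_THREADS] = { 4, 3 };

int main(void)
{
    int granted[NUM_THREADS] = { 0 };
    int used[NUM_THREADS] = { 0 };
    int free_units = NUM_UNITS;

    /* Offer all free units to the thread that can start the most micro-ops,
     * then offer whatever is left over to the next thread, and so on. */
    for (int pass = 0; pass < NUM_THREADS && free_units > 0; pass++) {
        int best = -1;
        for (int t = 0; t < NUM_THREADS; t++)
            if (!used[t] && (best < 0 || ready_uops[t] > ready_uops[best]))
                best = t;

        used[best] = 1;
        granted[best] = ready_uops[best] < free_units ? ready_uops[best]
                                                      : free_units;
        free_units -= granted[best];
    }

    for (int t = 0; t < NUM_THREADS; t++)
        printf("thread %d: issued %d of %d ready micro-ops\n",
               t, granted[t], ready_uops[t]);
    printf("idle units this cycle: %d\n", free_units);
    return 0;
}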

I note that as the Linux HT scheduler has gotten better the CPU time has
stayed the same but the clock time has dropped for some benchmarks.
 
I think what you are calling SMT is really multicore.

I'm quite confident that YK knows the difference. He's been around this
block a few times. ;-)
The whole benefit of HT is that it uses idle execution units with the
addition of minimal complexity,

For *large* values of "minimal complexity". ...or perhaps that's why P4's
SMT sucks so badly? SMT isn't simple.
and by the time you add a lot of
execution units it becomes simpler to have individual cores with shared
cache. Feel free to clarify if you're not looking for that level of
added Xunits.

Adding execution units does nothing if you can't keep them fed. The idea of
SMT is to keep the execution units busy when a pipe has to be flushed.
More execution units don't help if you can't keep 'em full.

If you think multiple processors (are you confusing "cores" with
"execution units"?) magically solve problems, well perhaps you want to
talk to the software types.
What you said about pipeline length is correct, but there may be ways
around it. Consider as an example some sort of system where there are
several pipelines, one per thread, and an execution unit traffic control
which offers all available execution units to one thread, gets zero or
more micro-ops started and then offers any remaining units to another
thread.

Huh? "some sort of" is rather vague. That's basically how SMT works,
except that each execution unit can operate on each thread
"simultaneously". Why would you want to limit the thread to an execution
unit? If that thread flushes, that execution unit is hosed (which is
contrary to the whole point of SMT).
Clearly this could slow a thread at some point in the future,
but would allow better use of all Xunits, and probably more work done by
the CPU overall. No matter how you add CPU Xunits, they compete for
cache and eventually total memory bandwidth.

If you add more execution units than you can dispatch to, or complete
from, you add power without adding any throughput. It doesn't matter how
many threads you have.
I note that as the Linux HT scheduler has gotten better the CPU time has
stayed the same but the clock time has dropped for some benchmarks.

<shrug> HT, at least as it exists in the P4, is a waste of silicon.
 
For *large* values of "minimal complexity". ...or perhaps that's why P4's
SMT sucks so badly? SMT isn't simple.

I think you're kind of hitting the nail on the head with the second
option. My understanding is that SMT added only a very small number
of transistors to the core (the numbers I've heard floated around are
5-10%, though I have no firm quote and I'm not sure if that's for
Northwood or Prescott). With IBM's Power5, where the performance
boost from SMT is much larger, I understand that they were looking at
a 25% increase in the transistor count.

That actually brings up a rather interesting point though. At some
point SMT may become counter-productive vs. multi-core. In the case
of the Power5, if you need to increase your transistor count by 25%
per core for SMT, you only need 4 cores before you've got enough
extra transistors for another full-fledged core. That of course leads
to the question, are you better off with 4 cores with SMT or 5 cores
without? My money is on 5 cores without.
Huh? "some sort of" is rather vague. That's basically how SMT works,
except that each execution unit can operate on each thread
"simultaneously". Why would you want to limit the thread to an execution
unit? If that thread flushes, that execution unit is hosed (which is
contrary to the whole point of SMT).

Agreed.. Keep the threads to a single physical processor, but stuff
all the execution units on that one processor as full as possible
while you can. Sooner or later they're all going to hang when you run
out of data or miss a branch and you'll have to flush the whole thing
and move on to something else. Best hit that point as quickly as
possible so that you can request the data as fast as possible.
Hopefully if your OoO execution is working you'll still have work to
do on that thread for a little while AFTER you've realized you need
the data, so you can keep the execution units doing their thing for a
little bit longer.
If you add more execution units than you can dispatch to, or complete
from, you add power without adding any throughput. It doesn't matter how
many threads you have.


<shrug> HT, at least as it exists in the P4, is a waste of silicon.

I'm not sure I'd agree with that. There ARE some situations where it
really does help and the transistor count is apparently small enough
that it's nearly free. Recently in some of the dual-core tests I've
seen some rather extreme-case multitasking tests where they found that
even with dual-core chips hyperthreading made a very noticeable
difference on performance of background tasks without affecting the
responsiveness of the foreground task.

Now, obviously I'll take dual-core over SMT any day, but by it's very
nature dual-core involves doubling the transistors. With SMT you can
get a much smaller boost with only a relatively small increase in
transistors. As mentioned above, at some point there is a break-even
point where SMT becomes totally useless, but I don't think we're there
yet.
 
I'm quite confident that YK knows the difference. He's been around this
block a few times. ;-)

I'm sure he knows, but even after rereading his post I'm less sure he
chose the optimal wording.
For *large* values of "minimal complexity". ...or perhaps that's why P4's
SMT sucks so badly? SMT isn't simple.




Adding execution units does nothing if you can't keep them fed. The idea of
SMT is to keep the execution units busy when a pipe has to be flushed.
More execution units don't help if you can't keep 'em full.

Gosh, and I thought they helped by doing more work in the same time even
when they are 15-30% used (what I see on P4). I think you meant that
they only help to the extent that they can be used by another thread,
and with that I agree.
If you think multiple processors (are you confusing "cores" with
"execution units"?) magically solve problems, well perhaps you want to
talk to the software types.

Where does magic come in? Less cache contention, less register
contention, more paths to the memory and i/o bus... what's not to like?
Hell, more cooling fans can't hurt, either.
Huh? "some sort of" is rather vague. That's basically how SMT works,
except that each execution unit can operate on each thread
"simultaneously". Why would you want to limit the thread to an execution
unit? If that thread flushes, that execution unit is hosed (which is
contrary to the whole point of SMT).

What made you think there was any such limitation? I was deliberately
vague to avoid a silly discussion of implementation detail. Having a
separate pipeline per thread allows independent reordering. Yes, that's
somewhat how SMT works, but what I had in mind was to offer the set of
idle Xunits to all threads, but give the units to the thread which could
do the most operations in parallel without blocking. Then put other
threads on remaining useful Xunits.
If you add more execution units than you can dispatch to, or complete
from, you add power without adding any throughput. It doesn't matter how
many threads you have.

I can't imagine any vendor adding an Xunit which couldn't be dispatched
to. Depending on simulations I can envision adding a unit which was used
only 5-10% of the time, because a 5% increase in throughput for a very
small increase in gate count and no increase in die size would be a win.
Okay, I'm making estimates on those values, but a feature like an adder
doesn't take a lot of gates compared to the count on modern CPUs.
<shrug> HT, at least as it exists in the P4, is a waste of silicon.

You haven't done the measurements. Not only is there a decrease in clock
time for some applications, but a decrease in context switches on some
threaded applications using HT. With one CPU an application with feeder
and consumer logic runs in pure turn-taking. With HT they can sometimes
both run without a context switch for ms at a time.

Not only does the CPU do more work, but it actually can use HT to make
less work (fewer context switches) needed. That shows up as fewer cache
misses as well. More work done, less work needed, better cache
performance. Not a waste in my book!
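
To make the feeder/consumer scenario concrete, here's a minimal
POSIX-threads sketch of that kind of pair (the buffer size and item count
are arbitrary). On a single non-HT CPU the two threads have to
context-switch to hand data off -- exactly the turn-taking described above.

#include <pthread.h>
#include <stdio.h>

#define SLOTS 64                       /* arbitrary ring-buffer size */
#define ITEMS 1000000                  /* arbitrary amount of work   */

static int buf[SLOTS];
static int count, head, tail;
static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t  not_full  = PTHREAD_COND_INITIALIZER;
static pthread_cond_t  not_empty = PTHREAD_COND_INITIALIZER;

static void *feeder(void *arg)         /* producer: fills the buffer */
{
    for (int i = 0; i < ITEMS; i++) {
        pthread_mutex_lock(&lock);
        while (count == SLOTS)
            pthread_cond_wait(&not_full, &lock);
        buf[head] = i;
        head = (head + 1) % SLOTS;
        count++;
        pthread_cond_signal(&not_empty);
        pthread_mutex_unlock(&lock);
    }
    return NULL;
}

static void *consumer(void *arg)       /* consumer: drains the buffer */
{
    long sum = 0;
    for (int i = 0; i < ITEMS; i++) {
        pthread_mutex_lock(&lock);
        while (count == 0)
            pthread_cond_wait(&not_empty, &lock);
        sum += buf[tail];
        tail = (tail + 1) % SLOTS;
        count--;
        pthread_cond_signal(&not_full);
        pthread_mutex_unlock(&lock);
    }
    printf("sum = %ld\n", sum);
    return NULL;
}

int main(void)
{
    pthread_t p, c;
    pthread_create(&p, NULL, feeder, NULL);
    pthread_create(&c, NULL, consumer, NULL);
    pthread_join(p, NULL);
    pthread_join(c, NULL);
    return 0;
}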
 
I'm sure he knows, but even after rereading his post I'm less sure he
chose the optimal wording.

I think you're wrong, but we can (and will, I'm sure;) discuss this
further.
Gosh, and I thought they helped by doing more work in the same time even
when they are 15-30% used (what I see on P4). I think you meant that
they only help to the extent that they can be used by another thread,
and with that I agree.

No, they are usable in a single-threaded processor. The current crop of
processors executes more than one instruction per clock, so there had
better be somewhere to *execute* those instructions. The "somewhere"
would be the "execution units".
Where does magic come in? Less cache contention, less register
contention, more paths to the memory and i/o bus... what's not to like?
Hell, more cooling fans can't hurt, either.

Huh? "Execution units" are very different things than "cores" or
"threads". A single processor, with a single thread can keep several
execution units busy at the same time. That's the whole point of OoO
super-scalar processors. A single FPU, for instance, may choke a
dual-issue machine. It would do far worse on a four-issue processor.
What made you think there was any such limitation?

Forget threads and let's go at it again. I haven't a clue, now, what your
point is. Pipeline length is an argument for multi-threading, but far
less of one for OoO or super-scalar. ...other than it's easier to find
parallelism with more pipe stages, but that has nothing to do with stalls.
I was deliberately
vague to avoid a silly discussion of implementation detail.

Well, there is the devil, eh?
Having a
separate pipeline per thread allows independent reordering. Yes, that's
somewhat how SMT works, but what I had in mind was to offer the set of
idle Xunits to all threads, but give the units to the thread which could
do the most operations in parallel without blocking. Then put other
threads on remaining useful Xunits.

In your dreams, perhaps. But that's not how processors work. Execution
units can be kept busy even bound to a single thread. There is no
requirement, nor reason, to dedicate execution units to a thread. To do
so is simply silly, when a single thread may be able to use them more
effectively.
I can't imagine any vendor adding an Xunit which couldn't be dispatched
to. Depending on simulations I can envision adding a unit which was used
only 5-10% of the time, because a 5% increase in throughput for a very
small increase in gate count and no increase in die size would be a win.

You missed the point. You *can* (and it's done all the time) dispatch,
from a single thread, to more than one execution unit at a time.
To throw more execution units at a CPU than you can dispatch to, or
complete from, is certainly a waste. Two FPUs, for example, is certainly
not a waste if you can dispatch/complete four instructions per clock.
Okay, I'm making estimates on those values, but a feature like an adder
doesn't take a lot of gates compared to the count on modern CPUs.

An adder is not an "execution unit". There are hundreds of adders in a
processor.

You haven't done the measurements. Not only is there a decrease in clock
time for some applications, but a decrease in context switches on some
threaded applications using HT. With one CPU an application with feeder
and consumer logic runs in pure turn-taking. With HT they can sometimes
both run without a context switch for ms at a time.

It's a well known fact that the P4 SMT is badly lacking. It seems that
90% of the users shut it off. That's not to say that SMT has no use, but
Intel screwed the pooch on that one (again).
Not only does the CPU do more work, but it actually can use HT to make
less work (fewer context switches) needed. That shows up as fewer cache
misses as well. More work done, less work needed, better cache
performance. Not a waste in my book!

You must be an Intel marketeer. Screw SMT and go SMP, if you must.
 
In your dreams, perhaps. But that's not how processors work. Execution
units can be kept busy even bound to a single thread. There is no
requirement, nor reason, to dedicate execution units to a thread. To do
so is simply silly, when a single thread may be able to use them more
effectively.

You totally lost me here. You said (a) Xunits can be kept busy when
bound to a single thread, then (b) there's no reason to do that, then
(c) a single thread can use them more effectively.
You must be an Intel marketeer. Screw SMT and go SMP, if you must.

What does marketing have to do with it? HT makes programs run faster ON
than OFF. Any arguments that it can't are suspect.
 
What does marketing have to do with it? HT makes programs run faster ON
than OFF. Any arguments that it can't are suspect.

It's also pretty obvious that in some, not so rare, task mixes HT can make
all tasks/threads run slower... i.e. longer time to complete than if run
consecutively. I'd hesitate to use it for any situation where I had a
compute bound task.
 
I'd rather think that HT will slow down for L1-cache bound tasks. You
effectively have half of each cache and half of uOP cache for each thread.
 
You totally lost me here. You said (a) Xunits can be kept busy when
bound to a single thread, then (b) there's no reason to do that, then
(c) a single thread can use them more effectively.

You obviously don't read very well.

A) You stated that execution units were only necessary for multiple
threads. False. A single thread can use multiple execution units in a
super-scalar processor. An OoO processor has more opportunity to find
parallelism in a single thread. Multiple execution units came long before
multi-threaded processors (well, ignoring the 360/91).

B) There is *every* reason to have multiple execution units for a single
threaded processor (see: A).

C) Since there is no reason that multiple threads are necessary to keep
many execution units busy, this is *not* a reason for multi-threaded
processors. In fact multiple threads (at least as Intel does things)
isn't much of a gain at all and often a negative.

--
Keith

What does marketing have to do with it? HT makes programs run faster ON
than OFF. Any arguments that it can't are suspect.

Because Intel's HT is a marketing gimmick that you've obviously fallen
for. ...and you're spreading the FUD.
 
A) You stated that execution units were only necessary for multiple
threads. False. A single thread can use multiple execution units in a
super-scalar processor. An OoO processor has more opportunity to find
parallelism in a single thread. Multiple execution units came long before
multi-threaded processors (well, ignoring the 360/91).
Correct.

B) There is *every* reason to have multiple execution units for a single
threaded processor (see: A).
Correct.

C) Since there is no reason that multiple threads are necessary to keep
many execution units busy, this is *not* a reason for multi-threaded
processors.

Wrong. The general form of your argument is wrong and it's wrong in this
particular situation as well.

The flaw in the general form of your argument can easily be seen if you
try the argument on other things. For example, you don't need to brush your
teeth daily to have healthy teeth. You could, for example, go to a dental
hygienist daily. It does not, however, follow that having healthy teeth is
not a reason to brush daily.

It's wrong in this particular case because one of the main benefits of
multi-threaded processors is that execution units that would otherwise lie
idle can do useful work. The more parallelism you can exploit, the greater
percentage of your execution units you can keep busy. Multi-threaded
processors give the processor more parallelism to exploit.
In fact multiple threads (at least as Intel does things) isn't much of a
gain at all and often a negative.

Actually, in my experience it has been a *huge* benefit on machines that
only have a single physical CPU. Not as useful on machines that have
multiple CPUs already.

DS
 
I'd rather think that HT will slow down for L1-cache bound tasks. You
effectively have half of each cache and half of uOP cache for each thread.

Yup, and the TLB, which is a *big* part of CPU performance, is going to
get soiled.
 
Tony said:
Now, obviously I'll take dual-core over SMT any day, but by it's very
nature dual-core involves doubling the transistors.

Not if part of the cache hierarchy is shared between cores,
e.g. Intel's Yonah.

By the way, you often write "it's" instead of its ;-)
 
keith said:
You obviously don't read very well.

Given that zero of the things you reply to below are in the text you
quoted, or in the original article, I don't think the problem is mine.
A) You stated that execution units were only necessary for multiple
threads. False. A single thread can use multiple execution units in a
super-scalar processor. An OoO processor has more opportunity to find
parallelism in a single thread. Multiple execution units came long before
multi-threaded processors (well, ignoring the 360/91).

B) There is *every* reason to have multiple execution units for a single
threaded processor (see: A).

C) Since there is no reason that multiple threads are necessary to keep
many execution units busy, this is *not* a reason for multi-threaded
processors. In fact multiple threads (at least as Intel does things)
isn't much of a gain at all and often a negative.



Because Intel's HT is a marketing gimmick that you've obviously fallen
for. ...and you're spreading the FUD.

I ran real benchmarks, for large compiles, DNS servers, and NNTP
servers. The compiles ran in 10-30% less clock time, the max tps of the
servers went up 10-15%. That's not FUD, that's FACT.
 
David said:

I assume you mean he's correct in his technical statement, and not that
you agree I ever said any such thing...
Wrong. The general form of your argument is wrong and it's wrong in this
particular situation as well.

The flaw in the general form of your argument can easily be seen if you
try the argument on other things. For example, you don't need to brush your
teeth daily to have healthy teeth. You could, for example, go to a dental
hygienist daily. It does not, however, follow that having healthy teeth is
not a reason to brush daily.

It's wrong in this particular case because one of the main benefits of
multi-threaded processors is that execution units that would otherwise lie
idle can do useful work. The more parallelism you can exploit, the greater
percentage of your execution units you can keep busy. Multi-threaded
processors give the processor more parallelism to exploit.




Actually, in my experience it has been a *huge* benefit on machines that
only have a single physical CPU. Not as useful on machines that have
multiple CPUs already.

Thank you, I'm not sure I've seen *huge* gains, but 10-30% for free is a
nice bonus. I've never seen a negative on real work, although there was
a benchmark showing that. Gains appear larger on threaded applications
than general use, probably because of more shared code and data in cache.

The real gain I see is when multiple threads exchange data via shared
memory. With one CPU there are constant context switches between the
producer and consumer threads. With SMT the number of CTX goes down,
which means that the CPU not only does more work in unit time, but that
the work to be done is reduced. User CPU percentage goes up, CTX rate
goes down, system time goes down. Win-win-win!
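
For what it's worth, the figures described above (user CPU, system time,
context-switch counts) can be read per process on Linux with getrusage();
here's a minimal sketch, with the actual threaded workload left as a
placeholder loop.

#include <stdio.h>
#include <sys/resource.h>

/* placeholder for the real threaded workload */
static void run_workload(void)
{
    volatile long x = 0;
    for (long i = 0; i < 100000000L; i++)
        x += i;
}

int main(void)
{
    struct rusage ru;

    run_workload();

    getrusage(RUSAGE_SELF, &ru);
    printf("user CPU:   %ld.%06ld s\n",
           (long)ru.ru_utime.tv_sec, (long)ru.ru_utime.tv_usec);
    printf("system CPU: %ld.%06ld s\n",
           (long)ru.ru_stime.tv_sec, (long)ru.ru_stime.tv_usec);
    printf("voluntary   context switches: %ld\n", ru.ru_nvcsw);
    printf("involuntary context switches: %ld\n", ru.ru_nivcsw);
    return 0;
}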
 
Wrong. The general form of your argument is wrong and it's wrong in this
particular situation as well.

Nope. I'm now rather more confident that you haven't a clue.

It's wrong in this particular case because one of the main benefits of
multi-threaded processors is that execution units that would otherwise lie
idle can do useful work. The more parallelism you can exploit, the greater
percentage of your execution units you can keep busy. Multi-threaded
processors give the processor more parallelism to exploit.

That is *not* the point. Modern processors are more limited in dispatch
and completion slots than they are in execution units (e.g. the developers
don't know how many FP instructions you're going to run together). As long
as a single thread can dispatch the processor will be full. Another thread
is *ONLY* useful if the pipe stalls. Even then, it's only useful to
restart another thread if your caches aren't trashed. Another thread can
muck up the works in any number of ways other than the caches.
Actually, in my experience it has been a *huge* benefit on machines that
only have a single physical CPU. Not as useful on machines that have
multiple CPUs already.

Your workload is quite unique then. No one else, other than Intel's
marketing department, has found such a workload.
 
Nope. I'm now rather more confident that you haven't a clue.

Always good to throw in a few insults while someone's trying to reason
with you. Your mother dresses you funny.
That is *not* the point. Modern processors are more limited in dispatch
and completion slots than they are in execution units (e.g. the developers
don't know how many FP instructions you're going to run together). As long
as a single thread can dispatch the processor will be full. Another thread
is *ONLY* useful if the pipe stalls. Even then, it's only useful to
restart another thread if your caches aren't trashed. Another thread can
muck up the works in any number of ways other than the caches.

This doesn't sound like anything even remotely resembling a reasonable
argument. It is a fact that a single thread is just not going to keep all
the execution units busy. Another thread could use those execution units.
Your workload is quite unique then. No one else, other than Intel's
marketing department, has found such a workload.

Here's a trivial example -- one program goes into a 100% CPU spin. With
HT, the system stays responsive (because the program can, at most, grab half
the CPU resources). Without it, it doesn't. Now you think a program that has
to do a lot of work while I'd prefer the system remain responsive is
unique?!

DS
 
Always good to throw in a few insults while someone's trying to reason
with you.

Hmm, I didn't see much "reason".
Your mother dresses you funny.

s/mother/wife

I went on my way 34 years ago.
This doesn't sound like anything even remotely resembling a reasonable
argument. It is a fact that a single thread is just not going to keep all
the execution units busy. Another thread could use those execution units.

All the execution units won't be busy because there aren't enough
issue/completion slots to fill all units. Another thread doesn't increase
the number of I/C slots. A single thread can easily fill the slots
available.

The argument for a second thread isn't execution units, rather OoO,
speculative execution, long pipes, and slow memory, thus expensive
flushes. Adding a thread adds more speculative execution and resource
thrashing for *perhaps* a chance of utilizing the pipeline when one thread
flushes. If it's done right it even works. Apparently Intel has an
"issue" with their implementation. It's not a clear win like you folks
believe it to be.
Here's a trivial example -- one program goes into a 100% CPU spin. With
HT, the system stays responsive (because the program can, at most, grab
half the CPU resources). Without it, it doesn't. Now you think a program
that has to do a lot of work while I'd prefer the system remain
responsive is unique?!

Like all trivial examples and hand-waving...

This is perhaps a good argument for SMP, but SMT will likely still
choke because the thread that's "spinning" isn't likely flushing the pipe,
since the pre-fetching/branch prediction is trying its best to keep the
pipe full. Of course for any implementation it's possible to come up with
a degenerate case. As noted elsewhere in this thread a "spinning thread"
can trash the L1, perhaps even L2, causing SMT to make things even worse.
Indeed, this is shown in several benchmarks.

Sometimes (Intel's implementation of) SMT is a win, sometimes a loss.
It's not at all clear whether it's worth it, but in any case it has
*nothing* to do with filling execution units (the OP's argument).
Multiple issue/completion slots will fill execution units from a single
thread.
 
Not if part of the cache hierarchy is shared between cores,
e.g. Intel's Yonah.

Perhaps I should have specified that you're doubling the transistors
in the core at the very least and possibly doubling cache transistors
as well.
By the way, you often write "it's" instead of its ;-)

Yeah, I do it mainly to piss off Keith who has commented on this more
than once! :> (actually I'm just lazy and never did learn me that
grammar stuff none too good!)
 
In comp.sys.ibm.pc.hardware.chips keith said:
All the execution units won't be busy because there
aren't enough issue/completion slots to fill all units.
Another thread doesn't increase the number of I/C slots.
A single thread can easily fill the slots available.

Very true, especially on a CPU like the iP7 (Pentium4)
that has lots of execution units, but very few issue ports.

AFAIK, the only case where SMT is a win is when a thread
stalls, like waiting for uncached data, IO, or frequent branch
misprediction. Otherwise it is a loss because of lower cache
hit rates (the caches are split). Some apps, like relational databases,
are pointer-chasing exercises and need a lot of uncached data.
I think compilers suffer a lot of misprediction.

-- Robert
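
To illustrate the pointer-chasing case Robert mentions, here's a minimal C
sketch (node and hop counts are arbitrary). Every load depends on the
previous one, so each cache miss stalls the thread and leaves execution
units idle that a second SMT thread could, in principle, use.

#include <stdio.h>
#include <stdlib.h>

#define NODES 1000000                  /* arbitrary: big enough to miss in cache */
#define HOPS  10000000L                /* arbitrary traversal length */

struct node { struct node *next; long pad[7]; };  /* pad to roughly a cache line */

int main(void)
{
    struct node *pool = malloc(NODES * sizeof *pool);
    long *order = malloc(NODES * sizeof *order);
    if (!pool || !order) return 1;

    /* Chain the nodes in a random order so each ->next load is a likely miss. */
    for (long i = 0; i < NODES; i++)
        order[i] = i;
    for (long i = NODES - 1; i > 0; i--) {   /* Fisher-Yates shuffle */
        long j = rand() % (i + 1);
        long tmp = order[i]; order[i] = order[j]; order[j] = tmp;
    }
    for (long i = 0; i < NODES; i++)
        pool[order[i]].next = &pool[order[(i + 1) % NODES]];

    /* The dependent-load chain: the next load can't start until the previous
     * one returns, so the pipeline spends most of its time waiting on memory. */
    struct node *p = &pool[order[0]];
    for (long i = 0; i < HOPS; i++)
        p = p->next;

    printf("finished at node %ld\n", (long)(p - pool));
    free(order);
    free(pool);
    return 0;
}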
 