Intel COO signals willingness to go with AMD64!!

  • Thread starter: Yousuf Khan
Well, yeah - there was that little legal tiff about Intel treading upon the
Digital ip portfolio ;-)

Umm, didn't they get the MA fab (whoopie!) and Strong-ARM out of
that tiff? ...not Alpha? Alpha went to the Q, to be swallowed
whole by Carley's Borg, no?
 
http://www.byte.com/art/9612/sec6/art3.htm

1993: First out-of-order execution microprocessor:
IBM and Motorola PowerPC 601

1995: Pentium Pro
[My words] The first x86 with an OoO core, very controversial because it
didn't execute legacy 16-bit applications well, but it was the first
implementation of the P6 core. It introduced many other innovations as
well, including register renaming, which is essential for OoO to have
any impact at all.
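
To make the register-renaming point concrete, here is a minimal toy
sketch (not any real microarchitecture, just an illustration of the
idea): each architectural destination gets a fresh physical register,
which removes the false WAR/WAW dependences that would otherwise force
in-order issue.

# Toy register renamer: illustrates why renaming matters for OoO.
# Architectural registers are renamed to fresh physical registers so
# that only true (read-after-write) dependences remain.

from itertools import count

def rename(instructions):
    """instructions: list of (dest, src1, src2) architectural register names."""
    phys = count()                      # endless supply of physical registers
    mapping = {}                        # architectural -> current physical
    renamed = []
    for dest, src1, src2 in instructions:
        s1 = mapping.get(src1, src1)    # read the latest mapping for each source
        s2 = mapping.get(src2, src2)
        d = f"p{next(phys)}"            # fresh physical register for the dest
        mapping[dest] = d
        renamed.append((d, s1, s2))
    return renamed

# r1 is reused as a destination: a WAW hazard before renaming, gone
# afterwards, so the two dependence chains can execute out of order.
code = [("r1", "r2", "r3"),
        ("r4", "r1", "r5"),
        ("r1", "r6", "r7"),     # reuses r1; only a false dependence
        ("r8", "r1", "r9")]

for before, after in zip(code, rename(code)):
    print(before, "->", after)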

...and around the same time ('96 or slightly before) came the Cyrix M1,
which did a whole lot better at 16-bit code.

OoO was well known. It was difficult, and transistors were a tad more
scarce.

<snip>
 
David Schwartz said:
I think more than anything, these types of announcements are calculated
to try to slow down the adoption of the Opteron. If there's going to be
something new "any day now", you don't want to take any radical steps today.

DS

Interesting... Since the Opteron was all about not taking the radical
step with Itanic.

Carlo
 
(e-mail address removed) says...


...late to the party, but:

Gee, even the much maligned Cyrix 6X86 was an OoO processor, sold
in what, 1986? Evidently Cyrix thought it was a winner, and they
weren't wrong.

Since processor and memory speeds were much more nearly comparable at
the time, it would have required predictive capacity bordering on the
spooky to have foretold _why_ OoO would turn out to be so important.

RM
 
^^^^
?? I think you're off by a decade there Keith!
Since processor and memory speeds were much more nearly comparable at
the time, it would have required predictive capacity bordering on the
spooky to have foretold _why_ OoO would turn out to be so important.

Is it really that spooky to predict that the gap would continue to
grow? In 1980 most processors could get a piece of data from memory
for every clock cycle. By 1990 they were waiting 10-20 clock cycles.
Was it that big of a leap of faith to think that in another 10 years
that number would have increased to 100-200 clock cycles? It was
quite plain that clock speeds were increasing MUCH faster than memory
latency was decreasing.
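
As a rough back-of-the-envelope check of that extrapolation (the 1980
and 1990 data points are the ones asserted above; the growth rate is
just what they imply, not measured data):

# Sketch: if the CPU/memory latency gap grew from ~1 cycle (1980) to
# ~15 cycles (1990), and the growth is roughly exponential, what does
# the same rate predict for 2000?  Numbers are the post's, not mine.

gap_1980 = 1.0
gap_1990 = 15.0            # midpoint of the "10-20 cycles" claim
years = 10

growth_per_year = (gap_1990 / gap_1980) ** (1.0 / years)
gap_2000 = gap_1990 * growth_per_year ** years

print(f"implied growth: ~{growth_per_year:.2f}x per year")
print(f"extrapolated gap in 2000: ~{gap_2000:.0f} cycles")   # ~225 cycles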
 
Umm, didn't they get the MA fab (whoopie!) and Strong-ARM out of
that tiff? ...not Alpha? Alpha went to the Q, to be swallowed
whole by Carley's Borg, no?

Alpha has slowly but surely been migrating over to Intel. The same
deal in which they got the fab and StrongARM also gave Intel some of
the Alpha rights. Then later they got a bit more during the Compaq
years. After the whole DEC/Alpha group merged into HPaq, the Itanium
group and the Alpha group seem to have sort of merged, and a number of
these people have since moved over to Intel as part of IA64 deals.

In short, the entire Alpha organization has been dismantled, piece by
piece. The technology and the people (at least those that didn't
jump ship) have pretty much all found their way over to Intel in some
fashion, even if there never was an official buyout.
 
Is it really that spooky to predict that the gap would continue to
grow? In 1980 most processors could get a piece of data from memory
for every clock cycle. By 1990 they were waiting 10-20 clock cycles.
Was it that big of a leap of faith to think that in another 10 years
that number would have increased to 100-200 clock cycles? It was
quite plain that clock speeds were increasing MUCH faster than memory
latency was decreasing.

Hmmm. Let's wind the clock back to, say, 1990.

Robert: Hmmm. We're waiting 10-20 clock cycles now for data from
memory. If I'm not mistaken, we started out with memory and the CPU
at the same speed. If we put this all on a log plot, we're going to
be waiting 100-200 cycles by the millennium.

Tony: Oh, don't worry. Out of order execution will fix everything.

Robert: What?

Tony: Out of order execution. We'll issue instructions, keep them in
flight, and execute them as the data become available.

Robert: Are you telling me that, ten years from now, we're going to
have hundreds of instructions in process in the CPU at one time?

Tony: Yup. No problem.

Robert: You're out of your mind.

Tony: You just wait and see.

Since your prognostic powers are plainly much better than mine, could
you slip me an e-mail as to what's going to happen in the _next_ ten
years? ;-).

RM
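
For scale, a Little's Law style estimate shows why "hundreds of
instructions in flight" is the right order of magnitude once memory is
a couple of hundred cycles away. The issue width and latencies below
are illustrative assumptions, not figures for any particular CPU:

# Little's Law applied to latency hiding: to keep issuing W instructions
# per cycle while some of them wait L cycles for memory, you need roughly
# W * L instructions somewhere in flight.

def instructions_in_flight(issue_width, memory_latency_cycles):
    return issue_width * memory_latency_cycles

for latency in (20, 100, 200):
    print(f"latency {latency:3d} cycles, 4-wide issue ->",
          instructions_in_flight(4, latency), "instructions in flight")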
 
Tupo alert! I meant 1996.
Since processor and memory speeds were much more nearly comparable at
the time, it would have required predictive capacity bordering on the
spooky to have foretold _why_ OoO would turn out to be so important.

It was quite a win in 1996. Caches were already all the rage.
 
Okay folks, load up the baskets with rotten fruit and vegetables.
Haul out the flamethrowers, 'cause here it comes. AMD usually rides
on Intel's marketing coattails. In the case of x86-64, they have
beaten Intel at its own game by creating a desire where there is no
actual need. That's marketing.

RM
No fruit launches from me; you make a good point.
But also from a 'feeling' point of view, x86-64 has a good feel about it,
and even more important, there IS a need for a faster desktop PC NOW,
especially for video applications that are becoming more and more common,
and 64 bits is most welcome NOW, so it is also a real NEED.
(64 bits creates the expectation of 'faster', at least ;-) ).
Those four, need, good feel, compatibility, and availability, make a real winner.
Intel feels the heat, but has no answer for at least the next two quarters.
Jan
 
Robert Myers said:
Since your prognostic powers are plainly much better than mine, could
you slip me an e-mail as to what's going to happen in the _next_ ten
years? ;-).

RM


Crystalline semi-organic holographic storage replacing hard drives and
RAM, and a switch back to 100% execute-in-place architectures. No
buses either, as the chip substrate will be bonded directly to the
storage a la LCOS.

Hey, if you're going to guess, it might as well be a doozy.
 
Hmmm. Let's wind the clock back to, say, 1990.

Robert: Hmmm. We're waiting 10-20 clock cycles now for data from
memory. If I'm not mistaken, we started out with memory and the CPU
at the same speed. If we put this all on a log plot, we're going to
be waiting 100-200 cycles by the millennium.

Tony: Oh, don't worry. Out of order execution will fix everything.

Robert: What?

Tony: Out of order execution. We'll issue instructions, keep them in
flight, and execute them as the data become available.

Robert: Are you telling me that, ten years from now, we're going to
have hundreds of instructions in process in the CPU at one time?

Tony: Yup. No problem.

Robert: You're out of your mind.

Tony: You just wait and see.

Yeah, that's about the long and the short of it. It certainly was
enough to convince everyone else in the CPU business. Of course,
maybe that's because the flip side of this argument in favor of the
Itanium design went something like this:


Robert: Hmmm. We're waiting 10-20 clock cycles now for data from
memory. If I'm not mistaken, we started out with memory and the CPU
at the same speed. If we put this all on a log plot, we're going to
be waiting 100-200 cycles by the millennium.

Tony: Yup, this is going to kill the performance if we stick to
in-order CPU design.

Robert: I know, we'll design an entirely new system where the
compiler explicitly states what instructions can be grouped together,
re-order everything and predict all the possible execution paths at
compile time. Ohh, and we'll ask someone else to make the compiler
for it.

Tony: Are you telling me that, ten years from now, we're going to
have compilers that can predict at compile time dozens of different
instructions that can be executed at the same time without data
conflicts?

Robert: Yup, no problem.

Tony: You're out of your mind.

Robert: You just wait and see.


I think everyone recognized the potential problem. Intel/HP took one
approach to try to solve that problem; every other CPU designer in the
industry took another. Of course, it's quite possible that if every
other CPU designer had decided to go the VLIW/EPIC route, we would
have magical compilers by now and Itanium would have performed great
right from the get-go. However, there was obviously enough evidence
out there in the early '90s that OoO was the way to go that pretty
much everyone chose that path.
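
A crude way to see what the compiler is being asked to do in the EPIC
model: group instructions into bundles that have no dependences inside
them, entirely at compile time. The toy below only checks register
dependences; it is a sketch of the general idea, not IA-64's actual
bundle rules, and real compilers also have to reason about memory
aliasing and branch behaviour, which is where it gets hard.

# Toy compile-time "bundler": pack instructions into groups that can
# issue together because no instruction reads or writes a register
# written earlier in the same group.

def bundle(instructions, width=3):
    """instructions: list of (dest, sources) tuples, in program order."""
    bundles, current, written = [], [], set()
    for dest, sources in instructions:
        conflict = dest in written or any(s in written for s in sources)
        if conflict or len(current) == width:
            bundles.append(current)
            current, written = [], set()
        current.append((dest, sources))
        written.add(dest)
    if current:
        bundles.append(current)
    return bundles

prog = [("r1", ("r2", "r3")),
        ("r4", ("r5", "r6")),   # independent of the first: same bundle
        ("r7", ("r1", "r4")),   # needs r1 and r4: forces a new bundle
        ("r8", ("r2", "r9"))]

for i, b in enumerate(bundle(prog)):
    print(f"bundle {i}:", b)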
Since your prognostic powers are plainly much better than mine, could
you slip me an e-mail as to what's going to happen in the _next_ ten
years? ;-).

More of the same? Just look at some of the current trends:

- Memory latency relative to processor speed will continue to grow.
There have been some suggestions of redesigning memory to fix this
problem but I, for one, am not holding my breath. Integrating the
memory controller onto the CPU die is an obvious idea here and I
suspect that most CPUs will go this route in the future. That won't
solve the problem, but every little bit helps.

- Power dissipation. This is somewhat of a newish problem (it's
always existed, but is a much bigger concern now). Processor power
consumption has been rising steadily and quite rapidly, and
improvements in the manufacturing process are not keeping up at all.
We're already seeing a push towards performance/watt rather than just
raw performance, but this will become rather critical in 10 years'
time. Not too long ago Intel did a study that suggested 1 kW
processors by the end of the decade if things kept going at the
current rate (a rough sketch of that extrapolation follows at the end
of this post).

- More transistors than we know what to do with. With 90nm fabs it's
fairly easy to throw 100M transistors at a die and it can still be
pretty cheap. By the end of the decade we'll be looking at 500M+
transistors on a die, but what do you do with all of them? Part of
the problem here is that design costs have become such a dominant
portion of the expenses that paying someone to design those extra
transistors might not be worthwhile. This also ties into the power
consumption thing, because more transistors tends to mean more power
consumed.

There are obviously some other considerations, but these are some of
the important factors that CPU designers are going to have to think
about in their current and near-future designs. Back in 1990 dealing
with the rapidly growing gap in processor speed vs. memory latency was
the big issue. Right now, the big issue I see going forward is
performance/watt. The number of applications that are CPU-bound has
shrunk considerably in the last 10 years, but where the CPU used to
consume only 5-10% of the power in a computer, it is now consuming
20-30%. Unless this problem is tackled, CPUs will be consuming 50%+ of
a computer's power while spending an awful lot of time waiting on I/O.
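
As promised above, a rough compound-growth version of the power
extrapolation. The starting point and growth rate are assumptions
picked to illustrate the shape of the trend, not figures from the
Intel study:

# Sketch: compound growth of CPU power dissipation.  If a ~100 W part
# grows ~30% per year, how long until it crosses 1 kW?  Both numbers
# are assumptions for illustration only.

power_watts = 100.0
growth = 1.30
year = 2003

while power_watts < 1000.0:
    power_watts *= growth
    year += 1

print(f"crosses 1 kW around {year} (~{power_watts:.0f} W)")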
 
Interesting... Since the Opteron was all about not taking the radical
step with Itanic.

Remember Intel's #1 outlet does not do AMD. IOW people can say "I'll wait
till Dell has one".

Rgds, George Macdonald

"Just because they're paranoid doesn't mean you're not psychotic" - Who, me??
 
Robert: I know, we'll design an entirely new system where the
compiler explicitly states what instructions can be grouped together,
re-order everything and predict all the possible execution paths at
compile time.

Believe it or not, within the last year, I have stood in front of a
group of people, some of them very well known in the business, some of
them very well known outside the business, and all of them with
impressive accomplishments behind them, and defended that very idea.

One person in the room interrupted my talk to say that he had told
Intel from the very beginning that the feedback loop was closed too
far away from the action, and another in the front row, who got up
early in the morning to hear me talk about this (and knew what was
coming) shook his head in lamentation at, as he said it, "all the IQ
points that had been wasted on this problem."

Some of us are just stubborn. Whether or not it was or is a good
strategy for Intel, there are very good reasons why you want to
understand this strategy and why it does or does not work. It fits
into a long tradition of research that has occupied some of the very
best people in the business and continues to occupy some of the very
best people in the business.

Please note that I make no claim to greatness or even to competence by
mere association. I only wish to note that I am not the only one who
has seen Itanium as an opportunity to work on a very fundamental
problem of potentially great importance to the future of computation.

RM
 
Hmmm. Let's wind the clock back to, say, 1990.

Let's wind the clock back to 1970 (perhaps 1960). We had the
same issues then. Caches weren't invented in 1990, as you
apparently believe.

<snip>
 
Robert said:
Hmmm. Let's wind the clock back to, say, 1990.

Robert: Hmmm. We're waiting 10-20 clock cycles now for data from
memory. If I'm not mistaken, we started out with memory and the CPU
at the same speed. If we put this all on a log plot, we're going to
be waiting 100-200 cycles by the millennium.

Tony: Oh, don't worry. Out of order execution will fix everything.

Robert: What?

Tony: Out of order execution. We'll issue instructions, keep them in
flight, and execute them as the data become available.

Robert: Are you telling me that, ten years from now, we're going to
have hundreds of instructions in process in the CPU at one time?

Tony: Yup. No problem.

Robert: You're out of your mind.

Tony: You just wait and see.

Since your prognostic powers are plainly much better than mine, could
you slip me an e-mail as to what's going to happen in the _next_ ten
years? ;-).

Excellent post. Why the heck couldn't you guys have started
off with that? Sums it up veddy noicely.
 
Let's wind the clock back to 1970 (perhaps 1960). We had the
same issues then. Caches weren't invented in 1990, as you
apparently believe.

What in that post would lead you to believe that I thought cache was
invented in 1990? The (implied) context was the world of
microprocessors, and the particular point of discussion was the
importance of _cache misses_.

CACHE misses, Keith. That's what we were talking about. Could anyone
grasp how catastrophic CACHE misses would become for performance and
how it would be handled in the long run. C-A-C-H-E misses.

To clear up any possible confusion about how well this issue was (not)
understood as late as the mid-90's, I went looking for references to
the term "compulsory cache miss". Among other things, I turned up

http://www.complang.tuwien.ac.at/anton/memory-wall.html

by Anton Ertl, a frequent and respected contributor to comp.arch. In
that document, he is taking aim at a famous paper, "Hitting the Memory
Wall: Implications of the Obvious" by Wulf et al.

The belief at the time was that computing time would eventually be
dominated by compulsory cache misses, and Ertl's main beef was that
"Hitting the Memory Wall" made unwarranted assumptions about
compulsory cache misses. Even Prof. Ertl missed the point.

In order to make it completely, utterly, unalterably, unarguably,
transparently clear just *how* poorly the issue was understood at the
time, I am going to make an extended quote from the famous Wulf paper:

<begin quote>

To get a handle on the answers, consider an old friend, the equation
for the average time to access memory, where t_c and t_m are the cache
and DRAM access times and p is the probability of a cache hit:

t_avg = p*t_c + (1-p)*t_m

We want to look at how the average access time changes with
technology, so we'll make some conservative assumptions; as you'll
see, the specific values won't change the basic conclusion of this
note, namely that we are going to hit a wall in the improvement of
system performance unless something basic changes.

First let's assume that the cache speed matches that of the processor,
and specifically that it scales with the processor speed. This is
certainly true for on-chip cache, and allows us to easily normalize
all our results in terms of instruction cycle times (essentially
saying t_c = 1 CPU cycle). Second, assume that the cache is perfect.
That is, the cache never has a conflict or capacity miss; the only
misses are the compulsory ones. Thus (1-p) is just the probability of
accessing a location that has never been referenced before (one can
quibble and adjust this for line size, but this won't affect the
conclusion, so we won't make the argument more complicated than
necessary).

Now, although (1-p) is small, it isn't zero. Therefore as t_c and t_m
diverge, t_avg will grow and system performance will degrade. In
fact, it will hit a wall.

<end quote>

Simple, obvious, easy to state, easy to understand, and WRONG.
Today's computer programs hide many so-called compulsory misses by
finding something else for the processor to do while waiting for the
needed data to become available. What makes OoO so powerful is that
it can hide even compulsory misses, and people just didn't get it,
even by the mid-nineties. People were still thinking of cache in
terms of data re-use.

No one whose primary experience was on a Cray-1 type machine would
have made such a mistake, because there was no cache to miss, and
memory access latency was significant (eight cycles if I recall
correctly without checking). Cray-type machines had been hiding most
so-called compulsory misses ever since the machine went into
production, and with in-order execution, by the simple expedient of
pipelining.
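
To put illustrative numbers on the quoted formula, and on why
overlapping misses changes the conclusion: the hit rate, latencies and
overlap factor below are assumptions, not figures from the paper.

# The quoted memory-wall model: t_avg = p*t_c + (1-p)*t_m, which assumes
# every miss stalls the processor for the full DRAM latency.  The second
# figure divides the miss penalty by the number of misses the machine
# can keep in flight at once (memory-level parallelism), which is the
# effect OoO and pipelined memory systems exploit.  All numbers are
# illustrative assumptions.

p   = 0.99    # cache hit rate
t_c = 1.0     # cache access time, in CPU cycles
t_m = 200.0   # DRAM access time, in CPU cycles

t_avg_blocking = p * t_c + (1 - p) * t_m
overlap = 4.0                                    # assumed outstanding misses
t_avg_overlapped = p * t_c + (1 - p) * (t_m / overlap)

print(f"blocking model:  {t_avg_blocking:.2f} cycles per access")    # ~2.99
print(f"with 4x overlap: {t_avg_overlapped:.2f} cycles per access")  # ~1.49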

The term compulsory cache miss can still be found in much more recent
references, e.g.,

http://www.extremetech.com/article2/0,3973,34539,00.asp

but I have no idea why people keep talking about compulsory misses,
because the concept has turned out not to be all that important.

The misses that count are not first reference misses, or compulsory
misses, but first reference misses that are made too late to avoid
stalling the pipeline. For a processor like Itanium, whether you can
make memory requests early enough to avoid stalling the pipeline
depends on how predictable the code is. For an OoO processor, whether
you can make requests early enough to avoid stalling the pipeline
depends on lots of things, including how aggressive you want to be in
speculation and how much on-die circuitry you are willing to commit to
instruction juggling.

The designers of Itanium bet that on-die scheduling would take too
many transistors and too much power to work well. They bet wrong, but
to say that the issues were well understood when they put their money
down is simply to ignore history.

RM
 
Tony said:
Robert: I know, we'll design an entirely new system where the
compiler explicitly states what instructions can be grouped
together, re-order everything and predict all the possible
execution paths at compile time. Ohh, and we'll ask someone
else to make the compiler for it.

Intel did write their own compiler for IPF, and, unsurprisingly, it is
the best compiler available for that platform, as far as I know.

http://www.intel.com/software/products/compilers/linux/
 
Keith R. Williams said:
Let's wind the clock back to 1970 (perhaps 1960). We had the
same issues then. Caches weren't invented in 1990, as you
apparently believe.

They were for desktop micros. ;-)

Stuff like caches was invented first for mainframes, then for
minicomputers, and only recently (historically speaking) for desktop
microprocessors.

Question: when will the first smoke-detector micro to use an L1 cache
be introduced?
 
Felger Carbon said:
Question: when will the first smoke-detector micro to use an L1 cache
be introduced?

Shortly after the itanium laptop.

-wolfgang
 
Believe it or not, within the last year, I have stood in front of a
group of people, some of them very well known in the business, some of
them very well known outside the business, and all of them with
impressive accomplishments behind them, and defended that very idea.

One person in the room interrupted my talk to say that he had told
Intel from the very beginning that the feedback loop was closed too
far away from the action, and another in the front row, who got up
early in the morning to hear me talk about this (and knew what was
coming) shook his head in lamentation at, as he said it, "all the IQ
points that had been wasted on this problem."

As mentioned previously, I keep hearing about this "feedback loop" and,
while its importance to VLIW/EPIC seems obvious, I have trouble seeing how
it fits into the model for delivery of commercial software. Is every client
supplied with a compiler "free" or does the price have to be included in
the software?... or is Intel going to give compilers away to sell CPU
chips?... or any other of the various permutations for supplying the
capability? BTW I am looking at a future world where Open Source will not
displace paid-for software, especially in the domain of "difficult
problems".

From a practical standpoint, are we to believe that a re-train has to be
done for every variation on the "dataset"? How many (near) repetitions on
a given "dataset" make it worthwhile to do the re-train? Can that even be
defined?
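
For concreteness, the "re-train" being asked about is profile
feedback: compile with instrumentation, run on a training input, then
recompile using the recorded behaviour. A toy conceptual sketch (not
any real compiler's interface) of why the choice of training dataset
matters:

# Toy illustration of profile feedback and its dataset sensitivity.
# "Compiling" here just means choosing which branch direction to
# optimize for, based on counts gathered from a training run.

def profile(data, threshold):
    """Training run: count how often the hot branch is taken."""
    taken = sum(1 for x in data if x > threshold)
    return {"taken": taken, "not_taken": len(data) - taken}

def choose_layout(counts):
    """'Recompile': lay out / predict for the common direction."""
    if counts["taken"] >= counts["not_taken"]:
        return "optimize-for-taken"
    return "optimize-for-not-taken"

train_a = [5, 7, 9, 8, 6]     # mostly above threshold
train_b = [1, 2, 9, 1, 0]     # mostly below threshold

for name, data in (("dataset A", train_a), ("dataset B", train_b)):
    counts = profile(data, threshold=4)
    print(name, counts, "->", choose_layout(counts))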
Some of us are just stubborn. Whether or not it was or is a good
strategy for Intel, there are very good reasons why you want to
understand this strategy and why it does or does not work. It fits
into a long tradition of research that has occupied some of the very
best people in the business and continues to occupy some of the very
best people in the business.

Are you familiar with the term "perfect future technology"? :-)
Please note that I make no claim to greatness or even to competence by
mere association. I only wish to note that I am not the only one who
has seen Itanium as an opportunity to work on a very fundamental
problem of potentially great importance to the future of computation.

Have you considered that as the complexity of the solution exceeds that of
the problem, we have an, umm, enigma?

Rgds, George Macdonald

"Just because they're paranoid doesn't mean you're not psychotic" - Who, me??
 