Intel Core 2 Extreme X6800 Preview from Taiwan

  • Thread starter: Yousuf Khan

Yousuf Khan

Sure looks like Intel has leapfrogged AMD as badly as AMD had previously
leapfrogged Intel. The only problem I see though is that Intel isn't
expecting to have a lot of Core 2 Duos available for a while. Only 25%
of its production is going to be of this generation, the remaining 75%
will still be of the old Netburst generation. This means that it's going
to be selling tons of cheap undesirable Netburst processors at firesale
prices, which will result in a pricing war.

AnandTech: Intel Core 2 Extreme X6800 Preview from Taiwan
http://www.anandtech.com/cpuchipsets/showdoc.aspx?i=2771
 
Yousuf said:
Sure looks like Intel has leapfrogged AMD as badly as AMD had previously
leapfrogged Intel. The only problem I see though is that Intel isn't
expecting to have a lot of Core 2 Duos available for a while. Only 25%
of its production is going to be of this generation, the remaining 75%
will still be of the old Netburst generation. This means that it's going
to be selling tons of cheap undesirable Netburst processors at firesale
prices, which will result in a pricing war.

AnandTech: Intel Core 2 Extreme X6800 Preview from Taiwan
http://www.anandtech.com/cpuchipsets/showdoc.aspx?i=2771

It must have been discussed before, but could you summarize in
a few sentences the main points that make Core 2 Duo work
so well across the board despite lower memory bandwidth and higher
latency? What is it - more cache? Higher frequency? Architecture
improvements?

Regards,
Evgenij
 
David said:
Excellent. That is exactly what I was looking for.
The short summary appears to be:
- many significant architecture improvements (main points are 4
operations vs. 3 per cycle (~25% improvement), and fusion of several
external ops into one internal op (~10% improvement))
- higher frequency capability due to the 65 nm process
- better use of L1 cache due to shared access between the 2 cores
- more power management plus the 65 nm process results in better efficiency
Did I miss something critical?

Regards,
Evgenij
 
Evgenij said:
Excellent. That is exactly what I was looking for.
The short summary appears to be:
- many significant architecture improvements (main points are 4
operations vs. 3 per cycle (~25% improvement), and fusion of several
external ops into one internal op (~10% improvement))
- higher frequency capability due to the 65 nm process
- better use of L1 cache due to shared access between the 2 cores
- more power management plus the 65 nm process results in better efficiency
Did I miss something critical?

Regards,
Evgenij

Here you guys go. One for your very own to play with. Go to

http://www.techonline.com/community/prod_eval/devel_systems/39098


- Processor: Intel® Core™ Duo 2.0GHz dual-core processor with 667MHz FSB
- Chipset: Intel® 82945GM GMCH with Graphics Media Accelerator 950 core
at 250MHz
- RAM: 256MB DDR2 system memory running at 667MHz
- Operating System: Windows XP Pro

They have other stuff to play with if you go upstream... to
http://www.techonline.com/community/prod_eval/devel_systems
 
Evgenij said:
It must have been discussed before, but could you summarize in
a few sentences the main points that make Core 2 Duo work
so well across the board despite lower memory bandwidth and higher
latency? What is it - more cache? Higher frequency? Architecture
improvements?

Yeah, there have been some micro-architectural improvements, but that's to
be expected. Every new generation there are micro-architectural
improvements that will blow away the previous generation (there were
similar descriptions about Pentium 4's micro-architecture when it was
first introduced), but it's always been a little dubious how much gain
they actually get simply from micro-architecture in the real world. But
I think the real story here is Core 2's cache. Intel is managing to get
the same levels of latency from Core 2 that AMD gets from AMD64, even
without an inboard memory controller! It's likely that Core 2 is driving
close to maximum performance out of its FSB, more often than any
previous Intel architecture.

Yousuf Khan
 
Yousuf Khan said:
Yeah, there have been some micro-architectural improvements, but that's to be
expected. Every new generation there are micro-architectural improvements
that will blow away the previous generation (there were similar
descriptions about Pentium 4's micro-architecture when it was first
introduced), but it's always been a little dubious how much gain they
actually get simply from micro-architecture in the real world. But I think
the real story here is Core 2's cache. Intel is managing to get the same
levels of latency from Core 2 that AMD gets from AMD64, even without an
inboard memory controller! It's likely that Core 2 is driving close to
maximum performance out of its FSB, more often than any previous Intel
architecture.

Yousuf Khan

Not at all. There have been many benchmarks showing that the increase in
performance of increasing the cache is relatively minor in the majority of
situations. It is true that "micro-architectural improvements" generally
lead to relatively small increases in performance. The one big difference
that has always been associated with major leaps in performance is when you
increase the number of instructions per clock. The reason that Conroe is so
exciting is that it does increase the IPC (for the first time since the
Pentium Pro?). An AnandTech article states that "It can decode 4 x86
instructions per cycle, and sometimes 5 thanks to x86 fusion. AMD's Hammer
can do only 3." This is undoubtedly the major reason for the Core being so
much faster.
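The decode claim in that AnandTech quote can be illustrated with a toy model: a 4-wide decoder in which a cmp paired with the following conditional jump shares one decode slot (macro-fusion), so up to 5 x86 instructions enter per cycle. This sketch ignores every real decoder restriction and is only meant to show the arithmetic behind "4, and sometimes 5":

```python
# Toy model of 4-wide decode with cmp+jcc macro-fusion.
# Illustration only; real decoders have many more constraints.

def decode_cycles(stream, width=4, fuse=True):
    """Count cycles needed to decode a list of mnemonic strings."""
    cycles = 0
    i = 0
    while i < len(stream):
        slots = 0
        while slots < width and i < len(stream):
            if (fuse and stream[i] == "cmp"
                    and i + 1 < len(stream) and stream[i + 1] == "jcc"):
                i += 2          # cmp+jcc occupy a single decode slot
            else:
                i += 1
            slots += 1
        cycles += 1
    return cycles

loop = ["add", "add", "cmp", "jcc", "add"] * 4   # 20 instructions
print(decode_cycles(loop, fuse=True))    # 4 cycles -> 5 instrs/cycle
print(decode_cycles(loop, fuse=False))   # 5 cycles -> 4 instrs/cycle
```

With fusion the 20-instruction loop drains in 4 cycles (5 instructions per cycle); without it, 5 cycles, which is the "sometimes 5 thanks to x86 fusion" effect in miniature.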
 
In comp.sys.ibm.pc.hardware.chips Yousuf Khan said:
Yeah, there's been some micro-architecture improvements,
but that's to be expected.

More than a few. There's a complete extra ALU+SSE2 _and_
the issue port to run it. With very tight (library) ASM, this
core will be quad issue. Most likely not on compiled x86-32
code (maybe on x86-64) because of only a single load port.
What wasn't mentioned were the multiplier(s).

The read/write reorder buffer algorithm has been significantly
altered, most likely for the better.
Every new generation there are micro-architectural
improvements that will blow away the previous generation

Can you not separate wheat from chaff? There's always
marketroid drool. Sometimes, it's even valid.
(there were similar descriptions about Pentium 4's
micro-architecture when it was first introduced), but it's

Purest drool. From day zero, the Pentium4 was obviously
a dual issue CPU that had only a high clock [necessitating
deep pipelining] to recommend it.
always been a little dubious how much gain they actually get
simply from micro-architecture in the real world.

Very true. Even the lame P4 can excel at certain linear
crunches. It was designed to [multimedia].
But I think the real story here is Core 2's cache. Intel
is managing to get the same levels of latency from Core 2
that AMD gets from AMD64, even without an inboard memory
controller! It's likely that Core 2 is driving close to
maximum performance out of its FSB, more often than any
previous Intel architecture.

I think not. The low apparent latency most likely is due to
intelligent [MCH] prefetch. Not a bad thing, but no substitute
for the real thing when doing unpredictable hop-scotching
like traversing a relational database.

Still, this Intel Core2 looks very good, and I expect it
to be competitive or beat the AMD K7 clock-for-clock
on most [linear] benchmarks. I expect it will only fail
on pseudorandom chases. Unless it has lame multipliers.

-- Robert
 
Sure looks like Intel has leapfrogged AMD as badly as AMD had previously
leapfrogged Intel. The only problem I see though is that Intel isn't
expecting to have a lot of Core 2 Duos available for a while. Only 25%
of its production is going to be of this generation, the remaining 75%
will still be of the old Netburst generation. This means that it's going
to be selling tons of cheap undesirable Netburst processors at firesale
prices, which will result in a pricing war.

AnandTech: Intel Core 2 Extreme X6800 Preview from Taiwan
http://www.anandtech.com/cpuchipsets/showdoc.aspx?i=2771

Who could have thought just a couple of years ago that Intel would have a CPU
performing better than AMD's at a lower clock? And that this CPU would be
held back only by manufacturing?
However INTC is at its lowest in 3 years - $17.39 and sinking. In
fact, it sunk so deep that it might become a good investment just
before the official Core2 release - about a month from now. And
before that... Even $15 might be possible.
Meanwhile AMD is also sliding even faster than I expected. I thought
it would be around 25 by year end - now it looks more like low 20s...

NNN
 
Excellent. That is exactly what I was looking for.
The short summary appears to be:
- many significant architecture improvements (main points are 4
operations vs. 3 per cycle (~25% improvement), and fusion of several
external ops into one internal op (~10% improvement))
- higher frequency capability due to the 65 nm process
- better use of L1 cache due to shared access between the 2 cores
- more power management plus the 65 nm process results in better efficiency
Did I miss something critical?

That's shared *L2* cache, which, in a kinda brute force way, is probably
the most important key: even a single task has up to 4MB of L2 cache, i.e.
huge in comparison with anything else. I'm not sure the micro-architecture
enhancements are really that significant - e.g., how often is a typical app
going to have just the right mix/order of instructions to exercise all 4
operation paths simultaneously?

On the subject of memory latency, you have to consider whether you want to
classify speculative tricks which are trying to predict pseudo-random
memory access strides as a valid solution to umm, "latency". IMO it just
means that the latency benchmarks need to be rewritten.:-)
 
However INTC is at its lowest in 3 years - $17.39 and sinking. In
fact, it sunk so deep that it might become a good investment just
before the official Core2 release - about a month from now. And
before that... Even $15 might be possible.
Meanwhile AMD is also sliding even faster than I expected. I thought
it would be around 25 by year end - now it looks more like low 20s...

Intel's new price war and dumping of P4 chips at bargain basement
prices will hurt their stock short term, as will people holding off
for Conroe, but it seems likely it will hurt AMD more over the long
term.

Intel has much more cash to ride out any price war, the demand for
Conroe and related chips will continue to keep their fabs full, and
they're still expanding 300mm capacity. AMD is much more dependent on
their margins for cash flow, has relatively high startup costs in
getting their new fab online, and is much more at risk for margin
loss.

Their salvation in recent years has been the ability to compete in the
high-margin segments, and that's about to come under a pretty serious
attack. Intel sleeps sometimes, but when they wake up, they can be
formidable.

Whether AMD will stumble in bringing up the new fab is also a wild
card - it's happened before. Should be interesting times.

max
 
Intel has much more cash to ride out any price war, the demand for
Conroe and related chips will continue to keep their fabs full, and
they're still expanding 300mm capacity. AMD is much more dependent on
their margins for cash flow, has relatively high startup costs in
getting their new fab online, and is much more at risk for margin
loss.

Their salvation in recent years has been the ability to compete in the
high-margin segments, and that's about to come under a pretty serious
attack. Intel sleeps sometimes, but when they wake up, they can be
formidable.

Whether AMD will stumble in bringing up the new fab is also a wild
card - it's happened before. Should be interesting times.

Good for us "consumers" though. The AMD prices have been outrageous
recently.

Regards,
Evgenij
 
Mark said:
Not at all. There have been many benchmarks showing that the increase in
performance of increasing the cache is relatively minor in the majority of
situations. It is true that "micro-architectural improvements" generally
lead to relatively small increases in performance. The one big difference
that has always been associated with major leaps in performance is when you
increase the number of instructions per clock. The reason that Conroe is so
exciting is that it does increase the IPC (for the first time since the
Pentium Pro?). An AnandTech article states that "It can decode 4 x86
instructions per cycle, and sometimes 5 thanks to x86 fusion. AMD's Hammer
can do only 3." This is undoubtedly the major reason for the Core being so
much faster.

4 or 5 instructions/cycle is just a theoretical maximum. Most
architectures don't live up to their theoretical maximums. However, it's
possible that an improvement from 3 to 4 theoretical IPC might lead to
an improvement from 2.8 to 2.9 real IPC (for example), so it's still an
improvement in reality. However, I'm still skeptical, considering all of
the theoretical drivel we heard about P4's IPC.

I still think the real story here is the cache. It's shared between the
cores and dual-ported. They must've come up with a really good pre-fetch
algorithm. All of the cache snooping traffic between cores is eliminated
when you share the cache. However, it might lead to more security
breaches between threads running on separate cores.

Yousuf Khan
 
Yousuf Khan said:
I still think the real story here is the cache. It's shared between the
cores and dual-ported. They must've come up with a really good pre-fetch
algorithm. All of the cache snooping traffic between cores is eliminated
when you share the cache.

I think you may be right here, but we'll see.
However, it might lead to more security
breaches between threads running on separate cores.

How so? If the thread is allowed to access the physical address,
whether it's cached or not (or in another cache) doesn't seem to be
much of a difference, security-wise anyway.
 
Yousuf Khan said:
4 or 5 instructions/cycle is just a theoretical maximum. Most
architectures don't live up to their theoretical maximums. However, it's
possible that an improvement from 3 to 4 theoretical IPC might lead to an
improvement from 2.8 to 2.9 real IPC (for example), so it's still an
improvement in reality. However, I'm still skeptical, considering all of
the theoretical drivel we heard about P4's IPC.

You really seem to be twisting things around to try and support your view.
Apparently, a 3 IPC theoretical processor is able to hit 2.8 in actuality.
This reflects close to maximum theoretical efficiency. When you then bump
the IPC from 3 to 5?, you claim the actual IPC only increases by .1 to 2.9.
Why, if it is so efficient for 3, is it so massively inefficient for 4 and
5? I would agree that as you add IPCs, then the relative increase in
performance would reflect diminishing returns, but I don't see why you are
claiming that 3 is close to ideal, but 4 and 5 are practically useless.

I think a much better speculation would be (in general terms) that a 3 IPC
processor does on average 2 IPC in reality, and then the Conroe with 5 IPC
in theory does 3 IPC in reality. You still have diminishing returns through
adding extra IPCs, i.e., the value of each IPC is .66 for the 3 IPC, but
adding 2 more IPC leads to only .5 for each, but it does make a significant
difference all the same. This would lead to the prediction that Conroe would
be 50% faster. It isn't, and suggests that the relative efficiency of adding
more IPCs is even less. However, it is easy to believe that a 25%
performance improvement would be in the scope of the changes in IPC, and
other changes in the processor are like other microarchitectural changes -
generally leading to quite small increases in performance.
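The back-of-the-envelope arithmetic above reduces to a few lines of Python. Note the 2.0 and 3.0 "real IPC" figures are the post's own guesses, not measurements:

```python
# Model of the IPC speculation above: theoretical issue width
# vs. the sustained IPC each core is *guessed* to achieve.
# (The 2.0 and 3.0 figures come from the post, not from data.)

def speedup(real_ipc_old, real_ipc_new):
    """Relative clock-for-clock gain from higher sustained IPC."""
    return real_ipc_new / real_ipc_old - 1.0

# 3-wide core sustaining 2.0 IPC vs. Conroe sustaining 3.0 IPC:
print(f"predicted gain: {speedup(2.0, 3.0):.0%}")        # 50%

# Diminishing returns per issue slot, as the post argues:
print(f"value per original slot: {2.0 / 3:.2f}")         # 0.67
print(f"value per extra slot:    {(3.0 - 2.0) / 2:.2f}") # 0.50
```

Since the measured gain is closer to 25% than 50%, the post's conclusion follows: either the extra slots are worth even less than 0.5 IPC each, or the baseline sustains more than 2.0.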
 
Not at all. There have been many benchmarks showing that the increase in
performance of increasing the cache is relatively minor in the majority of
situations. It is true that "micro-architectural improvements" generally
lead to relatively small increases in performance. The one big difference
that has always been associated with major leaps in performance is when you
increase the number of instructions per clock. The reason that Conroe is so
exciting is that it does increase the IPC (for the first time since the
Pentium Pro?). An AnandTech article states that "It can decode 4 x86
instructions per cycle, and sometimes 5 thanks to x86 fusion. AMD's Hammer
can do only 3." This is undoubtedly the major reason for the Core being so
much faster.

What utter rubbish.... and the best ref you have is Anand, who doesn't even
appear to know what latency actually means, nor to be capable of finding a
benchmark to test for it. 4/3 decodes is a minor issue. I have to ask if
you've ever seen the instruction sequences which get thrown at the CPU?
Have you ever looked at a compiler output and/or tried to write "optimized"
x86-32 code? The biggest thing here is the brute force effect of the
shared L2 cache.
 
More than a few. There's a complete extra ALU+SSE2 _and_
the issue port to run it. With very tight (library) ASM, this
core will be quad issue. Most likely not on compiled x86-32
code (maybe on x86-64) because of only a single load port.
What wasn't mentioned were the multiplier(s).

The read/write reorder buffer algorithm has been significantly
altered, most likely for the better.

Again I see this as having more impact in x64.
always been a little dubious how much gain they actually get
simply from micro-architecture in the real world.

Very true. Even the lame P4 can excel at certain linear
crunches. It was designed to [multimedia].

Odd how Trace Cache has been disowned though - no? I always thought that
was a good idea.
But I think the real story here is Core 2's cache. Intel
is managing to get the same levels of latency from Core 2
that AMD gets from AMD64, even without an inboard memory
controller! It's likely that Core 2 is driving close to
maximum performance out of its FSB, more often than any
previous Intel architecture.

I think not. The low apparent latency most likely is due to
intelligent [MCH] prefetch. Not a bad thing, but no substitute
for the real thing when doing unpredictable hop-scotching
like traversing a relational database.

But if you don't have a humungous cache, speculative prefetch can easily
get a wee bit destructive. Of course, now "we" also need to work a bit
harder on progs for "real" latency... maybe even some structures/code which
are designed to bring Conroe to its knees.:-)
Still, this Intel Core2 looks very good, and I expect it
to be competitive or beat the AMD K7 clock-for-clock
on most [linear] benchmarks. I expect it will only fail
on pseudorandom chases. Unless it has lame multipliers.

Ya mean K8 there but yes it does look like the frog has been umm, leapt...
though there *is* just a hint of maybe jumping the gun on this - I wonder.
Let's hope that a suitable response is not long in coming. We also now
know that some of the early reports were BS.
 
In comp.sys.ibm.pc.hardware.chips George Macdonald said:
Again I see this as having more impact in x64.

Perhaps, but it looks more size neutral to me. Shorten
the stall in exchange for an occasional bigger rollback.
Odd how Trace Cache has been disowned though - no? I always
thought that was a good idea.

Agreed. That was one good thing in P4, although the single
decoder would add to the mispredicted branch stall. Still,
fetching code from memory (and maybe L2) costs more time than
executing it! So optimizing one-thru code is uninteresting.
But if you don't have a humungous cache, speculative prefetch
can easily get a wee bit destructive.

Certainly! It probably takes a while to kick in.
Of course, now "we" also need to work a bit harder on progs
for "real" latency... maybe even some structures/code which
are designed to bring Conroe to its knees.:-)

Beating that prefetch should be fairly easy. Fragment memory
a bit, then hopscotch between pages. Maybe even run a PRNG.
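Robert's recipe for beating the prefetcher - pseudo-random pointer chasing - can be sketched as follows. A real latency benchmark would do the chase in C over a working set far larger than the cache; this only illustrates the access pattern, where each index comes from the previous load so a stride prefetcher has nothing to latch onto:

```python
import random

def make_chain(n, seed=0):
    """Build chain[] so that following chain[p] repeatedly visits
    all n slots in one random-order cycle (no exploitable stride)."""
    rng = random.Random(seed)
    order = list(range(n))
    rng.shuffle(order)
    chain = [0] * n
    for k in range(n):
        chain[order[k]] = order[(k + 1) % n]
    return chain

def chase(chain, start=0):
    """Walk the chain until it returns to start; each lookup's
    address depends on the previous one, defeating prediction."""
    p, steps = start, 0
    while True:
        p = chain[p]
        steps += 1
        if p == start:
            return steps   # one full lap == len(chain) steps
```

Timing `chase` in a compiled language over a multi-megabyte array is the usual way to expose "real" memory latency rather than the prefetch-assisted figure.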
Ya mean K8 there but yes it does look like the frog has

Yes. Sorry for the typo.
been umm, leapt... though there *is* just a hint of maybe
jumping the gun on this - I wonder. Let's hope that a
suitable response is not long in coming. We also now know
that some of the early reports were BS.

We will certainly see. Unless someone has to make a big bet
right now, the hype doesn't matter. It will all come out.
I do remember the P4 launch was really hyped, and the chip
delivered was even more crippled than imaginable.

-- Robert
 
Perhaps, but it looks more size neutral to me. Shorten
the stall in exchange for an occasional bigger rollback.

I'm thinking more visible registers for the programmer, to write code which
is more amenable to "hoisting" of memory->register moves. E.g. 6 regs to
play with inhibits things like loop unrolling.
We will certainly see. Unless someone has to make a big bet
right now, the hype doesn't matter. It will all come out.
I do remember the P4 launch was really hyped, and the chip
delivered was even more crippled than imaginable.

I still think the P4 was built as a DRDRAM "machine" - had some great
things, like the quad-clocked FSB with dynamic bus inversion but then some
curious retrograde things, like a Write Through 8K L1 DCache.

It's gonna be really interesting to see what AMD does with the extra
real-estate from 65nm.
 
Keith said:
How so? If the thread is allowed to access the physical address,
whether it's cached or not (or in another cache) doesn't seem to be
much of a difference, security-wise anyway.

There was this pretty well-known security bulletin from last year, which
was based on Hyperthreading, because the threads shared the same cache.
The same sort of thing could possibly happen under a shared-cache dual-core.

Study: Intel's hyperthreading could expose servers
http://www.computerworld.com/securitytopics/security/story/0,10801,101769,00.html

Yousuf Khan
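The class of attack that bulletin describes works through eviction in a shared cache: one thread's accesses displace the other's lines, and the displaced thread can detect which lines went missing (in the real attack, by timing reloads). A toy direct-mapped cache model, purely illustrative and nothing like the actual exploit, shows the eviction mechanics:

```python
# Toy direct-mapped cache shared between two "threads". A victim
# access evicts whatever spy line occupied the same set, so the
# spy learns which sets the victim touched - the basis of
# prime+probe style leaks. Sketch only, not the real attack.

N_SETS = 8
cache = {}  # set index -> (owner, tag)

def access(owner, addr):
    """Return True on hit, False on miss; always fills the line."""
    s = addr % N_SETS
    hit = cache.get(s) == (owner, addr // N_SETS)
    cache[s] = (owner, addr // N_SETS)
    return hit

# 1. Spy primes every set with its own lines.
for s in range(N_SETS):
    access("spy", s)

# 2. Victim touches addresses 3 and 5 (a secret-dependent pattern).
for a in (3, 5):
    access("victim", a)

# 3. Spy probes: the misses reveal exactly which sets the victim used.
evicted = [s for s in range(N_SETS) if not access("spy", s)]
print(evicted)   # [3, 5]
```

A partitioned or per-core cache breaks step 3, which is why sharing the L2 between cores reopens the question Hyperthreading first raised.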
 