Macro-Op fusion does not work in 64-bit mode

  • Thread starter: YKhan

YKhan

"The thing is though, while MOF may be touted as the best thing since
sliced bread, it does not cause many performance problems when it is
off. It appears that the bottleneck in the CPU is not in that aspect of
the pipeline, so its loss has little speed impact. More on this when
the testing is complete."
http://www.theinquirer.net/default.aspx?article=33347

Macro-op Fusion was one of the big hype items of
Conroe/Merom/Woodcrest. This feature is supposed to be one of the
things giving Intel its edge over AMD in the performance wars. Now it
turns out that it doesn't even work in 64-bit mode. But apparently it's
no big deal. Most of us have already figured out that the real secret
behind CMW is its big L2 cache, but Intel downplayed that. So Intel
can't have it both ways: either MOF is important, and Intel will have
to explain why it isn't available in 64-bit mode and why CMW is
crippled in that mode, or MOF isn't important, and Intel has to admit
that it's all due to the cache.

Yousuf Khan
 
"The thing is though, while MOF may be touted as the best thing since
sliced bread, it does not cause many performance problems when it is
off. It appears that the bottleneck in the CPU is not in that aspect of
the pipeline, so its loss has little speed impact. More on this when
the testing is complete."
http://www.theinquirer.net/default.aspx?article=33347

Macro-op Fusion was one of the big hype items of
Conroe/Merom/Woodcrest. This feature is supposed to be one of the
things giving Intel its edge over AMD in the performance wars. Now it
turns out that it doesn't even work in 64-bit mode. But apparently it's
no big deal. Most of us have already figured out that the real secret
behind CMW is its big L2 cache, but Intel downplayed that. So Intel
can't have it both ways: either MOF is important, and Intel will have
to explain why it isn't available in 64-bit mode and why CMW is
crippled in that mode, or MOF isn't important, and Intel has to admit
that it's all due to the cache.

Actually, I've been rather adamant that there are a LOT of factors
affecting performance in the Core architecture. Sure, the extra cache
helps. Faster bus speed helps too, and more pipelines, better
decoders, an excellent branch predictor, improved TLBs and, hey, even
Macro-Op Fusion, just to name a few. Take away any one of these and
you are going to lose some performance. Going from 4MB to 2MB of
cache costs about 3.5% performance (see:
http://www.anandtech.com/cpuchipsets/showdoc.aspx?i=2795&p=4 ), while
1MB of L2 would probably drop performance further. Substantial, yes,
but not nearly enough to make up for the improvements vs. either the
Athlon64 X2 or the Core Duo (Yonah) chips before it.

Or they could just tell the truth: Macro-Op Fusion is just one of many
features that help performance. It's also supposed to reduce power
consumption slightly. All in all, it's damn near impossible to predict
just how much the loss of this one feature will really change things,
since there are many other variables that come into play here.
 
YKhan said:
"The thing is though, while MOF may be touted as the best thing since
sliced bread, it does not cause many performance problems when it is
off. It appears that the bottleneck in the CPU is not in that aspect of
the pipeline, so its loss has little speed impact. More on this when
the testing is complete."
http://www.theinquirer.net/default.aspx?article=33347

Macro-op Fusion was one of the big hype items of
Conroe/Merom/Woodcrest. This feature is supposed to be one of the
things giving Intel its edge over AMD in the performance wars. Now it
turns out that it doesn't even work in 64-bit mode. But apparently it's
no big deal. Most of us have already figured out that the real secret
behind CMW is its big L2 cache, but Intel downplayed that. So Intel
can't have it both ways: either MOF is important, and Intel will have
to explain why it isn't available in 64-bit mode and why CMW is
crippled in that mode, or MOF isn't important, and Intel has to admit
that it's all due to the cache.

Yousuf Khan

If you actually looked at the benchmarks, you would realize that the
improved performance cannot be attributed to the cache alone.
 
The said:
Is it really just the cache and nothing else? :P

Well, it might also be the predictive algorithms for populating the
cache, but that's really part of the cache.

Yousuf Khan
 
Mark said:
If you actually looked at the benchmarks, you would realize that the
improved performance cannot be attributed to the cache alone.

The cache is 4 times bigger than anything AMD has. What else would it
be? We've already shown it's not macro-op fusion.

Yousuf Khan
 
Yousuf said:
Well, it might also be the predictive algorithms for populating the
cache, but that's really part of the cache.

What about other things like the out-of-order load/store? That's memory
and not cache. It seems that everything just adds a small %, thus adding
up, while individually the large cache or whatever does not appear to
be the "key" component.
 
What about other things like the out-of-order load/store? That's memory
and not cache. It seems that everything just adds a small %, thus adding
up, while individually the large cache or whatever does not appear to
be the "key" component.

The out-of-order load/store *is* predictive, in particular the
disambiguation, and was said to include speculative components, without
further elucidation from Intel. The large cache is an important part of
such a strategy, to avoid/minimize its negative effects. It's quite rare
for microarchitecture tweaks like op-fusion or additional pipeline paths
to yield benefits which are consistently measurable.

I *do* wish that the benchmarkers would quit quoting "latency" performance
using a program which is now clearly insufficient for the job.
 
Well, it might also be the predictive algorithms for populating the
cache, but that's really part of the cache.

Predictive algorithms are part of the load/store or fetch units,
of which the dcache and icache are a part, but I wouldn't say any
prefetching was part of the cache, per se. Caches are pretty dumb.

Sorta like saying the multiply algorithm is part of the register
file...
 
Yousuf Khan said:
The cache is 4 times bigger than anything AMD has. What else would it be?
We've already shown it's not macro-op fusion.

Yousuf Khan

How much impact would something like a wider execution path make? This is
coming from someone who is more of a layman than anything else when it comes
to the specifics of how CPUs actually perform their duties, so I'm asking
out of curiosity. Having read an analysis off of the AnandTech website, one
of the key architectural changes they point out is how much wider the Core 2
is compared to a PIII/P4/Athlon64. Core 2, for instance, is the only core
among those that can execute 128-bit SSE instructions in a single cycle. Is
this the type of thing that might add up to create a real impact?

Carlo
 
Carlo said:
How much impact would something like a wider execution path make? This is
coming from someone who is more of a layman than anything else when it comes
to the specifics of how CPUs actually perform their duties, so I'm asking
out of curiosity. Having read an analysis off of the AnandTech website, one
of the key architectural changes they point out is how much wider the Core 2
is compared to a PIII/P4/Athlon64. Core 2, for instance, is the only core
among those that can execute 128-bit SSE instructions in a single cycle. Is
this the type of thing that might add up to create a real impact?

I'm sure it helps during SSE instructions. Can't see it being a big part
of the equation though, just like SSE itself isn't a big part of programs.

Yousuf Khan
 
The cache is 4 times bigger than anything AMD has.

AMD has chips with 2MB of cache (2 x 1MB) and so does Intel. Intel
chips are MUCH faster, clock for clock, when compared with equal
quantities of cache.
What else would it
be? We've already shown it's not macro-op fusion.

How about the fact that Intel has 4 instruction decoders to AMD's 3,
an extra LOAD/STORE unit, 3 fully pipelined SSE units vs. K8's 2
partially pipelined, more and better branch predictors, much larger
TLBs, larger OoO reorder buffer, more advanced scheduler... to name a
few. And that's entirely separate from the better data prefetching
and greater cache bandwidth that, as you mentioned in another message,
are all related to cache.

Besides, we don't really know how much macro-op fusion is helping,
since we haven't seen any apples-to-apples comparison. 32-bit with
macro-op fusion vs. 64-bit without it doesn't really tell us, even
if only relative to AMD's 32-bit vs. 64-bit numbers. Intel might have
just done a better implementation of 64-bit x86 (AMD's K8 does have a
compromise or two in 64-bit mode as well), and that made up for the
loss in performance from Macro-op Fusion.

Long story short, there is a LOT more to the Core architecture than
just cache. Other than the integrated memory controller, Core is a
more advanced chip start to finish when compared to AMD's K8.
Fortunately for AMD, most of these advantages are incremental in
nature and their more modular K8L design could theoretically allow
them to phase such features into future processors.
 
AMD has chips with 2MB of cache (2 x 1MB) and so does Intel. Intel
chips are MUCH faster, clock for clock, when compared with equal
quantities of cache.

I think what Yousuf is getting at is that in a single task benchmark
situation, you have 4MB of L2 cache for that single task, multithreaded or
not.
How about the fact that Intel has 4 instruction decoders to AMD's 3,
an extra LOAD/STORE unit, 3 fully pipelined SSE units vs. K8's 2
partially pipelined, more and better branch predictors, much larger
TLBs, larger OoO reorder buffer, more advanced scheduler... to name a
few. And that's entirely separate from the better data prefetching
and greater cache bandwidth that, as you mentioned in another message,
are all related to cache.

Looking back, it's not often that inner core microarchitecture tweaks have
yielded that much performance benefit. To me there are two clues here:

1) The fact that there are benchmarks where C2D shows near-zero benefit
vs. AMD64 points to the memory/cache subsystem, and how it's manipulated,
as the important provider of performance in the other benchmarks where
C2D wins handily. In particular, when disambiguation "hits", it hits
*big*; when it "misses", the penalty drags performance back down. And
when it "hits", it depends heavily on the large cache and associativity
to avoid thrashing.

2) The ridiculous C2D "latency" measurements being published, all using
the same chipset on which a P4 is a latency dog, are an indication that
speculation on stride size and load/store re-ordering makes a *huge*
contribution to performance. Of course, what this really means is that
the current latency benchmark is obsolete: it makes no sense that a
system with an FSB, where the real round-trip latency is illustrated by
the P4 measurements, can beat a system with an on-board memory
controller. Again, without the large L2 cache, the strategy would fall
down.
Besides, we don't really know how much macro-op fusion really is
helping since we haven't seen any apples to apples comparison. 32-bit
with macro-op fusion vs. 64-bit without it doesn't really help, even
if only relative to AMD's 32-bit vs. 64-bit numbers. Intel might have
just done a better implementation of 64-bit x86 (AMD's K8 does have a
compromise or two in 64-bit mode as well) and that made up for the
loss in performance from Macro-op Fusion.

Long story short, there is a LOT more to the Core architecture than
just cache. Other than the integrated memory controller, Core is a
more advanced chip start to finish when compared to AMD's K8.
Fortunately for AMD, most of these advantages are incremental in
nature and their more modular K8L design could theoretically allow
them to phase such features into future processors.

"Incremental" is correct. ;-)
 