AMD quad cores: the whole story unfolded

  • Thread starter: YKhan

Charlie at the Inq is attempting to clear up the AMD quad-core roadmap,
removing all confusing stories (which mainly he himself started) about
the timetable for 4-core introduction. Although he tried to clear up
the confusion, I don't think that guy can write two sentences without
them contradicting each other. Hard skill to acquire, even harder to
get rid of. So I'll summarize even more, and then you can go and read
the article:

-2007Q2: Barcelona core, Rev. H (aka K8L), HT2.0
-2007Q4: Budapest core, same Rev. H, HT3.0 for Socket AM2.
-2008Q1: Shanghai core, same as Budapest core for Socket F.

Apparently the big news here is the introduction of HT3.0. It's a big
enough change that AMD expects its partners to miss their chipset
introduction schedules for products that take advantage of the new
interface. So AMD is initially going to introduce Rev. H with an
old-fashioned HT2.0 interface, and it'll synchronize the introduction of
Budapest and Shanghai when all partners are ready with their appropriate
products; in the meantime we'll still be able to see Rev. H in action.

AMD quad cores: the whole story unfolded
"Barcelona, Shanghai, Budapest and 65 nanometre"
http://www.theinquirer.net/default.aspx?article=34433
 
Apparently the big news here is the introduction of HT3.0. It's a big
enough change that AMD expects its partners to miss their chipset
introduction schedules for products that take advantage of the new interface.

Huh?? A year+ isn't enough time to adjust to that??







 
Charlie at the Inq is attempting to clear up the AMD quad-core roadmap,
removing all confusing stories (which mainly he himself started) about
the timetable for 4-core introduction. Although he tried to clear up
the confusion, I don't think that guy can write two sentences without
them contradicting each other. Hard skill to acquire, even harder to
get rid of.

Agreed, the guy can create more confusion in one para than anybody else
I've read.
So I'll summarize even more, and then you can go and read
the article:

-2007Q2: Barcelona core, Rev. H (aka K8L), HT2.0
-2007Q4: Budapest core, same Rev. H, HT3.0 for Socket AM2.
-2008Q1: Shanghai core, same as Budapest core for Socket F.

Apparently the big news here is the introduction of HT3.0. It's a big
enough change that AMD expects its partners to miss their chipset
introduction schedules for products that take advantage of the new
interface. So AMD is initially going to introduce Rev. H with an
old-fashioned HT2.0 interface, and it'll synchronize the introduction of
Budapest and Shanghai when all partners are ready with their appropriate
products; in the meantime we'll still be able to see Rev. H in action.

AMD quad cores: the whole story unfolded
"Barcelona, Shanghai, Budapest and 65 nanometre"
http://www.theinquirer.net/default.aspx?article=34433

I still don't get what is happening with dual-core?... if anything. Does
this mean that AMD has no planned dual-core part at 65nm... or just that
all parts will be targeted at quad-core and the duals will be the failed
quad parts? I *hope* this is wrong.
 
George said:
Agreed, the guy can create more confusion in one para than anybody else
I've read.

I still don't get what is happening with dual-core?... if anything. Does
this mean that AMD has no planned dual-core part at 65nm... or just that
all parts will be targeted at quad-core and the duals will be the failed
quad parts? I *hope* this is wrong.

See, there's Charlie-derived confusion already! :-)

I think all this means is that they're only talking about the plans for
4-core, since that's what most people are interested in. I'm sure the
dual-cores are coming out in Rev. H form too, but nobody is worried
about those.

Yousuf Khan
 
The said:
Huh?? A year+ isn't enough time to adjust to that??

Who knows, but AMD is apparently planning for it, by having the
contingency plan ready. Well, I guess you can't call it a contingency
plan, since it's actually the main plan. What's HT3.0 supposed to have
anyways that's not in HT2.0? The only thing I've heard about is that
it's going to allow for Hypertransport cables to connect between system
boards. This seems mainly useful for server situations. What's it going
to be good for in the PC realm, other than being faster?

Yousuf Khan
 
See, there's Charlie-derived confusion already! :-)

I didn't think I was losing my interpretive skills... but.:-0 You must
have seen this one: http://www.theinquirer.net/default.aspx?article=33906
where after 6 paras or so of talking about quad-core he states: "Now, you
notice that covers 2C chips, what about QC/4C?"<gawp>
I think all this means is that they're only talking about the plans for
4-core, since that's what most people are interested in. I'm sure the
dual-cores are coming out in Rev. H form too, but nobody is worried
about those.

I just hope that the dual cores are not going to be squeezed on L2 cache so
that four cores can fit on a die. Personally I'm *not* "interested" - I
fail to see how four cores is going to be a big advantage to anybody on
desktop; software is going to take years to catch up, if ever. I'm
beginning to think I *might* be disappointed by AMD's first 65nm efforts.
 
George said:
I didn't think I was losing my interpretive skills... but.:-0 You must
have seen this one: http://www.theinquirer.net/default.aspx?article=33906
where after 6 paras or so of talking about quad-core he states: "Now, you
notice that covers 2C chips, what about QC/4C?"<gawp>

Nah, didn't see that one, thank god. I'd say his latest piece of work
supersedes that one anyways. :-)
I just hope that the dual cores are not going to be squeezed on L2 cache so
that four cores can fit on a die. Personally I'm *not* "interested" - I
fail to see how four cores is going to be a big advantage to anybody on
desktop; software is going to take years to catch up, if ever. I'm
beginning to think I *might* be disappointed by AMD's first 65nm efforts.

What do you mean by "squeezed on L2 cache"?

Yousuf Khan
 
What do you mean by "squeezed on L2 cache"?

I'd guess he means he hopes that AMD is not skimping on cache in favor
of extra cores. To be honest, I think that AMD would be fine with
512KB L2/core. The 1MB is obviously better, but I think performance
would be alright with 512KB/core, especially if there are robust
mechanisms for communication between the different caches.

To be honest, I'm pretty darn confused about AMD's roadmap myself. I
know there are quad cores out there next year toward the middle of the
year, but...that's all I'm sure of.

DK
 
I'd guess he means he hopes that AMD is not skimping on cache in favor
of extra cores. To be honest, I think that AMD would be fine with
512KB L2/core. The 1MB is obviously better, but I think performance
would be alright with 512KB/core, especially if there are robust
mechanisms for communication between the different caches.

Yep on the first comment.

Hmmm, to compete against Conroe and offspring I think they have to
consider bigger than 512KB; I'm still convinced that Conroe's most
spectacular performance is helped considerably by the 4MB available for
each core. With their exclusive caching scheme, I don't see a unified L2
being a practical route for AMD.
To be honest, I'm pretty darn confused about AMD's roadmap myself. I
know there are quad cores out there next year toward the middle of the
year, but...that's all I'm sure of.

According to that latest "from the horse's mouth" Inquirer article, the
65nm Rev H core, in quad form, is in full line production now with finished
wafers expected in December. That would mean that 2Q07 is a reasonable
target for "availability" and that the 65nm Rev F shrink was a red
herring... but then again, you always have to read between Charlie's
(garbled) lines.:-)
 
What do you mean by "squeezed on L2 cache"?
Yep on the first comment.

Hmmm, to compete against Conroe and offspring I think they have to
consider bigger than 512KB;

Remember, most conroes are 2MB, not 4MB. It would be sufficient to
have a couple of FX models with larger caches to compete with the
Conroe XE. Of course, I'm sure you're mention that 4x4 should take on
that role ; )

Obviously 1MB L2/core would be better, but I don't know how feasible
that is for a quad core part. I think that would be pretty unhappy for
the MFG guys.
I'm still convinced that Conroe's most
spectacular performance is helped considerably by the 4MB available for
each core.

That is certainly a component. However, I think there are a lot of
other factors. The folks I know are *very* impressed by the
prefetching capabilities.
With their exclusive caching scheme, I don't see a unified L2
being a practical route for AMD.

It would eat up a lot of bandwidth, yes. It's unclear to me exactly
how they plan to do the L3 cache. I like caches with write-through
(i.e. inclusion) a lot for the purposes of coherency, which is of
growing importance for CMP designs. However, I think non-exclusive,
non-inclusive caches are fine too (replicate L1 tags for the same
effect). Exclusive caches I don't really like much because I feel it
gives up a lot on bandwidth. Another problem is that generally you
want different levels of the cache hierarchy to be at least a factor of
8 larger/smaller in size for inclusion.

I don't think AMD wants to shrink their L1, which means they are stuck
with an exclusive L2 unless it's 1MB or larger.

DK
 
chrisv said:
Today, yes. 6 months from now, when the 4MB cache parts are $100 less
than they are today?

Yes, and the 4MB parts are just now starting to come online finally.
Price reductions are on their way.

Yousuf Khan
 
Remember, most conroes are 2MB, not 4MB. It would be sufficient to
have a couple of FX models with larger caches to compete with the
Conroe XE. Of course, I'm sure you're mention that 4x4 should take on
that role ; )

I'm not sure what that last sentence means with the typo an' all but if
you're suggesting that I am "hot" for 4x4, see my post just above. Even if
4x4 *can* find a niche, I don't care - it's of no interest to me. Intel's
strategy makes more sense: use the extra 65nm real estate to augment the
core with clever logic, more L2 and screw the single-die quad core.
Obviously 1MB L2/core would be better, but I don't know how feasible
that is for a quad core part. I think that would be pretty unhappy for
the MFG guys.


That is certainly a component. However, I think there are a lot of
other factors. The folks I know are *very* impressed by the
prefetching capabilities.

The prefetching, speculative stride size, speculative load re-ordering and
memory disambiguation all depend on the L2 cache not getting munged due to
mispredictions - umm, bigger is better.:-)
It would eat up a lot of bandwidth, yes. It's unclear to me exactly
how they plan to do the L3 cache. I like caches with write-through
(i.e. inclusion) a lot for the purposes of coherency, which is of
growing importance for CMP designs. However, I think non-exclusive,
non-inclusive caches are fine too (replicate L1 tags for the same
effect). Exclusive caches I don't really like much because I feel it
gives up a lot on bandwidth. Another problem is that generally you
want different levels of the cache hierarchy to be at least a factor of
8 larger/smaller in size for inclusion.

I think there are additional logic problems with a unified L2 and exclusive
L1/L2 to preserve exclusivity... which could be wasteful: e.g. as an
extreme case, all L1s have the same cache line in L1 - one "evicts" I'd
think the others would have to mark the L1 copy as invalid. Am I missing
something here?
I don't think AMD wants to shrink their L1, which means they are stuck
with an exclusive L2 unless it's 1MB or larger.

You mean the total L1+L2 is required to have a decent amount of cache
on-die?.. makes sense. I still don't get why they persist with 2-way set
associative L1. In fact if they'd increased the default page size for
64-bit and made it 8-way they could have done something about their L1
latency.
 
George said:
[snip]
Remember, most conroes are 2MB, not 4MB. It would be sufficient to
have a couple of FX models with larger caches to compete with the
Conroe XE. Of course, I'm sure you're mention that 4x4 should take on
that role ; )

I'm not sure what that last sentence means with the typo an' all but if
you're suggesting that I am "hot" for 4x4, see my post just above.

I meant 'I'm sure you'll mention'...
Even if
4x4 *can* find a niche, I don't care - it's of no interest to me. Intel's
strategy makes more sense: use the extra 65nm real estate to augment the
core with clever logic, more L2 and screw the single-die quad core.

I totally agree. I'd much rather see MCMs than weird motherboards.
The prefetching, speculative stride size, speculative load re-ordering and
memory disambiguation all depend on the L2 cache not getting munged due to
mispredictions - umm, bigger is better.:-)

Well, I'm not entirely sure how prefetching works. I would imagine
that they probably have a small separate prefetch buffer, as well as a
streaming load buffer.

Bigger is always better for caches though if you can keep the access
time down.
I think there are additional logic problems with a unified L2 and exclusive
L1/L2 to preserve exclusivity... which could be wasteful: e.g. as an
extreme case, all L1s have the same cache line in L1 - one "evicts" I'd
think the others would have to mark the L1 copy as invalid. Am I missing
something here?

I can think of ways around this problem, but they are complicated and
ugly. Generally victim buffers (i.e. the K7/8's L2 cache) are the last
level of cache, take the POWER4 for instance (perhaps the POWER5 as
well).
You mean the total L1+L2 is required to have a decent amount of cache
on-die?.. makes sense.

No, I mean if you were to have a write through L1 (i.e. L2 includes
L1), then a 128KB L1 is bad news for an L2 cache that is under 1MB. If
you had a 512KB L2, then you'd basically be wasting about 1/4 of it on
inclusion, which is really quite bad.
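DK's quarter figure is just L1/L2; a two-line sanity check of the arithmetic (Python sketch, sizes in KB):

```python
# With an inclusive L2, every L1 line is duplicated in L2, so the
# fraction of L2 "wasted" on inclusion is simply L1_size / L2_size.
def inclusion_waste(l1_kb, l2_kb):
    return l1_kb / l2_kb

print(inclusion_waste(128, 512))   # 0.25  -> a quarter of a 512KB L2
print(inclusion_waste(128, 1024))  # 0.125 -> an eighth of a 1MB L2
```

The 1MB case is where the rough factor-of-8 rule of thumb between cache levels kicks in.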

If it's non-exclusive, non-inclusive it might be alright, but you still
run the risk of having problems with too much replication.
I still don't get why they persist with 2-way set
associative L1.

Because the advantage of higher associativity rapidly falls off above 2
or 4 when the cache is that big. Basically, the probability of a
conflict in a 32KB cache is a lot larger than in a 128KB cache (4x as
many memory locations map to a cache line in the 32KB cache) and
associativity is fundamentally about reducing conflict misses.
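The 4x figure is easy to check with back-of-the-envelope arithmetic (Python sketch; 64-byte lines and a 32-bit physical address space are assumed for illustration):

```python
# How many distinct memory blocks compete for each set of a cache?
# Smaller cache, same associativity -> fewer sets -> more contention per set.
LINE = 64            # assumed cache line size in bytes
ADDR_SPACE = 2**32   # assumed 32-bit physical address space

def blocks_per_set(cache_bytes, ways):
    sets = cache_bytes // (LINE * ways)
    return (ADDR_SPACE // LINE) // sets

# 32KB vs 128KB, both 2-way:
print(blocks_per_set(32 * 1024, 2))    # 262144 blocks contend per set
print(blocks_per_set(128 * 1024, 2))   # 65536  -> exactly 4x fewer
```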

Also, I suspect AMD's L1 cache is probably in the critical path. More
associativity = lower frequency.
In fact if they'd increased the default page size for
64-bit and made it 8-way they could have done something about their L1
latency.

Huh? Could you elaborate?

DK
 
George said:
[snip]
Remember, most conroes are 2MB, not 4MB. It would be sufficient to
have a couple of FX models with larger caches to compete with the
Conroe XE. Of course, I'm sure you're mention that 4x4 should take on
that role ; )

I'm not sure what that last sentence means with the typo an' all but if
you're suggesting that I am "hot" for 4x4, see my post just above.

I meant 'I'm sure you'll mention'...

Uhh, so yes you were presuming too much and missing my previous comments on
4x4. I'm not so negative as you - I believe it could find a place in
gaming, if the game programmers really get to grips with multi-core; the
small server mentioned by others is umm, not impossible.:-)
I totally agree. I'd much rather see MCMs than weird motherboards.

I think you mean MCP which is what Intel has for quad-core... not what is
normally known as MCM, which would be horribly expensive from what I hear.

As for weird motherboards I've been intrigued by some of the *amazing*
(extravagant ?) results claimed by http://www.siliconpipe.com/ - it's just
err, remotely possible that their unnamed licensee, signed Aug 21, could be
AMD... which would be quite a revolution in high-speed interconnect: i.e.
stick a HT link on a foil ribbon and the mbrd is less "weird". Note that
Intel has been fiddling with this stuff too:
http://www.edn.com/article/CA6362694.html
Well, I'm not entirely sure how prefetching works. I would imagine
that they probably have a small separate prefetch buffer, as well as a
streaming load buffer.

That was my point: AIUI, the prefetch is initially done to the L2; if
there's a mispredict, e.g. on load re-order, it just sits there till it
goes stale and gets overwritten; with aggressive
prefetch/speculation/prediction the cache needs to be large to keep the
%age of poisoning down - below a certain size predictions get
self-defeating.
Bigger is always better for caches though if you can keep the access
time down.


I can think of ways around this problem, but they are complicated and
ugly. Generally victim buffers (i.e. the K7/8's L2 cache) are the last
level of cache, take the POWER4 for instance (perhaps the POWER5 as
well).

I think "wasteful" covers "complicated & ugly".:-)
No, I mean if you were to have a write through L1 (i.e. L2 includes
L1), then a 128KB L1 is bad news for an L2 cache that is under 1MB. If
you had a 512KB L2, then you'd basically be wasting about 1/4 of it on
inclusion, which is really quite bad.

Hmm, I don't think write through L1 is a good idea -- c.f. P4 -- and is not
necessary for non-exclusive of course.
If it's non-exclusive, non-inclusive it might be alright, but you still
run the risk of having problems with too much replication.

I believe C2D is L1 write back, non-exclusive L2, which is also write back,
much the same as P3/P-M was.
Because the advantage of higher associativity rapidly falls off above 2
or 4 when the cache is that big. Basically, the probability of a
conflict in a 32KB cache is a lot larger than in a 128KB cache (4x as
many memory locations map to a cache line in the 32KB cache) and
associativity is fundamentally about reducing conflict misses.

128KB split as I+D is not that big, even compared with C2D which is 64KB
split I+D and is 8-way. In my analysis, and with some real programs, you
really want at least 3 ways to avoid collisions and churning for many
tasks. With only 2 ways and "perfectly aligned" arrays, you can get quite
a mess - programmers should not have to deal with this, though they often
get a kick out of working around it.:-)
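That aligned-arrays mess is easy to reproduce with a toy LRU set (Python sketch; the three-array round-robin pattern is a made-up illustration, not any particular program):

```python
# Three arrays whose bases alias to the same cache set thrash a 2-way
# LRU set completely, but settle down immediately in a 3-way one.
from collections import deque

def misses(ways, refs):
    """Count misses for a single cache set with LRU replacement."""
    lru, miss = deque(maxlen=ways), 0
    for tag in refs:
        if tag in lru:
            lru.remove(tag)
        else:
            miss += 1
        lru.append(tag)   # most recently used goes to the right
    return miss

pattern = ["A", "B", "C"] * 8    # round-robin touches, 24 references
print(misses(2, pattern))        # 24 -> every single access misses
print(misses(3, pattern))        # 3  -> only the three cold misses
```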
Also, I suspect AMD's L1 cache is probably in the critical path. More
associativity = lower frequency.


Huh? Could you elaborate?

With a way-size which is <=page size, the cache tags can be looked up
without waiting for the TLB look-up result... i.e. the two can be done in
parallel, followed by the compare(s). I believe Sun had some patents to
mitigate this but I have no way to know if AMD adopted anything similar;
still, sequential operation must contribute to overall latency. Note:
Intel has always kept their way-size to 4KB or less.
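That way-size condition can be written down directly (Python sketch; the 4KB x86 default page size is assumed):

```python
# Tag lookup can proceed in parallel with the TLB only if the index
# bits fall entirely within the page offset, i.e. way size <= page size.
def can_index_untranslated(cache_bytes, ways, page_bytes=4096):
    way_size = cache_bytes // ways
    return way_size <= page_bytes

# Intel-style 32KB 8-way L1: 4KB way -> parallel lookup works.
print(can_index_untranslated(32 * 1024, 8))   # True
# K8-style 64KB 2-way L1: 32KB way -> must serialize (or work around it).
print(can_index_untranslated(64 * 1024, 2))   # False
# The same 64KB at 8-way with an 8KB default page would clear the bar too:
print(can_index_untranslated(64 * 1024, 8, page_bytes=8 * 1024))  # True
```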
 
George said:
Uhh, so yes you were presuming too much and missing my previous comments on
4x4. I'm not so negative as you - I believe it could find a place in
gaming, if the game programmers really get to grips with multi-core; the
small server mentioned by others is umm, not impossible.:-)

The small server really is not going to happen.
I think you mean MCP which is what Intel has for quad-core... not what is
normally known as MCM, which would be horribly expensive from what I hear.

MCP and MCM are similar. IBM has set the connotation for what MCM
means with their POWER4/5 packaging, but IIRC, the original P6 was
referred to as an MCM device. But frankly, it doesn't matter much, we
both know what we are talking about.
As for weird motherboards I've been intrigued by some of the *amazing*
(extravagant ?) results claimed by http://www.siliconpipe.com/ - it's just
err, remotely possible that their unnamed licensee, signed Aug 21, could be
AMD... which would be quite a revolution in high-speed interconnect: i.e.
stick a HT link on a foil ribbon and the mbrd is less "weird". Note that
Intel has been fiddling with this stuff too:
http://www.edn.com/article/CA6362694.html

I don't know exactly what sort of physical stuff they are signalling over.
IIRC though, Intel had some papers about the low level physicals that
were 20gbps at ISSCC. Just because they can clock that high doesn't
mean that they will xmit data at that rate.
That was my point: AIUI, the prefetch is initially done to the L2; if
there's a mispredict, e.g. on load re-order, it just sits there till it
goes stale and gets overwritten; with aggressive
prefetch/speculation/prediction the cache needs to be large to keep the
%age of poisoning down - below a certain size predictions get
self-defeating.

It depends what pattern is being prefetched. For streaming stuff, it
shouldn't be put in the L2, but in a streaming buffer.
I think "wasteful" covers "complicated & ugly".:-)


Hmm, I don't think write through L1 is a good idea -- c.f. P4 -- and is not
necessary for non-exclusive of course.

Write through means you don't need ECC, and it means you don't have to
probe that level of cache for coherency. Niagara I and II both use
write through, and there are very good reasons for it.
I believe C2D is L1 write back, non-exclusive L2, which is also write back,
much the same as P3/P-M was.
Yes:
http://realworldtech.com/page.cfm?ArticleID=RWT030906143144&p=7


128KB split as I+D is not that big, even compared with C2D which is 64KB
split I+D and is 8-way. In my analysis, and with some real programs, you
really want at least 3 ways to avoid collisions and churning for many
tasks. With only 2 ways and "perfectly aligned" arrays, you can get quite
a mess - programmers should not have to deal with this, though they often
get a kick out of working around it.:-)

Sure, I'm not going to argue that associativity is bad, but I think
that the limiter may be cycle time. There's not a chance in hell I'd
go for a 4 cycle L1 cache to get extra associativity.
With a way-size which is <=page size, the cache tags can be looked up
without waiting for the TLB look-up result... i.e. the two can be done in
parallel, followed by the compare(s).

Right, I vaguely remember that trick, although not the logic behind it.
I should probably sit down and read that section of H&P again.
I believe Sun had some patents to
mitigate this but I have no way to know if AMD adopted anything similar;
still, sequential operation must contribute to overall latency. Note:
Intel has always kept their way-size to 4KB or less.

Interesting, that's certainly true for L1 caches, and that makes sense.

DK
 
The small server really is not going to happen.


MCP and MCM are similar. IBM has set the connotation for what MCM
means with their POWER4/5 packaging, but IIRC, the original P6 was
referred to as an MCM device. But frankly, it doesn't matter much, we
both know what we are talking about.

No, they are not even close. I admit I don't know how far MCM goes back but
certainly DEC, when they were still called that, used a largish MCM in the
DECStation 5000 series. They've also been used in high-end audio, I
suspect to hide the cheap op-amps inside, but I'd think the authority on
this would be the mainframe guys.
I don't know exactly what sort of physical stuff they are signalling over.
IIRC though, Intel had some papers about the low level physicals that
were 20gbps at ISSCC. Just because they can clock that high doesn't
mean that they will xmit data at that rate.

Read the links - Intel's "experiment" was more than just a clock; the data
integrity has been proven at some level: prototype at least AIUI, and
further if Siliconpipe is to be believed and they have an undisclosed
licensee. The signalling is over flat ribbon cables, somewhat similar in
appearance to what you'll find in any laptop you crack open, from what I
understand, though there has to be a wee bit more to it.
It depends what pattern is being prefetched. For streaming stuff, it
shouldn't be put in the L2, but in a streaming buffer.

I don't program "streaming" - a special case to me:-); certainly for
predictive and speculative it'd be foolish to risk poisoning L1 and CPU
load buffers.
 