some Opteron / SMP questions

  • Thread starter: Dan Lenski

Dan Lenski

Hi all,
I have a few (probably naive) questions about Opteron processors and
SMP that I wasn't able to clarify elsewhere:

* Is there any actual hardware difference between 1xxx, 2xxx, and
8xxx? Or is it just forced lock-out... just like how CPU
manufacturers lock multipliers, or disable one core of a dual-core
CPU? For example, the Opteron 2212 Santa Rosa is $210 from Newegg,
while the 8212 is $529. Is there any extra logic in the 8212, or have
they just flipped some bit on the cheaper processor to prevent it from
running in a 4-way or 8-way server?

* Can I just drop in an AM2 Opteron as a replacement for an Athlon 64
X2 on my desktop? I do a lot of scientific number crunching and
compiling and I think the extra cache would help. Do any particular
AM2 chipsets work better with Opteron?

* And, just out of curiosity... I've heard that Opterons can actually
be used in a >8-way configuration. How does this work? Is there some
special chipset that allows it to exceed 8 processors?

Thanks for any answers!

Dan Lenski
 
* Can I just drop in an AM2 Opteron as a replacement for an Athlon 64
X2 on my desktop? I do a lot of scientific number crunching and
compiling and I think the extra cache would help. Do any particular
AM2 chipsets work better with Opteron?

What sort of scientific number crunching? A lot of HPC workloads
basically ignore the cache entirely...
* And, just out of curiosity... I've heard that Opterons can actually
be used in a >8-way configuration. How does this work? Is there some
special chipset that allows it to exceed 8 processors?

Not really. There is a company which has a chipset for large Opteron
systems, but it has not been productized. Also, Opteron systems with
8 sockets pretty much suck (performance wise). You cannot push past
4S effectively...the cost of snoop-invalidate coherency is just too
high right now.

DK
 
What sort of scientific number crunching? A lot of HPC workloads
basically ignore the cache entirely...

Matrix math, graphing, data fitting... I've seen a couple LINPACK
benchmarks where performance seems to drastically decrease if a matrix
is too big to fit in cache.
Not really. There is a company which has a chipset for large Opteron
systems, but it has not been productized. Also, Opteron systems with
8 sockets pretty much suck (performance wise). You cannot push past
4S effectively...the cost of snoop-invalidate coherency is just too
high right now.

Interesting! I poked around the web and could not actually find
anywhere that sells a mobo for more than 4-way Opteron. I guess the
special many-way chipset must provide some extra inter-processor
communications links, since the chips don't support it themselves.

Dan
 
Matrix math, graphing, data fitting... I've seen a couple LINPACK
benchmarks where performance seems to drastically decrease if a matrix
is too big to fit in cache.

What is your working set size? Caches only help when a significant
amount of your working set can be cached (note that while an app might
touch 10GB of data, it may only repeatedly touch 0.1% of that).
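
To put rough numbers on the working-set point, here is a minimal sketch that estimates the largest square double-precision matrix that fits in a given amount of cache. The function name and the cache sizes passed to it are illustrative assumptions, not measurements of any particular CPU, and real behavior also depends on associativity, access pattern, and whatever else is competing for the cache.

```python
import math

def max_square_matrix_dim(cache_bytes, matrices=1, elem_bytes=8):
    """Largest n such that `matrices` n-by-n arrays of elem_bytes-sized
    elements fit in cache_bytes (ignores all other cache contents)."""
    return math.isqrt(cache_bytes // (matrices * elem_bytes))

MIB = 1024 * 1024  # bytes in one MiB

# Hypothetical cache sizes, chosen only for illustration.
for label, size in [("512 KiB", MIB // 2), ("1 MiB", MIB)]:
    one = max_square_matrix_dim(size)
    three = max_square_matrix_dim(size, matrices=3)  # e.g. A, B and C in C = A*B
    print(f"{label}: one matrix up to n={one}, three matrices up to n={three}")
```

Under these assumptions a 1 MiB cache holds a single matrix of doubles up to about 362x362, which is why a benchmark can fall off a cliff once the matrix grows past a few hundred rows.
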
Interesting! I poked around the web and could not actually find
anywhere that sells a mobo for more than 4-way Opteron. I guess the
special many-way chipset must provide some extra inter-processor
communications links, since the chips don't support it themselves.

Nope, an 8S opteron is pretty much like a 4S. The major difference is
that they require multiple system boards (2-5) to work.

However, the performance on an 8S Opteron is pretty damn atrocious.
You get a 20-30% improvement in performance for 2x the
processors...according to most benchmarks.
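
As a quick sanity check on what that means, the arithmetic can be sketched like this. The function name is made up for illustration, and "ideal" here simply assumes performance would double with twice the sockets.

```python
def scaling_efficiency(speedup, processor_ratio):
    """Fraction of ideal (linear) scaling actually achieved."""
    return speedup / processor_ratio

# 20-30% more performance from 2x the processors (4S -> 8S).
for gain in (0.20, 0.30):
    eff = scaling_efficiency(1.0 + gain, 2.0)
    print(f"{gain:.0%} faster on 2x the sockets -> {eff:.0%} of ideal scaling")
```

In other words, under that assumption the 8S box delivers only about 60-65% of the performance that linear scaling from 4S would give.
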

DK
 
What is your working set size? Caches only help when a significant
amount of your working set can be cached (note that while an app might
touch 10GB of data, it may only repeatedly touch 0.1% of that).

Typically 1-10MB, and usually the same data is processed heavily over and
over. But in large part I guess I am just wondering about what the
advantages of replacing Athlon 64 with Opteron might be :-)
Nope, an 8S opteron is pretty much like a 4S. The major difference is
that they require multiple system boards (2-5) to work.

But presumably that many-way Opteron chipset (>8) *does* require some sort
of special communications infrastructure beyond what the CPUs themselves
provide.
However, the performance on an 8S Opteron is pretty damn atrocious. You
get a 20-30% improvement in performance for 2x the
processors...according to most benchmarks.

Wow, that is bad. So it's a cache coherency problem... does it depend on
the workload at all? I mean, if it's extremely parallelized, such as
matrix multiplication or something, will the performance degradation still
occur?

Dan
 
Typically 1-10MB, and usually the same data is processed heavily over and
over. But in large part I guess I am just wondering about what the
advantages of replacing Athlon 64 with Opteron might be :-)

So an Opteron has ~1.16MB of cache, the Athlon64 probably has around
0.6MB. That could be a big difference - I'd buy two systems and
benchmark them.
But presumably that many-way Opteron chipset (>8) *does* require some sort
of special communications infrastructure beyond what the CPUs themselves
provide.

Absolutely. Opterons only use 3 bits to identify the nodeID, so you
cannot go beyond 8 MPUs without a chipset. You also need something to
keep coherency overhead under control.
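
The 3-bit limit is just powers of two; a tiny sketch (the helper names are hypothetical, not anything from AMD's documentation):

```python
def max_nodes(node_id_bits):
    """Number of distinct node IDs that fit in the given number of bits."""
    return 2 ** node_id_bits

def bits_needed(nodes):
    """Minimum number of ID bits needed to give each node a unique ID."""
    return max(1, (nodes - 1).bit_length())

print(max_nodes(3))     # IDs available with a 3-bit nodeID field
print(bits_needed(8))   # bits needed for 8 nodes
print(bits_needed(9))   # bits needed once you add a ninth node
```

So a ninth directly addressed processor would already need a fourth ID bit, which is why anything bigger has to hide the extra processors behind a chipset.
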
Wow, that is bad. So it's a cache coherency problem... does it depend on
the workload at all?

Yes, it depends on the number of memory accesses. Generally, cache
coherency traffic increases like the square of the number of
processors.
I mean, if it's extremely parallelized, such as
matrix multiplication or something, will the performance degradation still
occur?

Each memory access forces the processor to snoop all the caches in the
system.
A 1S system generates 1 coherency message
A 2S system generates 2-3 coherency messages
A 4S system generates 7-10 coherency messages

I haven't explicitly calculated out how an 8S system works, because
there are a variety of topologies. However, you're probably looking
at a minimum of 22 and quite possibly more messages. Things get ugly
for systems that aren't fully connected...
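
To see why the total traffic grows roughly like the square of the processor count, here is a deliberately simplified model: every memory request counts as one message, plus one probe to and one response from every other node, and all nodes are assumed to be fully connected and to issue requests at the same rate. This is an assumption-laden sketch, not the actual Opteron protocol, and it ignores the extra hops needed when the nodes are not fully connected.

```python
def messages_per_request(nodes):
    """One request plus a probe to, and a response from, every other node
    (simplified broadcast-snoop model, fully connected)."""
    others = nodes - 1
    return 1 + 2 * others

def system_messages(nodes, requests_per_node=1):
    """Total messages when every node issues the same number of requests."""
    return nodes * requests_per_node * messages_per_request(nodes)

for n in (1, 2, 4, 8):
    print(f"{n}S: {messages_per_request(n)} per request, "
          f"{system_messages(n)} system-wide")
```

The per-request counts this gives for 1, 2 and 4 sockets land within the ranges listed above, though that is no claim about how those figures were derived. The system-wide total is n * (2n - 1), which is the N-squared growth, and the 8S figure here is a floor because real 8S topologies force some messages through intermediate nodes.
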

DK
 
Dan said:
Hi all,
I have a few (probably naive) questions about Opteron processors and
SMP that I wasn't able to clarify elsewhere:

* Is there any actual hardware difference between 1xxx, 2xxx, and
8xxx? Or is it just forced lock-out... just like how CPU
manufacturers lock multipliers, or disable one core of a dual-core
CPU? For example, the Opteron 2212 Santa Rosa is $210 from Newegg,
while the 8212 is $529. Is there any extra logic in the 8212, or have
they just flipped some bit on the cheaper processor to prevent it from
running in a 4-way or 8-way server?

There is a physical packaging difference between the 1xxx Opterons and
all of the other Opterons. Specifically, the 1xxx Opterons use the AM2
socket, while the rest use Socket F.

As for a difference between the 2xxx and 8xxx Opterons? They use the
exact same socket, but I think the 2xxx has specifically disabled the
circuitry that would allow more than one of its HyperTransport links to be
a "coherent HyperTransport" link. The coherency protocols are needed for SMP.
* Can I just drop in an AM2 Opteron as a replacement for an Athlon 64
X2 on my desktop? I do a lot of scientific number crunching and
compiling and I think the extra cache would help. Do any particular
AM2 chipsets work better with Opteron?

Yeah, the AM2 Opterons will work in any desktop board. The AM2 Opterons
are actually designed to compete in the lucrative "desktop as a server"
market segment. Intel OEMs like Dell used to sell Celeron systems as
servers at one time.
* And, just out of curiosity... I've heard that Opterons can actually
be used in a >8-way configuration. How does this work? Is there some
special chipset that allows it to exceed 8 processors?

Yeah, you need a chipset that can act as a bridge between multiple
poly-Opteron mainboards. The mainboards are isolated from each other by
the bridge chipset, so Opterons on one mainboard don't know there are
other Opterons on other mainboards; they just see the bridge interface,
and that's it.

Cray does it this way with its Opteron-based XT3/4 supercomputers. Sun
is also going to do it this way in its Constellation blade cluster
supercomputers. The main difference between Cray's way and Sun's way is
that Cray will bridge a single Opteron to all other Opterons, whereas
Sun will let up to 4 Opterons talk to each other directly, and bridge
anything above that.

Also in Opteron 8xxx systems, the most common number of sockets is 4, not
8. Very few Opteron 8xxx's are ever used in 8-way; they are mostly used
as 4-way. That's due to the fact that there are only 3 coherent HT links
per Opteron 8xxx CPU. With 3 links, you can only directly connect
up to 4 processors. If you want 8 processors wired together directly,
then they have to communicate over one extra hop in some cases, which
adds latency.

Doing 8-way with a bridge chipset instead of direct connection might be
better, as the bridge will filter out some of the traffic, and it might
cache some stuff in between.
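
The link-count argument can be checked with a small graph sketch. It compares a fully connected 4-node layout with a hypothetical 8-node cube layout, where each node also uses exactly 3 links. The cube is only an illustrative assumption: real 8-socket boards use other wirings and also have to spare links for I/O, so their hop counts can differ.

```python
from collections import deque

def fully_connected(n):
    """Every node has a direct link to every other node."""
    return {i: [j for j in range(n) if j != i] for i in range(n)}

def cube():
    """Hypothetical 8-node layout: node i links to the 3 nodes whose
    index differs from i in exactly one bit."""
    return {i: [i ^ (1 << b) for b in range(3)] for i in range(8)}

def max_links(graph):
    """Largest number of links used by any single node."""
    return max(len(neighbours) for neighbours in graph.values())

def max_hops(graph):
    """Worst-case hop count between any two nodes (breadth-first search)."""
    worst = 0
    for start in graph:
        dist = {start: 0}
        queue = deque([start])
        while queue:
            node = queue.popleft()
            for nxt in graph[node]:
                if nxt not in dist:
                    dist[nxt] = dist[node] + 1
                    queue.append(nxt)
        worst = max(worst, max(dist.values()))
    return worst

for name, g in [("4S fully connected", fully_connected(4)),
                ("8S cube (assumed)", cube()),
                ("8S fully connected", fully_connected(8))]:
    print(f"{name}: {max_links(g)} links per node, worst case {max_hops(g)} hop(s)")
```

With 3 links per node, 4 sockets can all be one hop apart, while a fully connected 8-socket system would need 7 links per node. Stay within 3 links at 8 sockets and some pairs end up several hops apart (3 in this assumed cube).
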

Yousuf Khan
 
Yousuf Khan said:
The AM2 Opterons are actually designed to compete in the lucrative
"desktop as a server" market segment. Intel OEMs like Dell used to sell
Celeron systems as servers at one time.

Did? Still do. PowerEdge SC440 w/ Celeron 336 (2.8 GHz, so must be
NetBurst), still on sale as of today 7/10/2007, although I don't see why
you'd pay $384 (small business pricing) for it when you can get a P-D 925
and 4x the disk (2x 160 GB instead of 1x 80 GB).
 
There is a physical packaging difference between the 1xxx Opterons and
all of the other Opterons. Specifically, the 1xxx Opterons use the AM2
socket, while the rest use Socket F.

As for a difference between the 2xxx and 8xxx Opterons? They use the
exact same socket, but I think the 2xxx has specifically disabled the
circuitry that would allow more than one of its HyperTransport links to be
a "coherent HyperTransport" link. The coherency protocols are needed for SMP.

Ah, very interesting! So the 2xxx/8xxx difference is basically just a
matter of good ol' retail price discrimination.
Yeah, the AM2 Opterons will work in any desktop board. The AM2 Opterons
are actually designed to compete in the lucrative "desktop as a server"
market segment. Intel OEMs like Dell used to sell Celeron systems as
servers at one time.

Cool, that's a pretty excellent option then I guess. I suppose I
could also put a socket 939 Opteron into an older socket 939 desktop
system.
Yeah, you need a chipset that can act as a bridge between multiple
poly-Opteron mainboards. The mainboards are isolated from each other by
the bridge chipset, so Opterons on one mainboard don't know there are
other Opterons on other mainboards; they just see the bridge interface,
and that's it.

Cray does it this way with its Opteron-based XT3/4 supercomputers. Sun
is also going to do it this way in its Constellation blade cluster
supercomputers. The main difference between Cray's way and Sun's way is
that Cray will bridge a single Opteron to all other Opterons, whereas
Sun will let up to 4 Opterons talk to each other directly, and bridge
anything above that.

Also in Opteron 8xxx systems, the most common number of sockets is 4, not
8. Very few Opteron 8xxx's are ever used in 8-way; they are mostly used
as 4-way. That's due to the fact that there are only 3 coherent HT links
per Opteron 8xxx CPU. With 3 links, you can only directly connect
up to 4 processors. If you want 8 processors wired together directly,
then they have to communicate over one extra hop in some cases, which
adds latency.

Doing 8-way with a bridge chipset instead of direct connection might be
better, as the bridge will filter out some of the traffic, and it might
cache some stuff in between.

Gotcha! That's exactly what I was trying to understand. Thanks
Yousuf,

Dan
 
Nate said:
Did? Still do. PowerEdge SC440 w/ Celeron 336 (2.8 GHz, so must be
NetBurst), still on sale as of today 7/10/2007, although I don't see why
you'd pay $384 (small business pricing) for it when you can get a P-D 925
and 4x the disk (2x 160 GB instead of 1x 80 GB).

I assume that extra price comes with a better business-style warranty.

Yousuf Khan
 
Dan said:
Ah, very interesting! So the 2xxx/8xxx difference is basically just a
matter of good ol' retail price discrimination.

Pretty much. And if they do have the coherency circuitry specifically
disabled, that would be the technical reason why you can't simply use a
2xxx series Opteron in more than 2-way setups.

Cool, that's a pretty excellent option then I guess. I suppose I
could also put a socket 939 Opteron into an older socket 939 desktop
system.

That would pretty much be your only option left for Socket 939. Finding
a new processor for those motherboards is like finding hen's teeth,
despite the fact that there are so many of those motherboards already
installed. AMD is not producing any X2's in that format anymore, so
Opterons are the only things still available. I gather AMD needs to keep
producing Opterons for some time due to commercial spare-parts contracts.

Yousuf Khan
 
Yousuf Khan said:
I assume that extra price comes with a better business-style warranty.

Same machine, AFAICT, same warranty. Both from Dell, same web page; just
different promos.

The current small business pricing is not nearly as good for the Pentium D
model, but shows the same warranties (and the same price on the basic Celeron
model) - 1Yr BASIC SUPPORT: 5x10 HW-Only, 5x10 NBD Onsite
 