HP's 2-way Opteron server

Yousuf Khan · Feb 24, 2004

HP ProLiant DL145 looks somewhat similar to the IBM eServer 325, and Sun
Sunfire v20z, and Appro, and Newisys, and ... .

http://h18000.www1.hp.com/products/quickspecs/11910_div/11910_div.HTML

Yousuf Khan

Yousuf Khan · Feb 24, 2004

Oops, nearly forgot about the 4-way server too.

http://h18004.www1.hp.com/products/servers/proliantdl585/index.html

Yousuf Khan

Tony Hill · Feb 24, 2004

HP ProLiant DL145 looks somewhat similar to the IBM eServer 325, and Sun
Sunfire v20z, and Appro, and Newisys, and ... .

http://h18000.www1.hp.com/products/quickspecs/11910_div/11910_div.HTML

Seems similar only in that they're all 1U dual processor Opteron
servers. There just aren't that many differences possible in such a
design! However, the actual setup does look somewhat unique, i.e. it's
not exactly identical to someone else's box (for comparison, Sun's 1U
Opteron box is exactly the same as a Newisys box, IBM's is exactly the
same as an MSI box, and Appro seems to sell one of everyone's setup).

Specs on the HP machines look a tiny bit more limited than some of the
others though. The IBM system has hot-swappable drive bays and 2 PCI-X
slots as compared to HP's non-hot-swap drive bays and only a single
PCI-X slot. The Newisys/Sun machines not only have the two PCI-X
slots and hot-swap drive bays but also their built-in system
management processor. Of course, price seems to reflect these
differences, with the HP systems being a reasonable amount cheaper
than the IBM systems, while Sun is the most expensive of the lot.

Gavin Scott · Feb 25, 2004

I find it interesting that HP chose to develop these AMD-based
systems when they probably knew that Intel planned to announce
their own AMD64-compatible CPUs in the immediate future. What,
if anything, does this say about the current state of the grand
HP/Intel alliance and IPF?

G.

The little lost angel · Feb 26, 2004

I find it interesting that HP chose to develop these AMD-based
systems when they probably knew that Intel planned to announce
their own AMD64-compatible CPUs in the immediate future. What,
if anything, does this say about the current state of the grand
HP/Intel alliance and IPF?

Maybe purely practical reasons? After all, Intel did say they aren't
going to enable that 64-bitness for quite a while yet. It makes sense
for HP to sell Opterons for half a year before switching to Prescott,
or offering them side by side.
--
L.Angel: I'm looking for web design work.
If you need basic to med complexity webpages at affordable rates, email me

Standard HTML, SHTML, MySQL + PHP or ASP, Javascript.
If you really want, FrontPage & DreamWeaver too.
But keep in mind you pay extra bandwidth for their bloated code

Adam Warner · Feb 26, 2004

Hi Gavin Scott,

I find it interesting that HP chose to develop these AMD-based systems
when they probably knew that Intel planned to announce their own
AMD64-compatible CPUs in the immediate future. What, if anything, does
this say about the current state of the grand HP/Intel alliance and IPF?

It can't be helping that HP's own benchmarketing shows their single CPU
AMD Opteron server is faster than their dual Intel Xeon server:
<http://h18004.www1.hp.com/products/servers/benchmarks/dl145-webbench.pdf>

Regards,
Adam

Rob Stow · Feb 26, 2004

Adam said:
Hi Gavin Scott,

It can't be helping that HP's own benchmarketing shows their single CPU
AMD Opteron server is faster than their dual Intel Xeon server:
<http://h18004.www1.hp.com/products/servers/benchmarks/dl145-webbench.pdf>

Even being the Opty fanboy that I am, when someone publishes
benchmarks showing a single Opty 248 beating a 3.2 GHz
Xeon dualie the first word that comes to mind is "bullsh*t".

Sure, you can undoubtedly find some meaningless test where
you will get results like that, but otherwise those kinds
of results need to be treated with a healthy dose of skepticism.

Tony Hill · Feb 27, 2004

Even being the Opty fanboy that I am, when someone publishes
benchmarks showing a single Opty 248 beating a 3.2 GHz
Xeon dualie the first word that comes to mind is "bullsh*t".

Sure, you can undoubtedly find some meaningless test where
you will get results like that, but otherwise those kinds
of results need to be treated with a healthy dose of skepticism.

It's not so much a meaningless test as one that doesn't scale very
well to multiple processors, particularly with the Xeon's shared memory
architecture. The Xeon system only saw a 28% performance gain going
from one to two processors. Also, it shouldn't come as that big of a
surprise that the Opteron is strong here, the chip has consistently
outpaced the Xeon by large margins in basically all web server testing
I've seen. The combination of the integrated memory controller and
tons of low-latency/high-bandwidth I/O from hypertransport seems to be
a real winning combination for this sort of work.
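
To put the 28% figure in perspective, here is a minimal sketch that turns a measured speedup into a scaling efficiency and into the parallel fraction implied by Amdahl's law. Amdahl's law is an assumption brought in for illustration; nothing in the HP paper says WebBench follows it, and the function names are hypothetical.

```python
def scaling_efficiency(speedup: float, n_cpus: int) -> float:
    """Fraction of ideal linear scaling actually achieved."""
    return speedup / n_cpus


def amdahl_parallel_fraction(speedup: float, n_cpus: int) -> float:
    """Parallel fraction p implied by Amdahl's law, S = 1 / ((1 - p) + p / n)."""
    return (1.0 - 1.0 / speedup) / (1.0 - 1.0 / n_cpus)


# 28% gain going from one Xeon to two, as quoted above.
xeon_speedup = 1.28
print(f"scaling efficiency: {scaling_efficiency(xeon_speedup, 2):.2f}")
print(f"implied parallel fraction: {amdahl_parallel_fraction(xeon_speedup, 2):.4f}")
```

Under that assumption, less than half of the work behaves as if it ran in parallel on the dual Xeon, which is consistent with a shared resource being the limit rather than the CPUs themselves.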

In any case, you can find more info about WebBench here:

http://www.etestinglabs.com/benchmarks/webbench/default.asp

It's a bit of an all-encompassing web server test. HP doesn't break
down the individual client scores, just lists the two overall scores.
It's possible that the Opteron just did REALLY well in one test and
that was enough to push its scores up overall.

Mitch Alsup · Feb 27, 2004

Rob Stow said:
Adam Warner wrote:
Even being the Opty fanboy that I am, when someone publishes
benchmarks showing a single Opty 248 beating a 3.2 GHz
Xeon dualie the first word that comes to mind is "bullsh*t".

There is this thing called memory latency. Opteron has a lot
less of it with the on-chip memory controller than Xeon does
with the frontside bus.

Mitch

Rob Stow · Feb 27, 2004

Mitch said:
There is this thing called memory latency. Opteron has a lot
less of it with the on-chip memory controller than Xeon does
with the frontside bus.

I'm well aware of that. However, that conveys an advantage
that typically lets an Opty dualie beat out a Xeon dualie that
has a 50% higher cpu clock. This particular benchmark had
a single 2.2 GHz Opty beating - by a huge margin - a 3.2 GHz
Xeon dualie. I suspect the result was reported incorrectly -
it probably should have been a dualie vs dualie result.

Patrick Schaaf · Feb 27, 2004

[HP DL145 vs. DL140 webbench 5.0 results]

This particular benchmark had a single 2.2 GHz Opty beating - by a huge
margin - a 3.2 GHz Xeon dualie. I suspect the result was reported
incorrectly - it probably should have been a dualie vs dualie result.

For both systems, single and dual processor results were reported
and contrasted, so your "should probably have been" is probably
a bit unfounded.

Anybody here who is familiar with webbench 5.0, who could comment
on the relative importance of better memory controller, better
system interconnect, larger L1 caches, and/or double the L2 cache?

best regards
Patrick

Bernd Paysan · Feb 27, 2004

Rob said:
I'm well aware of that. However, that conveys an advantage
that typically lets an Opty dualie beat out a Xeon dualie that
has a 50% higher cpu clock. This particular benchmark had
a single 2.2 GHz Opty beating - by a huge margin - a 3.2 GHz
Xeon dualie. I suspect the result was reported incorrectly -
it probably should have been a dualie vs dualie result.

Why not? A dual Xeon has a single shared bus to the chipset. If you run two
memory-latency dependent programs on both Xeons, they'll go through the
same bottleneck; typically, you expect the same total throughput as with a
single Xeon running just one program. Now, on the Opty (nice nick ;-), you
have half the latency, and no shared bus, so a single Opty should get
double performance, and a double Opty should get four times (almost;
there's the round trip from the cache coherency).
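
The argument above can be written down as a toy model, assuming the work is purely latency bound and each independent memory path serves one outstanding miss at a time. Latencies are normalized to the Xeon; the "half the latency" ratio comes from the paragraph above, while the function name and the one-miss-per-path simplification are assumptions made for illustration.

```python
def relative_throughput(memory_paths: int, latency: float) -> float:
    """Throughput of latency-bound work: one outstanding miss per memory path."""
    return memory_paths / latency


XEON_LATENCY = 1.0  # normalized baseline
OPTY_LATENCY = 0.5  # "half the latency", per the paragraph above

configs = {
    "1x Xeon": relative_throughput(1, XEON_LATENCY),
    "2x Xeon": relative_throughput(1, XEON_LATENCY),  # shared bus: still one path
    "1x Opty": relative_throughput(1, OPTY_LATENCY),
    "2x Opty": relative_throughput(2, OPTY_LATENCY),  # one controller per CPU
}

for name, value in configs.items():
    print(f"{name}: {value:.1f}x")
```

The model ignores the cache coherency round trip mentioned above, so the dual Opteron figure is an upper bound rather than a prediction.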

We've got an Athlon 64 recently, and tried some benchmarks. With my own CPU
intensive microbenchmarks, the Athlon 64 is clock-by-clock as fast as the
old Athlon; nothing gained. However, with our applications (EDA CAD, e.g.
synthesis), there's a factor two difference. The most stunning experience
however is KDE 3.1. It's really fast on the Athlon 64, you barely notice
program startup time (it feels definitely faster than KDE 3.2 on an Athlon
XP, though the KDE people did tune a lot there). KDE program starting
definitely is a memory intensive job, latency bound (linking lots of shared
C++ libraries together). Also, starting up Cadence design framework
(exactly the same workload) was a lot faster than anywhere else.

I think the latency problem is a real one. You won't see it on SPEC, since
really very few SPEC programs are memory latency bound (and if they are,
people will hack the compiler to remove that). Real bloatware (and we have
to use real bloatware everyday, unfortunately) however is.

Tony Hill · Feb 28, 2004

I'm well aware of that. However, that conveys an advantage
that typically lets an Opty dualie beat out a Xeon dualie that
has a 50% higher cpu clock. This particular benchmark had
a single 2.2 GHz Opty beating - by a huge margin - a 3.2 GHz

That "huge margin" is only about 15%. Extremely impressive, but not
all that huge. This shouldn't be that big of a surprise though, the
Opteron has been pretty much destroying the Xeon in every web server
benchmark out there. Ace's did a fairly extensive set of tests here:

http://www.aceshardware.com/read.jsp?id=60000275

Here are a couple others from the past few months:

http://www.pcmag.com/article2/0,4149,1061586,00.asp
http://www.infoworld.com/article/03/08/01/30FE64linux_1.html

If you look at the SPECweb and SPECwebSSL results it becomes obvious
that the Opteron is not only pretty much owning the Xeon, but in fact
it tends to beat out all comers on a chip for chip basis. The only
processor that is competitive with the Opteron is the IBM Power4+, and
even here the 2.2GHz Opteron 848 chips in 4P configurations are faster
than the 1.7GHz Power4+ 4P setups. Note that software plays a big
role here, so an accurate comparison is a bit tough.

Xeon dualie. I suspect the result was reported incorrectly -
it probably should have been a dualie vs dualie result.

Err, then what in the hell was the second, higher set of Opteron
results for? The results are reported VERY clearly and they make
perfectly good sense.

The Opteron is simply THE chip to have for web serving, this HP test
is just the latest indication of that.

Mitch Alsup · Mar 1, 2004

Rob Stow said:
I'm well aware of that. However, that conveys an advantage
that typically lets an Opty dualie beat out a Xeon dualie that
has a 50% higher cpu clock.

It's not the CPU speed that makes or breaks this application,
it's memory latency. P4 could be running at 40X Opteron speed and still
suffer the same defeat. It's memory latency, not CPU speed, here.
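
A rough way to see this is a model where time per request is compute time plus memory stall time, with no overlap between misses. The instruction count, IPC, miss count, and the 120 ns P4 latency below are all hypothetical values picked for illustration; only the 2.2 GHz and 3.2 GHz clocks and the 60 ns single-Opteron latency come from this thread.

```python
def time_per_request_ns(instructions: float, ipc: float, clock_ghz: float,
                        misses: float, miss_latency_ns: float) -> float:
    """Compute time plus memory stall time, assuming misses never overlap."""
    compute_ns = instructions / (ipc * clock_ghz)  # cycles / (cycles per ns)
    stall_ns = misses * miss_latency_ns
    return compute_ns + stall_ns


# Hypothetical workload: these counts are illustrative, not measured.
INSTRUCTIONS = 100_000
IPC = 1.0
MISSES = 2_000

OPTERON_LATENCY_NS = 60.0  # single-CPU Opteron figure quoted in this thread
P4_LATENCY_NS = 120.0      # assumed value for a front-side-bus system

opteron = time_per_request_ns(INSTRUCTIONS, IPC, 2.2, MISSES, OPTERON_LATENCY_NS)
p4 = time_per_request_ns(INSTRUCTIONS, IPC, 3.2, MISSES, P4_LATENCY_NS)
p4_40x = time_per_request_ns(INSTRUCTIONS, IPC, 40 * 2.2, MISSES, P4_LATENCY_NS)

print(f"Opteron 2.2 GHz: {opteron:,.0f} ns")
print(f"P4 3.2 GHz:      {p4:,.0f} ns")
print(f"P4 at 40x clock: {p4_40x:,.0f} ns")
```

With these assumed numbers the stall term dominates, so raising the clock shrinks only the small compute term and the higher-latency chip still loses. A workload with fewer misses per instruction would behave differently.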

This particular benchmark had
a single 2.2 GHz Opty beating - by a huge margin - a 3.2 GHz
Xeon dualie. I suspect the result was reported incorrectly -
it probably should have been a dualie vs dualie result.

I should point out that when the Opty is running in single CPU
mode, it gets better memory latency because it does not have to
wait for coherency to check in before consuming data. This
improves single CPU memory latency, much the way parking on a bus
improves front side bus latency. In the single CPU configuration
Opty is getting around 60 ns latency to main memory measured at
the L2.

In dual processing mode, coherence traffic slows memory into the
100 ns range. However, with 2 memory controllers, there is twice
the memory throughput, of which Opty takes good advantage.

Mitch

Rob Stow · Mar 1, 2004

Mitch said:
I should point out that when the Opty is running in single CPU
mode, it gets better memory latency because it does not have to
wait for coherency to check in before consuming data. This
improves single CPU memory latency, much the way parking on a bus
improves front side bus latency. In the single CPU configuration
Opty is getting around 60 ns latency to main memory measured at
the L2.

In dual processing mode, coherence traffic slows memory into the
100 ns range. However, with 2 memory controllers, there is twice
the memory throughput, of which Opty takes good advantage.

I thought that was only supposed to be a significant factor
when one Opty accessed memory "attached" to another Opty ?
(Or in the case of the secondary Opty in a dualie
with the RAM all attached only to the primary Opty.)

Robert Klute · Mar 2, 2004

I thought that was only supposed to be a significant factor
when one Opty accessed memory "attached" to another Opty ?
(Or in the case of the secondary Opty in a dualie
with the RAM all attached only to the primary Opty.)

Single processor latency is 80 nsec.
Dual processor local access is 100 nsec.
Dual processor remote access is 115 nsec.

Even if the memory is local, when you go to a multi processor
configuration it has to do the snoop for stale cache.
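
As a quick worked example, the three figures above give an average dual-processor latency once you pick the share of accesses that go to the other CPU's memory. The `remote_fraction` parameter is hypothetical; a value of 0.5 would correspond to memory spread evenly across both CPUs with no NUMA-aware placement, which is an assumption and not a measurement.

```python
SINGLE_NS = 80.0        # single processor
DUAL_LOCAL_NS = 100.0   # dual processor, local access
DUAL_REMOTE_NS = 115.0  # dual processor, remote access


def average_dual_latency_ns(remote_fraction: float) -> float:
    """Weighted average of local and remote latency in a dual-CPU system."""
    if not 0.0 <= remote_fraction <= 1.0:
        raise ValueError("remote_fraction must be between 0 and 1")
    return (1.0 - remote_fraction) * DUAL_LOCAL_NS + remote_fraction * DUAL_REMOTE_NS


for fraction in (0.0, 0.25, 0.5, 1.0):
    avg = average_dual_latency_ns(fraction)
    print(f"remote {fraction:.0%}: {avg:.2f} ns ({avg - SINGLE_NS:+.2f} ns vs single)")
```

Even with every access kept local, the dual configuration sits 20 ns above the single-processor figure, so the snoop cost is larger than the local/remote difference.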

Tony Hill · Mar 2, 2004

I should point out that when the Opty is running in single CPU
mode, it gets better memory latency because it does not have to
wait for coherency to check in before consuming data. This
improves single CPU memory latency, much the way parking on a bus
improves front side bus latency. In the single CPU configuration
Opty is getting around 60 ns latency to main memory measured at
the L2.

Err, no. The coherency checks and memory access are done
concurrently. Since the cache checks are WAY faster than a read from
main memory they always return first. If a remote cache has a newer
copy of the data, it is used and the memory read is canceled. If not,
the memory read continues as normal. There might be an extra ns or
two of latency, but nothing significant.

What IS significant is that without any NUMA optimizations a
dual-processor Opteron system accesses 50% of its data from a remote
memory controller. Even with NUMA optimizations this number is still
pretty high. AMD estimates the latency penalty for remote memory as
being 35ns for one hop and another 40ns for two hops.

Tony

Andi Kleen · Mar 2, 2004

Tony Hill said:
Err, no. The coherency checks and memory access are done
concurrently. Since the cache checks are WAY faster than a read from
main memory they always return first. If a remote cache has a newer
copy of the data, it is used and the memory read is canceled. If not,
the memory read continues as normal. There might be an extra ns or
two of latency, but nothing significant.

Mitch's description was correct as far as I know. The differences in
base memory latency between 1,2,4 way opteron are easily measurable
using standard tools (lmbench). They are also clearly documented in
numerous AMD presentations.

I guess it does not overlap the cache coherency with the memory access
to not stress the memory controller with unnecessary reads. This makes
sense since cache hits normally occur much more frequently than cache
misses. This gives a 4 CPU Opteron system effectively a 4MB multi
level cache in front of the memory controller. I suspect a memory
access that has already reached a DIMM is hard to cancel.
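
The two descriptions in this subthread predict different numbers, which a small sketch makes concrete. It assumes the 80 ns single-processor and 100 ns dual-processor local figures posted earlier in the thread, treats the single-processor latency as the plain DRAM access time, and uses hypothetical function names; it is a toy model, not a description of the actual hardware.

```python
def overlapped_latency_ns(dram_ns: float, probe_ns: float) -> float:
    """Probe and DRAM read issued together: the slower of the two sets latency."""
    return max(dram_ns, probe_ns)


def serialized_latency_ns(dram_ns: float, probe_ns: float) -> float:
    """Probe must complete before the DRAM read starts: the times add up."""
    return probe_ns + dram_ns


DRAM_NS = 80.0         # single-processor latency, taken as the DRAM access time
DUAL_LOCAL_NS = 100.0  # dual-processor local latency

# Probe time implied by the serialized model.
implied_probe_ns = DUAL_LOCAL_NS - DRAM_NS

print(f"implied probe time: {implied_probe_ns:.0f} ns")
print(f"overlapped model:   {overlapped_latency_ns(DRAM_NS, implied_probe_ns):.0f} ns")
print(f"serialized model:   {serialized_latency_ns(DRAM_NS, implied_probe_ns):.0f} ns")
```

With a probe faster than DRAM, the overlapped model gives no increase over the single-processor figure, while the serialized model reproduces the reported 100 ns by construction. That only shows which model fits those particular numbers; it does not settle what the silicon does.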

-Andi

The little lost angel · Mar 2, 2004

misses. This gives a 4 CPU Opteron system effectively a 4MB multi
level cache in front of the memory controller. I suspect a memory
access that has already reached a DIMM is hard to cancel.

Hmm, I'm no chip engineer, but couldn't the Opteron just go onto the
next step with the cache read data and ignore the results (and
therefore the delay) from the DIMM read?

--
L.Angel: I'm looking for web design work.
If you need basic to med complexity webpages at affordable rates, email me

Standard HTML, SHTML, MySQL + PHP or ASP, Javascript.
If you really want, FrontPage & DreamWeaver too.
But keep in mind you pay extra bandwidth for their bloated code

Rick Jones · Mar 2, 2004

In comp.arch Tony Hill said:
If you look at the SPECweb and SPECwebSSL results it becomes obvious
that the Opteron is not only pretty much owning the Xeon, but in
fact it tends to beat out all comers on a chip for chip basis. The
only processor that is competitive with the Opteron is the IBM
Power4+, and even here the 2.2GHz Opteron 848 chips in 4P
configurations are faster than the 1.7GHz Power4+ 4P setups. Note
that software plays a big role here, so an accurate comparison is a
bit tough.

Not meaning anything but to point-out other SPECweb99* considerations:

*) For SPECweb99_SSL, use of a hardware crypto accelerator also makes
comparisons difficult.

*) For SPECweb99, use of large send (NIC TCP segmentation, what Linux
I believe calls TSO; hope that doesn't start another offload
discussion) makes comparison more of a challenge. It may or may not
matter much for SPECweb99_SSL - the additional presence of hardware
crypto likely affects that.

I believe that for the PCI-X GbE NICs under AIX, large send is enabled
by default. Any crypto accelerator HW would be listed in a disclosure.

At some point, what a vendor calls a 'CPU' probably makes life
interesting as well. Just as threading already does to an extent.

rick jones
