IBM Hurricane chipset leads x86 tpmc 4-way

The claimed reductions in latency, "from 265 nanoseconds to 108
nanoseconds," are hard to argue away. If that's an accurate measure
of relative latency and not a hyped-up marketing claim, it will have a
big impact.

If it's remotely accurate, then yes. Unfortunately they really didn't
provide any context for this. Is this in comparison to the previous
IBM chipset? And is this just straight latency to memory for a single
chip on a single access or some sort of average? If it's just
straight latency, then the original 265ns number was pretty weak to
begin with; Intel's latest desktop chipsets are down under 100ns and
their servers should be somewhere around 130-150ns (though I haven't
seen many tests for the latter).
"One and only?" Not likely. Most significant? From looking at the
effects of latency on tpc-c in other situations, I'd bet it is. To
see what Intel has as a counter we will, indeed, have to wait and see.

64-bit support should offer about a 10% improvement all on its own.
Combine that with a 20% increase in clock speed and a 66% faster
system bus... Also, if I understand the whole "dual bus" idea
properly (ie 2 buses with 2 CPUs connected to each one in a 4P system
vs. 4 processors on a single bus in a traditional Xeon system) I think
this could make up for a lot of the difference as well. This is
exactly how Intel's new E8500 chipset's "dual bus" design operates as
well.

A nice chipset to be sure, but I think people are singing the praises
a bit too much and too soon. My guess is that it's only going to end
up being no more than 5% faster than Intel's new E8500 chipset. Sure,
it's a hell of a lot faster than the previous generation of
chipset/processor combination, but it's how this compares to current
chipset/processor combos that matters. IBM just happened to be first
out the door with benchmarks this time around, but I expect others to
follow suit soon enough.
 
If it's remotely accurate, then yes. Unfortunately they really didn't
provide any context for this. Is this in comparison to the previous
IBM chipset? And is this just straight latency to memory for a single
chip on a single access or some sort of average? If it's just
straight latency, then the original 265ns number was pretty weak to
begin with; Intel's latest desktop chipsets are down under 100ns and
their servers should be somewhere around 130-150ns (though I haven't
seen many tests for the latter).
Since the quote didn't provide nearly enough information to interpret
the latency claims as absolute numbers, I was careful to characterize
it as a measure of relative latency. I'm assuming that IBM would have
the integrity to do an apples-to-apples comparison with their own
hardware, no matter what the absolute numbers may mean. Were the
project manager in marketing, he might have been shrewd enough to say
that they shaved over a hundred nanoseconds off the chipset latency.
I'd be reluctant to say that IBM server chipsets had high latency for
server chipsets based on that sound bite. You can do as you please,
but see another comparison to a previous Summit generation below.
64-bit support should offer about a 10% improvement all on its own.
Combine that with a 20% increase in clock speed and a 66% faster
system bus... Also, if I understand the whole "dual bus" idea
properly (ie 2 buses with 2 CPUs connected to each one in a 4P system
vs. 4 processors on a single bus in a traditional Xeon system) I think
this could make up for a lot of the difference as well. This is
exactly how Intel's new E8500 chipset's "dual bus" design operates as
well.
I'd be surprised to learn that server applications are driving
frontside bus bandwidth requirements. One of the reasons you can get
away with hanging so much hardware off a frontside bus in server
applications is that server CPU's spend so much of their time stalled
for memory--a latency, not a bandwidth, problem. Predictable,
computationally-intensive calculations are typically the most
demanding of bandwidth.
A nice chipset to be sure, but I think people are singing the praises
a bit too much and too soon. My guess is that it's only going to end
up being no more than 5% faster than Intel's new E8500 chipset. Sure,
it's a hell of a lot faster than the previous generation of
chipset/processor combination, but it's how this compares to current
chipset/processor combos that matters. IBM just happened to be first
out the door with benchmarks this time around, but I expect others to
follow suit soon enough.
IBM may have taken a lesson from HP:

http://www.lostcircuits.com/tradeshow/idf_2002/4.shtml

<quote>

The server market with its higher longevity of equipment was hurt even
worse than the desktop market, in addition, the platforms available
for the IPF were designed for future scalability and expandability and
somewhat missed the current economic requirements. Examples are the
i870 and the IBM EXA (Summit) chipsets geared towards the very
high-end and comparable with 80,000 lbs trucks. To drop a bomb into
this scenario, Hewlett Packard showcased their zx1, comparable with a
high performance street bike to outrun the competition before they
even know what hit them.

The concept is fairly simple. Take the IPF 64 bit architecture, pare
it free of all excessive fat and provide a platform suitable for both
IA64 as well as for the IPF-compatible PA-RISC processor line.
Features trimmed off comprise the 32 MB L4 cache (IBM EXA), Memory
Mirroring to ensure hot-swapping of DIMMs and x-way scalability. The
result is an up to 4-way scalable platform with enhanced ECC or rather
memory protection to allow Chip Kill. Heart of the chipset is the zx1
Memory & I/O controller featuring eight I/O links to PCI and PCI-X as
well as AGP-4X (to be upgraded to AGP-8X). On the other side, the zx1
controller offers links to no less than 12 memory expander chips
capable of handling up to 64 DIMMs for 128 GB of system memory. Memory
bandwidth scales from 8.5 GB/s in direct-attached designs (without the
optional expanders) to 12.8 GB/s using the expander chips that further
act like registers to decrease the signal load on the memory bus.

This is, however, not the key advantage of the zx1. Because of the
high complexity and scalability, the i870 and EXA chipsets are
relatively slow. That is, in addition to the 32 ns latency intrinsic
to McKinley for each memory access, the arbitration within the complex
maze of superscalable interconnections cause another roughly 270 ns
latency until the requested data get back to the processor, so we are
talking about a total of 300 ns access time for a memory request. The
zx1 on the other hand manages to do the quarter mile in 11.2 seconds,
er, make that 112 ns for the memory access latency which is almost 3
times as fast (in direct-attached configurations). Adding the expander
chips costs another 25 ns but compensates with higher bandwidth and
the zx1 is still about twice as fast as the competition.

</quote>

That may also help to put the stated latencies into some perspective
(previous generation Summit compared to zx1 in almost the same way).
Notice the disappearance of the L4 cache (and X3 does away with L3, as
well). A three-year program from IBM? The timing is just about
right.

Intel can design a chip that will come close in performance? I'm sure
they can. Will they? Intel's track record on chipsets has been
spotty (to be charitable, at that).

The only real problem left in computation is getting the data where
you want it when you need it. The parts that do the computing are
almost afterthoughts compared to the machinery dedicated to getting
instructions and data to arrive on time and coping with what happens
when they don't. It's about time the memory subsystem got more
attention, and I hope this isn't the end of it.

.... Of course, you could rid yourself of most of these problems
entirely by changing the whole computing paradigm, but that's for
another thread.

RM
 
The roadmaps are loaded with sacrificial elements.
P4 is a Dead Chip Walking...
Sacrificial, or butt-covering and misleading? Intel seems to have
made more out of Prescott than one might have imagined, given how
disappointing the first results were and how quickly they backed away.
One might have imagined a world of Xeon firesales in a market being
swept by Opteron. Xeon has taken a licking, but it's still ticking.

If Intel really can scale Prescott to 65nm, that would be news, I
think, since it seems like they barely got it to work at 90nm. Or
maybe they've made significant progress or maybe they think they'll
make significant progress by the time the chips are to be released.

If they do throw NetBurst overboard, they'll have a lot of explaining
to do, I think. "You remember that architecture you liked so much,
you know, the Pentium IIII, well we've listened to you and we've
decided that the right thing to do is to give the market what it
really wanted all along, only better." I'll bet Pentium M isn't a
superstar on SpecFP. Not for nothing are the guys in marketing at
Intel so important. The guys who tweak icc for the benchmarks
probably earn their money, too.

I wonder if Intel knows what it's really going to release.

RM
 
In comp.sys.ibm.pc.hardware.chips Robert Myers said:
There is an interview with an IBM project manager at
http://www.techworld.com/opsys/features/index.cfm?FeatureID=1204
<quote>
At design time, there was a maniacal focus on latency reduction. When
you can cut the time it takes to get from one point to the next you
can increase performance, so chipset latency has been cut by two and
half times -- down from 265 nanoseconds to 108 nanoseconds.
</quote>

This is extremely important. Latency improvements have lagged
horribly (an early PC's latency was less than the time to transfer one
byte at full bandwidth; current machines wait the equivalent of 300+
bytes -- 256-byte cachelines, anyone? :)

Latency has been consuming an ever larger share of compute time and
matters more for performance improvements. Especially TPMC, which AFAIK
is a relational database benchmark with linked-lists that boils
down to a massive pointer chasing exercise governed by latency.

I'd like to see how some of the AMD K8s with on-CPU memory
controllers do on TPMC.

-- Robert
 
This is extremely important. Latency improvements have lagged
horribly (an early PC's latency was less than the time to transfer one
byte at full bandwidth; current machines wait the equivalent of 300+
bytes -- 256-byte cachelines, anyone? :)

Wasn't part of NetBurst the 128-byte L2 cache line? :-) That is two
sectors of 64 bytes, of course.
Latency has been consuming an ever larger share of compute time and
matters more for performance improvements. Especially TPMC, which AFAIK
is a relational database benchmark with linked-lists that boils
down to a massive pointer chasing exercise governed by latency.

I'd like to see how some of the AMD K8s with on-CPU memory
controllers do on TPMC.

The fastest one listed here:
http://www.tpc.org/tpcc/results/tpcc_results.asp?print=false&orderby=tpm&sortby=desc
AFAICT is the HP ProLiant DL585/2.6GHz - does OK but nothing spectacular
and as already pointed out the Hurricane based IBM eServer xSeries 366
whacks it, though DB2 (vs. SQL Server) contributes a good amount (~half) of
the difference.

OTOH there are only HP and a couple of obsolete Racksaver Opteron systems
listed... and the above IBM system is close to $1M, so ~2.7 times the cost
of the HP Opteron system. As already noted, Sun is notably absent and IBM
apparently does not market Opteron into this market.
 
In comp.sys.ibm.pc.hardware.chips George Macdonald said:
The fastest one listed here:
http://www.tpc.org/tpcc/results/tpcc_results.asp?print=false&orderby=tpm&sortby=desc
AFAICT is the HP ProLiant DL585/2.6GHz - does OK but nothing

Yes, that is a 4-way Opteron. I fear that such a setup would
require a Northbridge and eliminate the single-thread latency
advantage of an on-CPU memory controller. Does anyone know?

I very much like SMP, but I think I like on-CPU memory
controllers even more. Maybe like Tony I should wait for
dual cores before I replace my aging BP6 (dual Celerons)

-- Robert
 
Well that was because those Ultrasparcs couldn't compete against
anybody either on absolute perf or price/perf. Nowadays, they have a
compelling price/perf story, so you'll likely see them publish
again.

If we take the premise of the non-competitiveness of the UltraSPARCs as
truth (since I post from .hp.com I have to be a bit circumspect :)
wouldn't Sun still have "issues" with the comparison of Opteron to
UltraSPARC? If the benchmark were suddenly "reformed" and so OK to
publish using Sun, Opteron-based systems, it would seem to continue to
beg the question of why it is not suited for UltraSPARC systems.

rick jones
 
Yes, that is a 4-way Opteron. I fear that such a setup would
require a Northbridge and eliminate the single-thread latency
advantage of an on-CPU memory controller. Does anyone know?

Eliminate?... Compromise?:-) It's not what you'd call a "turkey", with
performance just 10K points lower than the Hurricane system... and at 1/2.7
the cost, it's certainly a bargain. You mean "require a Northbridge" to
get better performance as opposed to the Hypertransport links with
worst-case 2-hop memory accesses? I don't know how practical it would be
but IMO AMD should look to bumping performance on the local HT links.
I very much like SMP, but I think I like on-CPU memory
controllers even more. Maybe like Tony I should wait for
dual cores before I replace my aging BP6 (dual Celerons)

Sounds like a plan. It's not clear to me from the roadmaps how long a
socket 939/940 dual core will exist - there seems to be some overlap with
the socket M2 chips and DDR-II memory controllers. Could be there's going
to be a window of err, opportunity.
 
In said:
Eliminate?... Compromise?:-) It's not what you'd call a "turkey",
with performance just 10K points lower than the Hurricane
system... and at 1/2.7 the cost, it's certainly a bargain.
You mean "require a Northbridge" to get better performance

Of course SMP requires a Northbridge for better overall
SMP performance. Mostly by keeping banks open and running
concurrent precharges. But a Northbridge _cannot_ improve a
single random fetch. It's just silicon in the way, and will
want to buffer or queue.
as opposed to the Hypertransport links with worst-case 2-hop
memory accesses? I don't know how practical it would be but
IMO AMD should look to bumping performance on the local HT links.

2 hop? Sounds ugly. Under what circumstances?

Frankly, I'm a little surprised no-one runs any latency
benchmarks on RAM. A little pointer-chasing exercise isn't
hard to write, and would be very revealing.

Hey, I resemble that remarque :) Maybe I should go write one!

-- Robert
 
Of course SMP requires a Northbridge for better overall
SMP performance. Mostly by keeping banks open and running
concurrent precharges. But a Northbridge _cannot_ improve a
single random fetch. It's just silicon in the way, and will
want to buffer or queue.


2 hop? Sounds ugly. Under what circumstances?

When you have 4 CPUs interconnected with 3 HT-links each and at least one
of those has to be used for I/O, some of the accesses have to involve two
hops.
Frankly, I'm a little surprised no-one runs any latency
benchmarks on RAM. A little pointer-chasing exercise isn't
hard to write, and would be very revealing.

Dave Wang has discussed it in some detail - one of his pet subjects I
believe. IIRC he was measuring round-trip times the "hard" way with
probes.
 
Since the quote didn't provide nearly enough information to interpret
the latency claims as absolute numbers, I was careful to characterize
it as a measure of relative latency. I'm assuming that IBM would have
the integrity to do an apples-to-apples comparison with their own
hardware, no matter what the absolute numbers may mean. Were the
project manager in marketing, he might have been shrewd enough to say
that they shaved over a hundred nanoseconds off the chipset latency.
I'd be reluctant to say that IBM server chipsets had high latency for
server chipsets based on that sound bite. You can do as you please,
but see another comparison to a previous Summit generation below.

I would say that it's safe to assume IBM has sufficient integrity to
do an apples-to-apples comparison. However I just wanted to point out
that their reduction in latency has happened at about the same time
that everyone else in the industry has also been working hard to
reduce latency in chipsets. Given the numbers it looks like IBM has
been more successful than anyone else; cutting 150ns off memory
latency is very impressive. Other companies (in particular Intel and
nVidia on the desktop side and presumably Intel on the server side as
well) only managed about a 100ns reduction in the same time frame.
I'd be surprised to learn that server applications are driving
frontside bus bandwidth requirements. One of the reasons you can get
away with hanging so much hardware off a frontside bus in server
applications is that server CPU's spend so much of their time stalled
for memory--a latency, not a bandwidth, problem.

There are limits to everything, and remember that the P4/Xeon core
seems to be rather bandwidth-hungry. Keep in mind that the old Xeons
had 4 processors hanging off a single 400MT/s, 64-bit wide bus. That
was only 3.2GB/s of memory bandwidth for 4 cores running at 3.0GHz.
You don't need very high bandwidth requirements before that becomes a
bottleneck.
Predictable,
computationally-intensive calculations are typically the most
demanding of bandwidth.

Indeed, and if SPEC CFP2000_rate scores are anything to go by, the old
4P XeonMP systems absolutely sucked in such situations.
That may also help to put the stated latencies into some perspective
(previous generation Summit compared to zx1 in almost the same way).
Notice the disappearance of the L4 cache (and X3 does away with L3, as
well). A three-year program from IBM? The timing is just about
right.

Yup, sounds reasonable.
Intel can design a chip that will come close in performance? I'm sure
they can. Will they? Intel's track record on chipsets has been
spotty (to be charitable, at that).

Indeed, particularly for server chipsets. However, one would assume
that they DO have the resources and know-how to design such a chipset
if they felt it was needed. They never seemed to worry much about
latency on their desktop chipsets until the i865/i875, but there they
managed to cut ~100ns off the latency of these chips when compared to
the previous generation.
The only real problem left in computation is getting the data where
you want it when you need it. The parts that do the computing are
almost afterthoughts compared to the machinery dedicated to getting
instructions and data to arrive on time and coping with what happens
when they don't. It's about time the memory subsystem got more
attention, and I hope this isn't the end of it.

I would say that it's only the beginning. The logical next step is to
integrate the memory controller right onto the CPU itself...
... Of course, you could rid yourself of most of these problems
entirely by changing the whole computing paradigm, but that's for
another thread.

I'll leave that thread to you! :>
 
Dave Wang has discussed it in some detail - one of his pet
subjects I believe. IIRC he was measuring round-trip times
the "hard" way with probes.

Well, the deed is done (code below). Perhaps not as sharp as
bus-snooping, but at least this gives program-visible read latency:

Latency(ns)  System CPU@MHz  mem.ctl      RAM

144 P3@1000 laptop SO-PC133?
148 2*P3@860 Serverworks ??
178 P4@1800 i850 RDRAM
184 K7@1667 SiS735 PC133
185 P3@600 440BX PC100
217 2*Cel@500 440BX PC90
234 P2@350 440BX PC100?
288 P2@333 440BX PC66

I do need to find & test some more modern systems, but I'm
underwhelmed by the slowness of latency improvement.



compile: $ gcc -O2 lat10m.c
run: $ time ./a.out [multiply user time by 100 to give ns]

/* lat10m.c - Measure latency of 10 million fresh memory reads
(C) Copyright 2005 Robert Redelmeier - GPL v2.0 licence granted */
int p[ 1<<21 ] ;
int main (void) {
int i, j ;
/* build a pointer chain with a -5000 stride, wrapped to the array */
for ( i=0 ; i < 1<<21 ; i++ ) p[i] = 0x1FFFFF & (i-5000) ;
/* chase the chain: each load depends on the previous one */
for ( j=i=0 ; i < 9600000 ; i++ ) j = p[j] ;
return j ; }


-- Robert
 