Roadrunner Supercomputer using 12,960 CELL Processors Hits 1 PetaFlop(1000 TeraFlops) of double-prec

  • Thread starter AirRaid

AirRaid

Yes, this is cross-posted, because IMO it deserves to be. This is a
milestone in computing performance, thus, IMO, it should be as widely
posted as possible.

This isn't spam whatsoever, so if you want to get pissed that it's a
cross post, okay,
just don't think it's spam.

I posted about the Roadrunner supercomputer back in September 2006
when it was announced. Now it's up and running.




http://graphics8.nytimes.com/images/2008/06/09/business/09petaflop.enlarge.jpg

Military Supercomputer Sets Record

By JOHN MARKOFF
Published: June 9, 2008

SAN FRANCISCO — An American military supercomputer, assembled from
components originally designed for video game machines, has reached a
long-sought-after computing milestone by processing more than 1.026
quadrillion calculations per second.


The Roadrunner supercomputer costs $133 million and will be used to
study nuclear weapons.

The new machine is more than twice as fast as the previous fastest
supercomputer, the I.B.M. BlueGene/L, which is based at Lawrence
Livermore National Laboratory in California.

The new $133 million supercomputer, called Roadrunner in a reference
to the state bird of New Mexico, was devised and built by engineers
and scientists at I.B.M. and Los Alamos National Laboratory, based in
Los Alamos, N.M. It will be used principally to solve classified
military problems to ensure that the nation’s stockpile of nuclear
weapons will continue to work correctly as they age. The Roadrunner
will simulate the behavior of the weapons in the first fraction of a
second during an explosion.

Before it is placed in a classified environment, it will also be used
to explore scientific problems like climate change. The greater speed
of the Roadrunner will make it possible for scientists to test global
climate models with higher accuracy.

To put the performance of the machine in perspective, Thomas P.
D’Agostino, the administrator of the National Nuclear Security
Administration, said that if all six billion people on earth used hand
calculators and performed calculations 24 hours a day and seven days a
week, it would take them 46 years to do what the Roadrunner can in one
day.

The machine is an unusual blend of chips used in consumer products and
advanced parallel computing technologies. The lessons that computer
scientists learn by making it calculate even faster are seen as
essential to the future of both personal and mobile consumer
computing.

The high-performance computing goal, known as a petaflop — one
thousand trillion calculations per second — has long been viewed as a
crucial milestone by military, technical and scientific organizations
in the United States, as well as a growing group including Japan,
China and the European Union. All view supercomputing technology as a
symbol of national economic competitiveness.

By running programs that find a solution in hours or even less time —
compared with as long as three months on older generations of
computers — petaflop machines like Roadrunner have the potential to
fundamentally alter science and engineering, supercomputer experts
say. Researchers can ask questions and receive answers virtually
interactively and can perform experiments that would previously have
been impractical.

“This is equivalent to the four-minute mile of supercomputing,” said
Jack Dongarra, a computer scientist at the University of Tennessee who
for several decades has tracked the performance of the fastest
computers.

Each new supercomputing generation has brought scientists a step
closer to faithfully simulating physical reality. It has also produced
software and hardware technologies that have rapidly spilled out into
the rest of the computer industry for consumer and business products.

Technology is flowing in the opposite direction as well. Consumer-
oriented computing began dominating research and development spending
on technology shortly after the cold war ended in the late 1980s, and
that trend is evident in the design of the world’s fastest computers.

The Roadrunner is based on a radical design that includes 12,960 chips
that are an improved version of an I.B.M. Cell microprocessor, a
parallel processing chip originally created for Sony’s PlayStation 3
video-game machine. The Sony chips are used as accelerators, or
turbochargers, for portions of calculations.

The Roadrunner also includes a smaller number of more conventional
Opteron processors, made by Advanced Micro Devices, which are already
widely used in corporate servers.

“Roadrunner tells us about what will happen in the next decade,” said
Horst Simon, associate laboratory director for computer science at the
Lawrence Berkeley National Laboratory. “Technology is coming from the
consumer electronics market and the innovation is happening first in
terms of cellphones and embedded electronics.”

The innovations flowing from this generation of high-speed computers
will most likely result from the way computer scientists manage the
complexity of the system’s hardware.

Roadrunner, which consumes roughly three megawatts of power, or about
the power required by a large suburban shopping center, requires three
separate programming tools because it has three types of processors.
Programmers have to figure out how to keep all of the 116,640
processor cores in the machine occupied simultaneously in order for it
to run effectively.

“We’ve proved some skeptics wrong,” said Michael R. Anastasio, a
physicist who is director of the Los Alamos National Laboratory. “This
gives us a window into a whole new way of computing. We can look at
phenomena we have never seen before.”

Solving that programming problem is important because in just a few
years personal computers will have microprocessor chips with dozens or
even hundreds of processor cores. The industry is now hunting for new
techniques for making use of the new computing power. Some experts,
however, are skeptical that the most powerful supercomputers will
provide useful examples.

“If Chevy wins the Daytona 500, they try to convince you the Chevy
Malibu you’re driving will benefit from this,” said Steve Wallach, a
supercomputer designer who is chief scientist of Convey Computer, a
start-up firm based in Richardson, Tex.

Those who work with weapons might not have much to offer the video
gamers of the world, he suggested.

Many executives and scientists see Roadrunner as an example of the
resurgence of the United States in supercomputing.

Although American companies had dominated the field since its
inception in the 1960s, in 2002 the Japanese Earth Simulator briefly
claimed the title of the world’s fastest by executing more than 35
trillion mathematical calculations per second. Two years later, a
supercomputer created by I.B.M. reclaimed the speed record for the
United States. The Japanese challenge, however, led Congress and the
Bush administration to reinvest in high-performance computing.

“It’s a sign that we are maintaining our position,” said Peter J.
Ungaro, chief executive of Cray, a maker of supercomputers. He noted,
however, that “the real competitiveness is based on the discoveries
that are based on the machines.”

Having surpassed the petaflop barrier, I.B.M. is already looking
toward the next generation of supercomputing. “You do these record-
setting things because you know that in the end we will push on to the
next generation and the one who is there first will be the leader,”
said Nicholas M. Donofrio, an I.B.M. executive vice president.

By breaking the petaflop barrier sooner than had been generally
expected, the United States’ supercomputer industry has been able to
sustain a pace of continuous performance increases, improving a
thousandfold in processing power in 11 years. The next thousandfold
goal is the exaflop, which is a quintillion calculations per second,
followed by the zettaflop, the yottaflop and the xeraflop.

http://www.nytimes.com/2008/06/09/technology/09petaflops.html?hp
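
A few of the figures quoted in the article can be sanity-checked with simple arithmetic. The sketch below is only a reader's back-of-envelope check, not anything taken from the article. It assumes 9 cores per Cell chip (one PPE plus eight SPEs), a 365.25-day year, and reads "thousandfold in 11 years" as steady exponential growth.

```python
import math

# Figures quoted in the article.
CELL_CHIPS = 12_960
SUSTAINED_FLOPS = 1.026e15      # calculations per second
PEOPLE = 6e9
CALCULATOR_YEARS = 46

# Assumption (not stated in the article): each Cell chip exposes
# 1 PPE + 8 SPEs = 9 cores.
CORES_PER_CELL = 9
cell_cores = CELL_CHIPS * CORES_PER_CELL

# Work done by Roadrunner in one day.
ops_per_day = SUSTAINED_FLOPS * 24 * 3600

# Per-person rate implied by the hand-calculator comparison,
# assuming a 365.25-day year.
person_seconds = PEOPLE * CALCULATOR_YEARS * 365.25 * 24 * 3600
implied_rate = ops_per_day / person_seconds

# Doubling time if a thousandfold gain is spread evenly over 11 years.
doubling_years = 11 / math.log2(1000)

print(f"Cell cores: {cell_cores:,}")
print(f"Implied calculator rate: {implied_rate:.1f} calculations/s per person")
print(f"Doubling time: {doubling_years:.2f} years")
```

The core count reproduces the article's 116,640, which suggests that figure counts only the Cell cores. The calculator comparison works out to roughly ten calculations per second per person.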


other articles:

http://www.engadget.com/2008/06/09/worlds-fastest-ibms-roadrunner-supercomputer-breaks-petaflop/
http://www.theregister.co.uk/2008/06/09/roadrunner_supercomputer_debut/
http://www.itproportal.com/articles...drunner-supercomputer-break-petaflop-barrier/
http://www.theinquirer.net/gb/inquirer/news/2008/06/09/ibm-helps-military-achieve
 
Yes, this is cross-posted, because IMO it deserves to be. This is a
milestone in computing performance, thus, IMO, it should be as widely
posted as possible.

This isn't spam whatsoever, so if you want to get pissed that it's a
cross post, okay,
just don't think it's spam.

I posted about the Roadrunner supercomputer back in September 2006
when it was announced. Now it's up and running.


Nothing special about it tbh.
If we had moved into DNA, chemical, or optical processors to manage such a
task then I would say it is special.
The fact that they have managed to bolt different types of processors
onto a board (think GPU, CPU, and MMPU) and then link the boards together
(think NIC) really doesn't show any innovation.

I know my example isn't the same as what we have going on in these servers,
but in essence, anyone who wants to throw 400m into a project can see them
double what this kit is doing, so it's nothing but a monetary step.
 
Nothing special about it tbh.
If we had moved into DNA, chemical, or optical processors to manage such a
task then I would say it is special.
The fact that they have managed to bolt different types of processors
onto a board (think GPU, CPU, and MMPU) and then link the boards together
(think NIC) really doesn't show any innovation.

I know my example isn't the same as what we have going on in these servers,
but in essence, anyone who wants to throw 400m into a project can see them
double what this kit is doing, so it's nothing but a monetary step.

It's hard to know what your perspective could be. The general
question is: what could you do if you coupled a high-end out-of-order
general-purpose CPU with a beefy compute engine and a serious
interconnect? Cell (or similar compute-intensive hardware) has
important implications for power consumption. Balancing the memory,
the bandwidth of all the interconnects, the switches, and the compute
density of each node (how many processors? connected how?) is more
than just buying a bunch of Macs and hooking them together with
gigabit Ethernet.

I'd like to see a serious discussion of this machine, but this thread
with all these cross-posts isn't the right place to be doing it. The
one petaflop crap is pure national labs PR. The interesting stuff is
elsewhere. I'm sorry if you can't see that.

Robert.
 
It's hard to know what your perspective could be. The general
question is: what could you do if you coupled a high-end out-of-order
general-purpose CPU with a beefy compute engine and a serious
interconnect? Cell (or similar compute-intensive hardware) has
important implications for power consumption. Balancing the memory,
the bandwidth of all the interconnects, the switches, and the compute
density of each node (how many processors? connected how?) is more
than just buying a bunch of Macs and hooking them together with
gigabit Ethernet.

I'd like to see a serious discussion of this machine, but this thread
with all these cross-posts isn't the right place to be doing it. The
one petaflop crap is pure national labs PR. The interesting stuff is
elsewhere. I'm sorry if you can't see that.

Robert.

I'd agree with mr deo.
Roadrunner's interconnects look like a big step backward from other
PR-heavy American supercomputers. Hopefully they are not as boring as
those in the Virginia Tech cluster that you mentioned, but they are not
in the same class as Cray/Sandia Red Storm, SGI/NASA Columbia or
IBM/LLNL BlueGene/L.

Also, it looks like the programming model for Roadrunner (3 separate
architectures and many programmer-visible levels of memory hierarchy)
is much harder to use effectively than that of just about any big
supercomputer built to date.

Last but not least, performance per watt and per cubic meter seem
seriously worse than BlueGene's.
 
On a positive note, successful testing of Roadrunner means that
IBM has the ability to manufacture a new variant of Cell with a
fully pipelined double-precision FPU in production quantity.
The IBM web site indicates that the new engine is available to mere mortals: http://www-03.ibm.com/systems/bladecenter/hardware/servers/qs22/index...



Interesting thing about the IBM PowerXCell 8i Processor is that it
offers 4 to 5 (IBM says 5) times the double precision FP performance
of the original Cell Processor.

Depending on various factors such as having 7 or 8 SPEs active,
counting the PPE or not counting it, and clockspeed, the original CELL
could manage 218 to 256 to just under 300 GFLOPs of single precision
FP. When double precision is needed performance drops massively, down
to around 25 GFLOPs.

The IBM PowerXCell 8i is said to be capable of over 100 GFLOPs
double precision. That's a huge increase without adding more SPEs or
upping clockspeed.
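
For anyone who wants to see where a figure like "over 100 GFLOPs" can come from, here is a rough peak-rate sketch. The microarchitectural numbers are my assumptions, not something stated in this thread: 8 SPEs at 3.2 GHz, 2 double-precision SIMD lanes per SPE, and one fused multiply-add (counted as 2 flops) per lane per cycle. The chip count and the 1.026 petaflop result are taken from the article.

```python
def peak_gflops(units, simd_lanes, flops_per_lane_per_cycle, clock_ghz):
    """Theoretical peak = units x lanes x flops per lane per cycle x GHz."""
    return units * simd_lanes * flops_per_lane_per_cycle * clock_ghz

# Assumptions (not stated in this thread): 8 SPEs per chip at 3.2 GHz,
# 2 double-precision SIMD lanes per SPE, and one fused multiply-add
# (counted as 2 flops) per lane per cycle. The PPE and the Opterons
# are left out.
spe_dp_gflops = peak_gflops(units=8, simd_lanes=2,
                            flops_per_lane_per_cycle=2, clock_ghz=3.2)

CELL_CHIPS = 12_960          # from the article
LINPACK_PFLOPS = 1.026       # from the article

cell_peak_pflops = spe_dp_gflops * CELL_CHIPS / 1e6
fraction_of_peak = LINPACK_PFLOPS / cell_peak_pflops

print(f"Per-chip DP peak: {spe_dp_gflops:.1f} GFLOPS")
print(f"Cell-only machine peak: {cell_peak_pflops:.2f} PFLOPS")
print(f"Reported result as a fraction of that peak: {fraction_of_peak:.0%}")
```

Under these assumptions the per-chip figure comes out just above 100 GFLOPS, which is consistent with the claim above, and the reported result lands at roughly three quarters of the Cell-only peak.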

PowerXCell 8i cannot be considered a next-generation CELL, only an
enhanced first-gen CELL.

IBM plans to put 32 SPEs on the next-gen CELL to hit 1 TFLOP (single
precision I would imagine) in a single chip by 2010. There was also
an official roadmap that showed a CELL with 64 SPEs on a process
smaller than 45nm (be it 32nm, 22nm, I don't know). I posted about
both in the past.

It's clear that the IBM-Toshiba-Sony CELL is proving to be much more
useful beyond PS3 than the Sony-Toshiba 'Emotion Engine' ever was,
which really had no use outside of PS2 and cheap, home-made university
"supercomputers" such as the one using 60 or 70 PS2s at UIC in IL.

Roadrunner is serious stuff, and it's only the beginning. In the next
decade we'll see more powerful supercomputers using next-gen CELLs.
 
Interesting thing about the IBM PowerXCell 8i Processor is that it
offers 4 to 5 (IBM says 5) times the double precision FP performance
of the original Cell Processor.

Depending on various factors such as having 7 or 8 SPEs active,
counting the PPE or not counting it, and clockspeed, the original CELL
could manage 218 to 256 to just under 300 GFLOPs of single precision
FP. When double precision is needed performance drops massively, down
to around 25 GFLOPs.

The IBM PowerXCell 8i is said to be capable of over 100 GFLOPs
double precision. That's a huge increase without adding more SPEs or
upping clockspeed.

PowerXCell 8i cannot be considered a next-generation CELL, only an
enhanced first-gen CELL.

IBM plans to put 32 SPEs on the next-gen CELL to hit 1 TFLOP (single
precision I would imagine) in a single chip by 2010. There was also
an official roadmap that showed a CELL with 64 SPEs on a process
smaller than 45nm (be it 32nm, 22nm, I don't know). I posted about
both in the past.

It's clear that the IBM-Toshiba-Sony CELL is proving to be much more
useful beyond PS3 than the Sony-Toshiba 'Emotion Engine' ever was,
which really had no use outside of PS2 and cheap, home-made university
"supercomputers" such as the one using 60 or 70 PS2s at UIC in IL.

Roadrunner is serious stuff, and it's only the beginning. In the next
decade we'll see more powerful supercomputers using next-gen CELLs.

Though the CELL may prove more useful, it still has some serious arch
issues that a lot of people don't like. Programming the thing is not
the easiest thing in the world to do (though parallel programming
models for CMPs are somewhat of an open issue). It's nice that they
will continue to push performance, but many GPUs will be well above
1TFLOPS SPFP by 2010... which means that obviously Larrabee will be
out then. While I'm glad that CELL came out as it's enlightened the
world and solved several of the multicore integration problems, it
simply doesn't seem like the chip of the future right now. I simply
haven't heard a whole lot of interest from the HPC community on CELL,
but you never know... I could be wrong.
 
I'd agree with mr deo.
Roadrunner interconnects look like a big step backward from other PR-
heavy American supercomputers.

http://www.lanl.gov/orgs/hpc/roadru...ng, Performance & Results/031908_RR_model.pdf

The predicted worst-case latency is about the same as Blue-Gene. Red
Storm routing/switching looks like Blue Gene. Columbia uses both
Infiniband and Numalink in a fat tree like Roadrunner.

Hopefully, they are not as boring as in
Virginia Tech cluster that you mentioned, but not in the same class as
Cray/Sandia Red Storm, SGI/NASA Columbia or IBM/LLNL BlueGene/L.

Could you be more specific than, "Hopefully they are not as boring?"
Blue Gene and Red Storm have an interconnect topology that's fine for
problems with good locality. Not so good for problems requiring
global communication. Columbia and Roadrunner use fat trees, much
better for the problems that interest me the most.
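
To make the topology point concrete, here is a rough link-counting sketch. It is a simplification under stated assumptions: a 64 x 32 x 32 torus stands in for a 65,536-node Blue Gene/L-class machine, and the fat tree is idealised as untapered (full bisection), which a real installation may not be. The function names are made up for illustration.

```python
from math import prod

def torus_bisection_links(dims):
    """Links cut when a torus is split in half across its longest dimension.

    The cut crosses two planes of links because of the wraparound.
    """
    longest = max(dims)
    plane = prod(dims) // longest   # nodes in one cross-section
    return 2 * plane

def full_fat_tree_bisection_links(nodes):
    """An idealised, untapered fat tree keeps nodes/2 links across any bisection."""
    return nodes // 2

# Assumption: a 64 x 32 x 32 torus, standing in for a 65,536-node
# Blue Gene/L-class machine; the same node count is used for the fat tree.
dims = (64, 32, 32)
nodes = prod(dims)

torus = torus_bisection_links(dims)
tree = full_fat_tree_bisection_links(nodes)

print(f"torus:    {torus} links, {torus / nodes:.4f} per node")
print(f"fat tree: {tree} links, {tree / nodes:.4f} per node")
```

This only counts links. It says nothing about how fast each link is relative to the compute attached to it, which matters just as much for global communication.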

Also it looks like programming model for Roadrunner (3 separate
architecture and many programmer-visible levels of memory hierarchy)
is much harder to use effectively than just about any big
supercomputer built up to date.

Coprocessors are here to stay. At the moment, they will be painfully
visible to the programmer.

Last, but not least, performance per watt and per cubic meter seem
seriously worse than BlueGene.

Not important for the time being. FP-intensive coprocessors can be
very effective in flops/watt. Neither the OOO-processor nor the FP
coprocessor was optimized for power performance. The more important
point, for the moment, is to show that all those flops can actually be
put to use.

To get back to the previous poster's point, building a machine with
lots of flops requires only money. Actually using those flops for
something other than linpack is another matter.

Robert.
 
In comp.sys.super AirRaid said:
Yes, this is cross-posted, because IMO it deserves to be. This is a
milestone in computing performance, thus, IMO, it should be as widely
posted as possible.

This isn't spam whatsoever, so if you want to get pissed that it's a
cross post, okay,
just don't think it's spam.

I posted about the Roadrunner supercomputer back in September 2006
when it was announced. Now it's up and running.




http://graphics8.nytimes.com/images/2008/06/09/business/09petaflop.enlarge.jpg

Military Supercomputer Sets Record

By JOHN MARKOFF
Published: June 9, 2008

SAN FRANCISCO — An American military supercomputer, assembled from
components originally designed for video game machines, has reached a
long-sought-after computing milestone by processing more than 1.026
quadrillion calculations per second.


The Roadrunner supercomputer costs $133 million and will be used to
study nuclear weapons.

The new machine is more than twice as fast as the previous fastest
supercomputer, the I.B.M. BlueGene/L, which is based at Lawrence
Livermore National Laboratory in California.

The new $133 million supercomputer, called Roadrunner in a reference
to the state bird of New Mexico, was devised and built by engineers
and scientists at I.B.M. and Los Alamos National Laboratory, based in
Los Alamos, N.M. It will be used principally to solve classified
military problems to ensure that the nation’s stockpile of nuclear
weapons will continue to work correctly as they age. The Roadrunner
will simulate the behavior of the weapons in the first fraction of a
second during an explosion.

I wonder what they really do with these computers.

I find it unlikely they still don't know how nuclear weapons work,
especially considering they're mature technology and they've been around
for decades.
 
http://www.lanl.gov/orgs/hpc/roadrunner/pdfs/Kerbyson - RR Mode...

The predicted worst-case latency is about the same as Blue-Gene. Red
Storm routing/switching looks like Blue Gene. Columbia uses both
Infiniband and Numalink in a fat tree like Roadrunner.

My understanding of p.10 is that the 2-way latency between SPEs on two
neighboring triblades is ~8 usec, i.e. about the same as node-to-node
latency in the whole 65K-node machine.
I didn't find any estimates for worst-case latency in the 10K-node
configuration. My personal uneducated guess: 10 times worse than BG/L
in the worst case and 5 times worse in the average loaded case.
 
My understanding of p.10 is that the 2-way latency between SPEs on two
neighboring triblades is ~8 usec, i.e. about the same as node-to-node
latency in the whole 65K-node machine.
I didn't find any estimates for worst-case latency in the 10K-node
configuration. My personal uneducated guess: 10 times worse than BG/L
in the worst case and 5 times worse in the average loaded case.

I took the "worst case 2-way Infiniband" latency to be the worst case
for the mesh fabric. The advertised worst case for one version of
Blue Gene was, I think, 5 microseconds. The latency between the
Opteron and the Cell Processor is another matter. In any case, it has
nothing to do with the mesh fabric.

BG/L will do fine on some kinds of problems. Just not ones requiring
significant global communication. That was my beef about Blue Gene
and Red Storm.

Robert.
 
I took the "worst case 2-way Infiniband" latency to be the worst case
for the mesh fabric.

I read it as a latency within a triblade that does not include the fabric.
The advertised worst case for one version of
Blue Gene was, I think, 5 microseconds.

The full original version was closer to 9 microseconds. Since the
machine is currently almost twice as big as it was back then, I'd
guess that today they are at 10 microseconds.
The latency between the
Opteron and the Cell Processor is another matter. In any case, it has
nothing to do with the mesh fabric.

That's the point. I saw nothing about latency/bandwidth
characteristics of the IB switches used in the roadrunner.
BG/L will do fine on some kinds of problems. Just not ones requiring
significant global communication. That was my beef about Blue Gene
and Red Storm.

It depends on the kind of communication.
If the communication consists mostly of small bi-directional messages,
these machines seem to be much better than anything else in existence.
For large bandwidth-bound messages on BG/L the picture is less
rosy. I didn't see numbers for Cray XTn machines of comparable size (not
sure they exist); in theory they should be significantly better than BG/L.
However, I do not see why Roadrunner should be any better for
bandwidth-bound global communication. If anything, I expect it to do
worse, esp. in the worst case.
 
Though the CELL may prove more useful, it still has some serious arch
issues that a lot of people don't like. Programming the thing is not the
easiest thing in the world to do (though parallel programming models for
CMPs are somewhat of an open issue). It's nice that they will continue
to push performance, but many GPUs will be well above 1TFLOPS SPFP by
2010..

How reasonable is it to complain about the ease of programming the CELL,
and in the very next sentence, go on about GPUs?
 
How reasonable is it to complain about the ease of programming the CELL,
and in the very next sentence, go on about GPUs?

It's a pretty reasonable assumption that given the recent and
significant architectural changes of GPUs that they will continue
becoming more general purpose and programmer friendly. I prolly don't
need to mention the name Larrabee. I've programmed for CELL, and I've
programmed in CUDA... my personal opinion is that GPUs in 2010 overall
will be more appealing. That's IMO, and of course opinions can be
debated and may one day be found wrong.
 
It depends on the kind of communication.
If the communication consists mostly of small bi-directional messages,
these machines seem to be much better than anything else in existence.
For large bandwidth-bound messages on BG/L the picture is less
rosy. I didn't see numbers for Cray XTn machines of comparable size (not
sure they exist); in theory they should be significantly better than BG/L.
However, I do not see why Roadrunner should be any better for
bandwidth-bound global communication. If anything, I expect it to do
worse, esp. in the worst case.

Well, actually, it looks like you're right. I calculated the
bisection bandwidth of BG/L to be in the tens of millibytes/flop (because
of the geometry and large mesh), and it turns out that the bisection
bandwidth of Roadrunner could be no better than 10 millibytes/flop,
because of the number of flops loaded onto a single node by way of the
Cell processors (about 100 gigaflops DP per node) connected by an
InfiniBand interconnect (about 1 gigabyte per second) that is wimpy by
comparison. :-(

Using the fat-tree topology addresses the scalability of bisection
bandwidth, but the individual links aren't properly scaled to the
nodes.
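
The 10 millibytes/flop figure is just the ratio of the two rough numbers above. A minimal sketch of that arithmetic, using only the per-node values given in this post (the function name and the "flops per double" framing are mine, for illustration):

```python
def bytes_per_flop(link_bytes_per_s, node_flops_per_s):
    """Interconnect bytes available per floating-point operation."""
    return link_bytes_per_s / node_flops_per_s

NODE_FLOPS = 100e9    # about 100 gigaflops DP per node (figure from this post)
LINK_BYTES = 1e9      # about 1 gigabyte per second per link (figure from this post)

ratio = bytes_per_flop(LINK_BYTES, NODE_FLOPS)

# Another way to read the same ratio: how many flops a node could
# perform in the time it takes to move one 8-byte double off-node.
flops_per_double = 8 / ratio

print(f"{ratio * 1000:.0f} millibytes/flop")
print(f"about {flops_per_double:.0f} flops per 8-byte double sent")
```

With these inputs a node can do on the order of 800 flops in the time it takes to ship a single double over the link, which is the imbalance being complained about.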

Robert.
 
I wonder what they really do with these computers.

I find it unlikely they still don't know how nuclear weapons work,
especially considering they're mature technology and they've been around
for decades.

Try these Google searches: ~simulation site:lanl.gov
~simulation site:llnl.gov

I get over 30000 page hits. If you surf around and even ponder how to
zero in on the nuclear weapons work, you will have your answer.

Question: We've discovered a warehouse full of Viet Nam-era artillery
shells. Should we just ship them to a war zone, or count on them
working in case of war? I mean, we *do* know how artillery shells
work, don't we?

Robert.
 
Though the CELL may prove more useful, it still has some serious arch
issues that a lot of people don't like. Programming the thing is not
the easiest thing in the world to do (though parallel programming
models for CMPs are somewhat of an open issue). It's nice that they
will continue to push performance, but many GPUs will be well above
1TFLOPS SPFP by 2010... which means that obviously Larrabee will be
out then. While I'm glad that CELL came out as it's enlightened the
world and solved several of the multicore integration problems, it
simply doesn't seem like the chip of the future right now. I simply
haven't heard a whole lot of interest from the HPC community on CELL,
but you never know... I could be wrong.

AMD's RV770 GPU coming out this month is already at 1 TFLOP and the
R700 product with two RV770 GPUs on a single card (4870 X2) which is
due out this August or September should be around 2 TFLOP. Of
course this is a GPU (or GPGPUs) and is not as programmable as
CELL.

Larrabee should, however, change that. I think Larrabee and
anything like it, with a manycore architecture (beyond multicore), is
the future. It'll be interesting to see how the next-gen CELL
compares to Larrabee.
 
There are no nuclear weapons. It's a huge conspiracy. The fact is, we use
these computers to model human behavior to learn how to keep the ignorant
masses under our control. Have you noticed lately funny noises when you
talk on the phone? It's because we are monitoring you. You are close to
learning our secret, and we are keeping an eye on you.

Yes, Zootal, you and your co-conspirators are secretly in charge of
everything. Or at least that's what *they* want you to think...
 
In comp.sys.ibm.pc.hardware.chips Cydrome Leader said:
I find it unlikely they still don't know how nuclear weapons
work, especially considering they're mature technology and
they've been around for decades.

A two-stage thermonuclear warhead is a surprisingly complex
device -- look up Teller-Ulam. To work properly, the design has to
transfer enough light energy to ignite and burn the fusion secondary
before the fission primary shock waves disassemble the device.
There's _lots_ to simulate here in at least 2D over many timeslices.

Yes, we know how to make them go bang. Just follow the recipe.
But we don't always know the critical parts of that recipe, and
what parts we could change.

In general, the whole field of Finite-Element computation is
still short of cycles and can swallow everything available.
Multi-CPU clusters are still being built. Imagine being able
to simulate vehicle collisions -- designers would be able
to determine where metal could be added or other changes to
improve occupant survival.
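
As a rough illustration of why grid-based simulation can "swallow everything available", here is a toy cost model. It is a sketch under my own assumptions, not a description of any real weapons or crash code: an explicit time-stepped grid where the cell count grows as the refinement factor raised to the number of dimensions, and the number of time steps grows linearly with the refinement factor (a CFL-style stability limit).

```python
def relative_work(refinement, dims):
    """Work multiplier when grid spacing shrinks by `refinement` in every dimension.

    Assumes cell count grows as refinement**dims and the number of
    time steps grows linearly with refinement (a CFL-style limit).
    """
    cells = refinement ** dims
    steps = refinement
    return cells * steps

for dims in (2, 3):
    for refinement in (2, 4, 10):
        work = relative_work(refinement, dims)
        print(f"{dims}D, {refinement}x finer grid: {work}x the work")
```

Under this model, a grid 10 times finer in 3D costs 10,000 times the work, so a thousandfold faster machine buys less than one such refinement.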

-- Robert
 