IBM to build Opteron-Cell hybrid supercomputer of 1 PetaFlop performance

  • Thread starter: AirRaidJet
http://news.zdnet.com/2100-9584_22-6112439.html

IBM to build Opteron-Cell hybrid supercomputer
By Stephen Shankland, CNET News.com
Published on ZDNet September 5, 2006, 1:12 PM PT



IBM has won a bid to build a supercomputer called Roadrunner that will
include not just conventional Opteron chips but also the Cell processor
used in the Sony PlayStation 3, CNET News.com has learned.

The supercomputer, for the Los Alamos National Laboratory, will be the
world's fastest machine and is designed to sustain a performance level
of a "petaflop," or 1 quadrillion calculations per second, said U.S.
Sen. Pete Domenici earlier this year. Bidding for the system opened in
May, when a congressional subcommittee allocated $35 million for the
first phase of the project, said Domenici, a Republican from New
Mexico, where the nuclear weapons lab is located.

Now sources familiar with the machine have said that IBM has won the
contract and that the National Nuclear Security Administration is
expected to announce the deal in coming days. The system is expected to
be built in phases, beginning in September and finishing by 2007 if the
government chooses to build the full petaflop system.

There's plenty of competition in the high-end supercomputing race,
though. Japan's Institute of Physical and Chemical Research, called
RIKEN, announced in June that it had completed its Protein Explorer
supercomputer. The Protein Explorer reached the petaflop level, RIKEN
said, though not using the conventional Linpack supercomputing speed
test.

Representatives of IBM and Los Alamos declined to comment for this
story. The NNSA, which oversees U.S. nuclear weapons work at Los Alamos
and other sites, didn't immediately respond to a request for comment.

Hybrid supercomputers
The Roadrunner system, along with the Protein Explorer and the
seventh-fastest supercomputer, Tokyo Institute of Technology's Tsubame
system built by Sun Microsystems, illustrates a new trend in
supercomputing: combining general-purpose processors with
special-purpose accelerator chips.

"Roadrunner is emphasizing acceleration technologies. Coprocessor
acceleration is intrinsic to that particular design," said John
Gustafson, chief technology officer of start-up ClearSpeed
Technologies, which sells the accelerator add-ons used in the Tsubame
system. (Gustafson was referring to the Roadrunner project in general,
not to IBM's winning bid, of which he disclaimed knowledge.)

IBM's BladeCenter systems are amenable to the hybrid approach. A single
chassis can accommodate both general-purpose Opteron blade servers and
Cell-based accelerator systems. The BladeCenter chassis includes
high-speed communication links among the servers, and one source said
the blades will be used in Roadrunner.

Advanced Micro Devices' Opteron processor is used in supercomputing
"cluster" systems that spread computing work across numerous small
machines joined with a high-speed network. In the case of Roadrunner,
the Cell processor, designed jointly by IBM, Sony and Toshiba, provides
the special-purpose accelerator.

Cell originally was designed to improve video game performance in the
PlayStation 3 console. The single chip's main processor core is
augmented by eight special-purpose processing cores that can help with
calculations such as simulating the physics of virtual worlds. Those
engines also are amenable to scientific computing tasks, IBM has said.

Using accelerators "expands dramatically" the amount of processing a
computer can accomplish for a given amount of electrical power,
Gustafson said.

"If we keep pushing traditional microprocessors and using them as
high-performance computing engines, they waste a lot of energy. When
you get to the petascale regions, you're talking tens of megawatts when
using traditional x86 processors" such as Opteron or Intel's Xeon, he
said.

"A watt is about a dollar a year if you have the things on all the
time," so 10 megawatts of continuous draw equates to $10 million a year
in operating expenses, Gustafson said.

A new partnership
The Los Alamos-IBM alliance is noteworthy for another reason as well.
The Los Alamos lab has traditionally favored supercomputers from
manufacturers other than IBM, including Silicon Graphics, Compaq and
Linux Networx. Its sister lab and sometimes rival, Lawrence Livermore,
has had the Big Blue affinity, housing the current top-ranked
supercomputer, Blue Gene/L.

Livermore also houses earlier Big Blue behemoths such as ASC Purple,
ASCI White and ASCI Blue Pacific. (ASCI stood for the Accelerated
Strategic Computing Initiative, a federal effort to hasten
supercomputing development to perform nuclear weapons simulation work,
but has since been modified to the Advanced Simulation and Computing
program.)

Blue Gene/L has a sustained performance of 280 teraflops, just more
than one-fourth of the way to the petaflop goal.

The U.S. government has become an avid supercomputer customer, using
the machines for simulations to ensure nuclear weapons will continue to
work even as they age beyond their original design lifespans. Such
physics simulations have grown increasingly sophisticated, moving from
two to three dimensions, but more is better. Los Alamos expects
Roadrunner will increase the detail of simulations by a factor of 10,
one source said.

For the twice-yearly ranking of supercomputers called the Top500 list,
computers are ranked on the basis of a benchmark called Linpack, which
measures how many floating-point operations per second--"flops"--a
machine can perform. Linpack is a convenient but incomplete
representation of a machine's total ability, but it's nevertheless
widely watched.

IBM has dominated the Top500 list with its Blue Gene/L supercomputing
designs. But U.S. models haven't always led, and there's been some
international rivalry: A Japanese system, NEC's Earth Simulator, topped
the list for years.

IBM and petaflop computing are no strangers. Although customers can buy
the current Blue Gene/L systems or rent their processing power from
IBM, Blue Gene actually began as a research project in 2000 to reach
the petaflop supercomputing level.
 
Hmm, I wonder if this is part of AMD's Torrenza initiative? That is, is
the Cell processor going to use coherent HyperTransport links?

And/Or this could be the explanation for AMD taking a paid license to
Rambus IP a while back??
 
George said:
And/Or this could be the explanation for AMD taking a paid license to
Rambus IP a while back??
I would say "neither" based on the following in the press release..

"Designed specifically to handle a broad spectrum of scientific and
commercial applications, the supercomputer design will include new,
highly sophisticated software to orchestrate over 16,000 AMD Opteron(TM)
processor cores and over 16,000 Cell B.E. processors in tackling some of
the most challenging problems in computing today. The revolutionary
supercomputer will be capable of a peak performance of over 1.6
petaflops (or 1.6 thousand trillion calculations per second).

The machine is to be built entirely from commercially available hardware
and based on the Linux(R) operating system. IBM(R) System x(TM) 3755
servers based on AMD Opteron technology will be deployed in conjunction
with IBM BladeCenter(R) H systems with Cell B.E. technology. Each system
used is designed specifically for high performance implementations."

So you can look up the Cell Blades and the 3755 server.
 
Del said:
I would say "neither" based on the following in the press release..

"Designed specifically to handle a broad spectrum of scientific and
commercial applications, the supercomputer design will include new,
highly sophisticated software to orchestrate over 16,000 AMD Opteron(TM)
processor cores and over 16,000 Cell B.E. processors in tackling some of
the most challenging problems in computing today. The revolutionary
supercomputer will be capable of a peak performance of over 1.6
petaflops (or 1.6 thousand trillion calculations per second).

The machine is to be built entirely from commercially available hardware
and based on the Linux(R) operating system. IBM(R) System x(TM) 3755
servers based on AMD Opteron technology will be deployed in conjunction
with IBM BladeCenter(R) H systems with Cell B.E. technology. Each system
used is designed specifically for high performance implementations."

I wonder what the rationale is behind using two different instruction
set architectures? What sort of problems will be sent to the Opterons
and what sort will be sent to the Cells? Why not use Cells for it all?

Yousuf Khan
 
Yousuf said:
I wonder what the rationale is behind using two different instruction
set architectures? What sort of problems will be sent to the Opterons
and what sort will be sent to the Cells? Why not use Cells for it all?

Risk reduction, I would think. Current developer tools for Cell are
fairly primeval. Oh, sure, gcc exists and compiles programs. But hand
over the Cell to an average C coder and watch the fun ensue. One
currently has to code what executes on the SPUs using gcc intrinsics
(aka glorified assembly.) That's not so bad, per se, but what gets
interesting is watching people get their minds around hand
parallelizing and vectorizing their code and then watching them debug.

Having x86_64 around means that you can run a chunk of code using
well-understood tools.
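To make the contrast concrete, here's a rough sketch (mine, not anything
from the Roadrunner toolchain) of what hand-vectorizing even a trivial
loop looks like. On a real SPU the four-wide block would be spu_madd()
intrinsics on 128-bit vector registers; plain C stands in for them here:

```c
#include <stddef.h>

/* Scalar saxpy: what an "average C coder" writes and hopes
   the compiler will vectorize on its own. */
void saxpy_scalar(size_t n, float a, const float *x, float *y) {
    for (size_t i = 0; i < n; i++)
        y[i] += a * x[i];
}

/* Hand-vectorized 4-wide version: the same loop restructured the way
   SPU code must be, processing four lanes per step with an explicit
   scalar remainder.  On the SPU the body of the first loop collapses
   into one vector multiply-add. */
void saxpy_4wide(size_t n, float a, const float *x, float *y) {
    size_t i = 0;
    for (; i + 4 <= n; i += 4) {      /* "vector" body */
        y[i+0] += a * x[i+0];
        y[i+1] += a * x[i+1];
        y[i+2] += a * x[i+2];
        y[i+3] += a * x[i+3];
    }
    for (; i < n; i++)                /* scalar remainder */
        y[i] += a * x[i];
}
```

Same arithmetic, but now the programmer, not the compiler, owns the data
layout and the remainder handling -- which is exactly where the "fun"
starts.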

I sense there's a new evolution in compilers going to happen in the
near future to address these multi-core processor issues. Cell's not the
only multi-core processor with an avant-garde design; compilers will
have to deal with Niagara's threading intricacies too. I wouldn't
expect to see much software that takes advantage of the SPUs in the
near future. My understanding is that game engine developers are
likewise staying away from using the SPUs at this point in time.

<shameless plug> Some of the afternoon speakers at my "General-Purpose
GPU: Practice and Experience" workshop will be talking about these very
issues. Workshop's web page is at http://www.gpgpu.org/sc2006/workshop/
</shameless plug>
 
Scott Michel said:
Risk reduction, I would think. Current developer tools for Cell are
fairly primeval. Oh, sure, gcc exists and compiles programs. But hand
over the Cell to an average C coder and watch the fun ensue. One
currently has to code what executes on the SPUs using gcc intrinsics
(aka glorified assembly.) That's not so bad, per se, but what gets
interesting is watching people get their minds around hand
parallelizing and vectorizing their code and then watching them debug.

Isn't the instruction-set for the Cell dependent on what memory accesses you
are going to use? Access to local memory vs. accessing remote memory of
sorts...
 
Scott Michel said:
Yousuf said:
Del Cecchi wrote:
[...]


I sense there's a new evolution in compilers going to happen in the
near future to address these multi-core processor issues. Cell's not the
only multi-core processor with an avant-garde design; compilers will
have to deal with Niagara's threading intricacies too.

Some nit picking here, sorry:


What threading intricacies, exactly? FWIW, I address scalability with
lock-free reader patterns and high-performance memory allocators:


http://groups.google.com/group/comp...ee855/c36b50d37c2ebaca?hl=en#c36b50d37c2ebaca


I would not feel intimidated by Niagara.. No special compilers are needed...
Just C, POSIX, and SPARC V9 assembly language will get you outstanding
scalability and throughput characteristics on UltraSPARC T1...


Any thoughts?



BTW, I would be happy to discuss 64-bit lock-free programming on Niagara...
I have a T2000 and I can assert that all of the "threading intricacies" are
efficiently solved through clever use of lock-free programming...
 
Scott said:
Risk reduction, I would think. Current developer tools for Cell are
fairly primeval. Oh, sure, gcc exists and compiles programs. But hand
over the Cell to an average C coder and watch the fun ensue. One
currently has to code what executes on the SPUs using gcc intrinsics
(aka glorified assembly.) That's not so bad, per se, but what gets
interesting is watching people get their minds around hand
parallelizing and vectorizing their code and then watching them debug.

But that's quite the hedge, 16,000 Opterons to back up 16,000 Cells?

What I was really getting at was whether there's some particular set of
FP problems that are done better on Opteron, while others are done
better on Cell?

Also Cray seems to create systems with management processors, where a
few processors are dedicated to tasks such as traffic management and
I/O access. Perhaps the Opterons are better at this sort of task than
the Cells?

Speaking of Cray, they seem to be getting very fond of pairing Opterons
with Clearspeed processors now.

Yousuf Khan
 
Yousuf said:
Speaking of Cray, they seem to be getting very fond of pairing Opterons
with Clearspeed processors now.

Yousuf Khan

Sorry, instead of Clearspeed that should read DRC Computer's chips.

DRC Computer Corporation
http://www.drccomputer.com/

I think Sun is packaging Clearspeed chips with their Opterons, rather
than Cray. Lots of choices available I guess.

Yousuf Khan
 
In comp.arch Scott Michel said:
I sense there's a new evolution in compilers going to happen in the
near future to address these multi-core processor issues. Cell's not the
only multi-core processor with an avant-garde design; compilers will
have to deal with Niagara's threading intricacies too. I wouldn't
expect to see much software that takes advantage of the SPUs in the
near future. My understanding is that game engine developers are
likewise staying away from using the SPUs at this point in time.

Or maybe what happens is what has happened time and again, and the magic
compilers fail to show up. Especially compilers that can work their
magic on bad old code.
 
Chris said:
Isn't the instruction-set for the Cell dependent on what memory accesses you
are going to use? Access to local memory vs. accessing remote memory of
sorts...

No question that data and message orchestration are going to be keeping
compiler researchers very happy for the foreseeable future. Your
question only applies to the SPUs, however. Existing tools will work
just fine on the PPC64 primary processor.

But the original question was why both Cell and Opteron...?
 
Chris said:
Some nit picking here, sorry:

Wouldn't be USENET if there weren't... :-)
What threading intricacies, exactly? FWIW, I address scalability with
lock-free reader patterns and high-performance memory allocators:

Hot lock contention that ends up serializing threads. More of a poor
programming practice in multithreaded applications than a processor
problem. It's something that has to be considered, although compilers
won't necessarily dig one out of that hole.
I would not feel intimidated by Niagara.. No special compilers are needed...
Just C, POSIX, and SPARC V9 assembly language will get you outstanding
scalability and throughput characteristics on UltraSPARC T1...

I'm not intimidated by Niagara. My agenda is twofold: (a) doing
technology refresh risk assessments for various customers, (b) looking
for the next cool research topic for the next 5-year research epoch.
Lock-free is usually good (personally, I've always been a fan of
LL-SC), but sometimes seemed to lead to pathological conditions.
Pathological conditions are generally bad for embedded or space
systems.
BTW, I would be happy to discuss 64-bit lock-free programming on Niagara...
I have a T2000 and I can assert that all of the "threading intricacies" are
efficiently solved through clever use of lock-free programming...

Cool. Would like to hear more about better practices.
 
Yousuf said:
But that's quite the hedge, 16,000 Opterons to back up 16,000 Cells?

What I was really getting at was whether there's some particular set of
FP problems that are done better on Opteron, while others are done
better on Cell?

Cell's single FP is just like nVidia and ATI GPUs: they round to 0
(truncate). This means that you have to resort to iterative
error-correcting algorithms to compensate for the inevitable numerical
drift. You don't want to take a significant double FP performance hit on
Cell (LLNL already has a paper out on this that circulated in the
newsgroup a while back.)

It turns out that even with this implementation of single FP and having
to iterate, you're still going to be faster than the double FP unit.
Turns out to be true on Intel's superscalar too.
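The iterative error correction being described has the same shape as
classic mixed-precision refinement: start from a cheap single-precision
estimate, then run Newton steps at higher precision until the error is
gone. A toy sketch (reciprocal only -- my illustration, not the LLNL
paper's method):

```c
/* Mixed-precision reciprocal: a crude single-precision estimate stands
   in for Cell's fast-but-sloppy single FP; Newton-Raphson steps in
   double then recover full accuracy.  Each iteration roughly doubles
   the number of correct digits, so three steps are ample. */
double refined_reciprocal(double a) {
    double y = (double)(1.0f / (float)a);  /* cheap single-FP estimate */
    for (int i = 0; i < 3; i++)
        y = y * (2.0 - a * y);             /* Newton step for 1/a */
    return y;
}
```

The point is that the fast unit plus a few correction iterations can
still beat running everything through the slow full-precision unit.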
Also Cray seems to create systems with management processors, where a
few processors are dedicated to the tasks such as traffic management and
i/o access. Perhaps the Opterons are better at this sort of task than
the Cells?

Dunno. My personal opinion is that it's just risk reduction given the
state of the developer tools.
 
Scott Michel said:
Wouldn't be USENET if there weren't... :-)


Hot lock contention that ends up serializing threads.

Yeah... You can distribute the locks with a hash to help out in this area:


http://groups.google.com/group/comp...lnk=gst&q=multi-mutex&rnum=1#3ca11e0c3dcf762c


Something like lock-based transactional memory...
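The multi-mutex idea boils down to hashing an address into a small table
of locks, so unrelated addresses rarely contend. A minimal sketch (table
size and hash constant are arbitrary choices of mine, not from Chris's
linked code):

```c
#include <pthread.h>
#include <stdint.h>
#include <stddef.h>

#define NLOCKS 64  /* arbitrary; more locks, less contention */

static pthread_mutex_t lock_table[NLOCKS];

static void locks_init(void) {
    for (size_t i = 0; i < NLOCKS; i++)
        pthread_mutex_init(&lock_table[i], NULL);
}

/* Hash the protected object's address down to a lock index; the
   multiplier is a Fibonacci-hash constant to spread nearby
   addresses across the table. */
static size_t lock_index(const void *addr) {
    return ((uintptr_t)addr * 2654435761u >> 6) % NLOCKS;
}

static void addr_lock(const void *addr) {
    pthread_mutex_lock(&lock_table[lock_index(addr)]);
}

static void addr_unlock(const void *addr) {
    pthread_mutex_unlock(&lock_table[lock_index(addr)]);
}
```

One global hot lock becomes NLOCKS mostly-cold ones; the trade-off is
that operations needing several objects at once must acquire their locks
in index order to avoid deadlock.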




FWIW, here are some of my thoughts on transactional memory:


http://groups.google.com/group/comp...b0a40/5f4afc338f3dd221?hl=en#5f4afc338f3dd221


http://groups.google.com/group/comp...8ae64/eefe66fd067bdb67?hl=en#eefe66fd067bdb67


http://groups.google.com/group/comp.programming.threads/msg/7c4f5ba87e36fd79?hl=en


As you can see, I don't like transactional memory very much...

;^(...



More of a poor
programming practice in multithreaded applications than a processor
problem. It's something that has to be considered, although compilers
won't necessarily dig one out of that hole.
Agreed.





I'm not intimidated by Niagara.

Good to hear... Stuff like this has me wary of programmers' skills wrt
multi-threading:


http://groups.google.com/group/comp...47926/5301d091247a4b16?hl=en#5301d091247a4b16
(read all)


An IEEE fellow seems to think threads are far too complicated for any
"normal" programmer to even begin to grasp...



My agenda is twofold: (a) doing
technology refresh risk assessments for various customers, (b) looking
for the next cool research topic for the next 5-year research epoch.
Lock-free is usually good (personally, I've always been a fan of
LL-SC),

Yeah.. More on this at *end of msg...



but sometimes seemed to lead to pathological conditions.
Pathological conditions are generally bad for embedded or space
systems.

Please clarify...

Well, it has been my experience that "loopless" lock-free algorithms are
the best for real-time systems... For instance, consider a lock-free
single-producer/single-consumer queue... If a real-time system is going
to use this queue, it has to have an explicit answer for exactly how
long its push and pop operations will take, no matter what the load of
the system is like... For a lock-free queue to be usable to a hard
real-time system it has to be able to assert that its push operation is
loopless and has exactly X instructions, and its pop operation is
loopless and has exactly X instructions.
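The shape being described -- push and pop with a fixed instruction count
and no retry loop -- is a plain SPSC ring buffer. A minimal sketch using
C11 atomics (my illustration of the idea, not the AppCore implementation
linked below):

```c
#include <stdatomic.h>
#include <stdbool.h>

#define QSIZE 16  /* must be a power of two for the modulo to be safe
                     with free-running unsigned counters */

/* Loopless single-producer/single-consumer queue: head is written only
   by the consumer, tail only by the producer, so neither operation
   ever needs a CAS retry loop -- worst-case cost is a fixed handful
   of instructions. */
typedef struct {
    int buf[QSIZE];
    _Atomic unsigned head;  /* consumer's cursor */
    _Atomic unsigned tail;  /* producer's cursor */
} spsc_queue;

static bool spsc_push(spsc_queue *q, int v) {
    unsigned t = atomic_load_explicit(&q->tail, memory_order_relaxed);
    unsigned h = atomic_load_explicit(&q->head, memory_order_acquire);
    if (t - h == QSIZE) return false;          /* full */
    q->buf[t % QSIZE] = v;
    atomic_store_explicit(&q->tail, t + 1, memory_order_release);
    return true;
}

static bool spsc_pop(spsc_queue *q, int *out) {
    unsigned h = atomic_load_explicit(&q->head, memory_order_relaxed);
    unsigned t = atomic_load_explicit(&q->tail, memory_order_acquire);
    if (t == h) return false;                  /* empty */
    *out = q->buf[h % QSIZE];
    atomic_store_explicit(&q->head, h + 1, memory_order_release);
    return true;
}
```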


Here is an example of my implementation of such a queue:


http://appcore.home.comcast.net/

http://groups.google.com/group/comp...41352/d154b56f0f233cef?hl=en#d154b56f0f233cef


LL/SC does not really fit the bill... You have to implement logic that uses
LL/SC in a loop. You can predict exactly how many times a thread will retry.
Its similar to the live-lock-like situations that are inherent in
obstruction-free algorithms...


Is this the kind of 'pathological conditions' you were getting at?



Cool. Would like to hear more about better practices.

*Well, read all of this to start off:


http://groups.google.com/group/comp...gst&q=chris+thomasson&rnum=6#04cb5e2ca2a7e19a


Where do you want to go from here?

Humm...
 
Ooops!

Chris Thomasson said:
Scott Michel said:
Chris Thomasson wrote:
[...]

LL/SC does not really fit the bill... You have to implement logic that
uses LL/SC in a loop. You can predict exactly how many times a thread will
retry.
^^^^^^^^^^


You CAN'T predict exactly how many times a thread will retry.

Its similar to the live-lock-like situations that are inherent in
obstruction-free algorithms...


Is this the kind of 'pathological conditions' you were getting at?

[...]


Sorry for any confusion.
 
Or maybe what happens is what has happened time and again, and the magic
compilers fail to show up. Especially compilers that can work their
magic on bad old code.

That sounds about right to me.

Even for traditional SIMD MPP designs compilers don't do a very good
job with naive code, and those designs (and compilers targeting them)
have been around for decades now. Conversely, there have been several
languages and language extensions targeting such hardware that do let
programmers write good parallel code with a minimum of pain -- which
almost nobody has adopted. In my own personal experience, llc and mpc
from the mid-1980s come immediately to mind: simple extensions to C
giving parallel datatypes and operations, with an open-source compiler,
which nobody outside of one small research group ever adopted. Such
a language might be well suited to things like Cell -- but don't count
on anyone ever learning to use it.
 
Or maybe what happens is what has happened time and again, and the magic
compilers fail to show up. Especially compilers that can work their
magic on bad old code.

Hey, the compiler writers had all their brain cells used up
trying to generate code for the x86 architecture.
You're gonna have to wait for a whole new generation
of compiler writers, which is gonna be tricky since
practically every university computer science program
is now nothing but web design and javascript :-).
 
But the original question was why both Cell and Opteron...?

Opteron so they can get the performance they need?
Cell because IBM makes 'em and they can unload a bunch
of them on the gummint while they are at it?
(Just a theory :-).
 