Using a GPU for supercomputing?

YKhan

I always thought that it was not desirable to do supercomputing on a
GPU, because the floating point is single-precision only and HPC
requires double precision. Has that requirement now been relaxed?

AMD Makes Graphics Chip Perform Unnatural Acts @ STORAGE & SECURITY
JOURNAL
"It's a tweaked ATI graphics processor - the year-old R580 core and its
48 cores used in Radeon X1900 graphics cards - but now appearing on a
PCI Express add-in board with 1GB of GDDR3 memory capable of doing a
massive 360 gigaflops in compute-intensive stream computing
applications - if there were any."
http://issj.sys-con.com/read/303048.htm
 
YKhan said:
I always thought that it was not desirable to do supercomputing on a
GPU, because the floating point is single-precision only and HPC
requires double precision. Has that requirement now been relaxed?
Judging by private correspondence I had (and, I wish pointedly to add,
not with anyone from IBM), I'm guessing that double precision floating
point was added to Cell's capability with HPC and IBM's biggest
customer in mind. I was told that Cell (and IBM in general) were not
players because of the double precision issue. Next thing I know, Cell
has double precision FP.

That said, there are plenty of high-performance applications that
require only single precision. Mercury Computer Systems, a DoD
supplier that is now building systems based on Cell, has for years
produced systems using DSP chips that were only single-precision.

For general purpose HPC, though, you are exactly correct: DP FP is a
requirement, and GPUs don't yet have DP FP. I expect that to change,
though, as people get better at exploiting the GPU architecture and the
resources of the GPU are increasingly eyed for capability.

Folding@Home's (FAH's) decision to use chips with SP FP is an
interesting one. Usually, for calculations with many time steps, like
long-term predictions of orbits, precision is a big issue, and extended
precision is often used.
I'm not familiar with the discussions, but they may feel that,
compared to the crudeness of some of the approximations used, floating
point precision is a small issue.
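
To make the time-step point concrete, here's a tiny C sketch (nothing
to do with FAH's actual code): accumulate a 1e-3 time step ten million
times in float and in double. The single-precision total drifts well
away from the exact answer of 10000 because, once the running total is
large, a 24-bit mantissa can no longer resolve the small increment.

#include <stdio.h>

int main(void)
{
    const double dt = 1.0e-3;
    const long nsteps = 10000000;      /* ten million time steps */
    float  t_single = 0.0f;
    double t_double = 0.0;

    for (long i = 0; i < nsteps; i++) {
        t_single += (float)dt;         /* 24-bit mantissa */
        t_double += dt;                /* 53-bit mantissa */
    }

    /* exact answer is 10000; the float total falls visibly short */
    printf("single: %.4f   double: %.10f\n", t_single, t_double);
    return 0;
}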

Robert.
 
Robert said:
That said, there are plenty of high-performance applications that
require only single precision. Mercury Computer Systems, a DoD
supplier that is now building systems based on Cell, has for years
produced systems using DSP chips that were only single-precision.

For general purpose HPC, though, you are exactly correct: DP FP is a
requirement, and GPUs don't yet have DP FP. I expect that to change,
though, as people get better at exploiting the GPU architecture and the
resources of the GPU are increasingly eyed for capability.

Well, I'm wondering if this has something to do with the changing
nature of the HPC market. I guess a lot of the HPC market now includes
movie special-effects companies (even if they don't yet figure in the
Top500 lists). Their end product may not require as much precision as
previous generations of HPC software did.

Yousuf Khan
 
Yousuf said:
Well, I'm wondering if this has something to do with the changing
nature of the HPC market. I guess a lot of the HPC market now includes
movie special-effects companies (even if they don't yet figure in the
Top500 lists). Their end product may not require as much precision as
previous generations of HPC software did.

As a frequent poster to comp.arch would tell you, the Top 500 list
doesn't include a significant number of huge clusters owned by users
who have no particular desire to tell the world what they are doing or
how much computer capacity it takes to do it. That, and running a
Linpack benchmark on a huge cluster is expensive. Big clusters that
wouldn't show up on the Top 500 include those run by oil companies,
computer animation houses, and three-letter agencies doing spook work.

I'm sure that I've seen articles about computer animation houses using
GPUs as computation engines so, in a sense, that bridge has already
been crossed.

The real problem is software. Building a high-performance stream
processor is not the real challenge. Getting it to do something useful
in anything other than a highly-specialized niche is. What we're
watching (I think) is a slow migration of mind share to stream
processors because:

1. Processors with register files have reached a performance wall.
2. In raw compute capacity, there are off-the-shelf stream processors
that can outperform conventional microprocessors.
3. People are gradually figuring out how to get things done with stream
processors.
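
To picture what the stream model asks of the programmer, here is a
plain-C caricature (no real GPU API, just the shape of the model): you
write a pure per-element kernel, and the hardware applies it across a
stream; on an R580-class part the runtime spreads that work over the
48 shader processors instead of running a serial loop.

#include <stdio.h>
#include <stddef.h>

/* the kernel: depends only on its arguments, writes only its result */
static float saxpy_kernel(float a, float x, float y)
{
    return a * x + y;
}

/* the "stream" operation: apply the kernel to every element; on a GPU
   the iterations can run in parallel because they are independent */
static void stream_map(float a, const float *x, const float *y,
                       float *out, size_t n)
{
    for (size_t i = 0; i < n; i++)
        out[i] = saxpy_kernel(a, x[i], y[i]);
}

int main(void)
{
    float x[4] = {1, 2, 3, 4}, y[4] = {10, 20, 30, 40}, out[4];
    stream_map(2.0f, x, y, out, 4);
    for (int i = 0; i < 4; i++)
        printf("%g ", out[i]);
    printf("\n");
    return 0;
}

The hard part, of course, is recasting a real application so that its
inner loops actually look like that.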

People aren't going to go out and build stream processors with DP FP if
there is no guaranteed market. DSP chips with DP FP are available, but
not a big part of the market.

You can be certain that, when IBM added DP to Cell, it had a guarantee
of a market. In the meantime, people are going to be doing whatever
they can with whatever they can buy off the shelf. When enough people
become believers, stream processors with DP FP will become as common as
conventional microprocessors with DP FP.

Robert.
 
Robert said:
You can be certain that, when IBM added DP to Cell, it had a guarantee
of a market. In the meantime, people are going to be doing whatever
they can with whatever they can buy off the shelf. When enough people
become believers, stream processors with DP FP will become as common as
conventional microprocessors with DP FP.

Just a small note about Blue Gene/L

"IBM won two Gordon Bell Awards: The award for Peak Performance was
given to IBM's "Large-Scale Electronic Structure Calculations of High-Z
Metals on the Blue Gene/L Platform" team.

The award for Special Achievement was given to "The Blue Gene/L
Supercomputer and Quantum Chromodynamics" project team.

IBM swept the HPC Challenge Class One Awards winning the following
categories: HPL (Linpack) with Blue Gene/L and Blue Gene/W as runner-up;
Stream with Blue Gene/L and ASC Purple as runner up; FFT with Blue
Gene/L; and Random Access with Blue Gene/L and Blue Gene/W as runner up."

Gee I wonder what "High-Z" metals the team was calculating on? :-)
 
Del said:
"IBM won two Gordon Bell Awards: The award for Peak Performance was
given to IBM's "Large-Scale Electronic Structure Calculations of High-Z
Metals on the Blue Gene/L Platform" team."

Gee I wonder what "High-Z" metals the team was calculating on? :-)

http://www.hpcc.gov/hecrtf-outreach/20040112_cra_hecrtf_report.pdf

"...developing realistic models of lanthanides and actinides on complex
mineral surfaces for environmental remediation, and for developing new
catalysts that are more energy efficient and generate less pollution."

The actinides are the obvious ones; the lanthanides are rare earth
elements. Presumably, the actinides are nuclear waste and the
lanthanides are being used as catalysts. The acquisition of Blue Gene
by LLNL was apparently a hurry-up job. Hanford?

The FFT result and the QCD result are more intriguing. IBM's own
documents showed that the per-processor efficiency of Blue Gene falls
precipitously for FFTs at a rather small processor count, and the
network has limitations that seem just as problematical for QCD. I
wouldn't be surprised to find that we are back to the days of huge
machines running at 10% efficiency. Either that, or something has been
done to beef up the network. Interesting results.

Robert.
 
YKhan said:
I always thought that it was not desirable to do supercomputing on a
GPU, because the floating point is single-precision only and HPC
requires double precision. Has that requirement now been relaxed?

http://www.hpcwire.com/hpc/692906.html

Less is More: Exploiting Single Precision Math in HPC
by Michael Feldman
Editor, HPCwire

Some of the most widely used processors for high-performance computing
today demonstrate much higher performance for 32-bit floating point
arithmetic (single precision) than for 64-bit floating point
arithmetic (double precision). These include the AMD Opteron, the
Intel Pentium, the IBM PowerPC, and the Cray X1. These architectures
demonstrate approximately twice the performance for single precision
execution when compared to double precision.

And although not currently widely used in HPC systems, the Cell
processor has even greater advantages for 32-bit floating point
execution. Its single precision performance is 10 times better than
its double precision performance.

At this point you might be thinking -- so what? Everyone knows double
precision rules in HPC. And while that's true, the difference in
performance between single precision and double precision is a
tempting target for people who want to squeeze more computational
power out of their hardware.

Apparently it was too tempting to ignore. Jack Dongarra and his fellow
researchers at the Innovative Computing Laboratory (ICL) at the
University of Tennessee have devised algorithms which use single
precision arithmetic to do double precision work. Using this method,
they have demonstrated execution speedups that correspond closely with
the expected single precision performance characteristics of the
processors.
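
As I understand it, the structure behind that result is mixed-precision
iterative refinement: do the expensive O(n^3) factorization and solves
in single precision, then recover accuracy by computing residuals in
double precision and solving small correction systems. A toy C sketch
of the idea (my own code, not ICL's; a real implementation would keep
and reuse the single-precision factorization):

#include <stdio.h>

#define N 3

/* solve A x = b in SINGLE precision by Gaussian elimination
   (no pivoting; fine here because A is diagonally dominant) */
static void solve_single(float A[N][N], const float b[N], float x[N])
{
    float a[N][N], v[N];
    for (int i = 0; i < N; i++) {
        v[i] = b[i];
        for (int j = 0; j < N; j++)
            a[i][j] = A[i][j];
    }
    for (int k = 0; k < N; k++)                /* forward elimination */
        for (int i = k + 1; i < N; i++) {
            float m = a[i][k] / a[k][k];
            for (int j = k; j < N; j++)
                a[i][j] -= m * a[k][j];
            v[i] -= m * v[k];
        }
    for (int i = N - 1; i >= 0; i--) {         /* back substitution */
        float s = v[i];
        for (int j = i + 1; j < N; j++)
            s -= a[i][j] * x[j];
        x[i] = s / a[i][i];
    }
}

int main(void)
{
    /* small diagonally dominant system; exact solution is (1, 2, 3) */
    const double A[N][N] = {{4, 1, 0}, {1, 5, 2}, {0, 2, 6}};
    const double b[N]    = {6, 17, 22};

    float Af[N][N], bf[N], xf[N];
    for (int i = 0; i < N; i++) {
        bf[i] = (float)b[i];
        for (int j = 0; j < N; j++)
            Af[i][j] = (float)A[i][j];
    }

    double x[N];
    solve_single(Af, bf, xf);                  /* O(n^3) work in single */
    for (int i = 0; i < N; i++)
        x[i] = xf[i];

    for (int it = 0; it < 3; it++) {           /* cheap refinement loop */
        float rf[N], cf[N];
        for (int i = 0; i < N; i++) {          /* residual in DOUBLE */
            double r = b[i];
            for (int j = 0; j < N; j++)
                r -= A[i][j] * x[j];
            rf[i] = (float)r;
        }
        solve_single(Af, rf, cf);              /* correction in single */
        for (int i = 0; i < N; i++)
            x[i] += cf[i];
    }

    printf("x = %.15f %.15f %.15f\n", x[0], x[1], x[2]);
    return 0;
}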
 
Some of the most widely used processors for high-performance computing
today demonstrate much higher performance for 32-bit floating point
arithmetic (single precision) than for 64-bit floating point
arithmetic (double precision). These include the AMD Opteron, the
Intel Pentium, the IBM PowerPC, and the Cray X1. These architectures
demonstrate approximately twice the performance for single precision
execution when compared to double precision.

These must be the SSE numbers that they're talking about, because in
the x87 FPU there should be no difference between single- and
double-precision performance.

Yousuf Khan
 
These must be the SSE numbers that they're talking about, because in
the x87 FPU there should be no difference between single- and
double-precision performance.

Uhh, so how exactly do you propose to get 64 bits of data from memory
in the same amount of time that it takes to get 32 bits of data from
memory? :)
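
To put a rough number on that, here's a C sketch that streams y = 2*x
over large arrays in float and then in double. The double pass moves
twice the bytes per element, so once the loop is memory-bound it takes
roughly twice as long (exact timings depend on the machine and
compiler):

#include <stdio.h>
#include <stdlib.h>
#include <time.h>

#define N    (1 << 24)   /* 16M elements: 64 MB of floats, 128 MB of doubles */
#define REPS 8

int main(void)
{
    float  *xf = malloc(N * sizeof *xf), *yf = malloc(N * sizeof *yf);
    double *xd = malloc(N * sizeof *xd), *yd = malloc(N * sizeof *yd);
    if (!xf || !yf || !xd || !yd) return 1;

    for (long i = 0; i < N; i++) { xf[i] = (float)i; xd[i] = (double)i; }

    clock_t t0 = clock();
    for (int r = 0; r < REPS; r++)             /* ~128 MB of traffic per pass */
        for (long i = 0; i < N; i++) yf[i] = 2.0f * xf[i];
    clock_t t1 = clock();
    for (int r = 0; r < REPS; r++)             /* ~256 MB of traffic per pass */
        for (long i = 0; i < N; i++) yd[i] = 2.0 * xd[i];
    clock_t t2 = clock();

    printf("float passes:  %.2f s\n", (double)(t1 - t0) / CLOCKS_PER_SEC);
    printf("double passes: %.2f s\n", (double)(t2 - t1) / CLOCKS_PER_SEC);

    free(xf); free(yf); free(xd); free(yd);
    return 0;
}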

Even if we ignore the whole question of memory bandwidth (i.e., the
theoretical max where our working sets fit in cache, a pretty
meaningless figure for much HPC work), why shouldn't they be
referring to SSE/Altivec? SSE is almost always the fastest way to
handle floating point math on any x86 processor. As you correctly
state, you get twice the theoretical flops on single-precision vs.
double-precision values with SSE; however, even comparing just
double-precision throughput, SSE is still faster than x87.
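
For reference, the 2:1 ratio is visible right in the instruction set: a
128-bit SSE register holds four floats but only two doubles, so one
packed add retires four single-precision results versus two
double-precision results. A minimal sketch with SSE2 intrinsics:

#include <stdio.h>
#include <emmintrin.h>   /* SSE2 intrinsics */

int main(void)
{
    float  a[4] = {1, 2, 3, 4}, b[4] = {10, 20, 30, 40}, c[4];
    double x[2] = {1, 2},       y[2] = {10, 20},         z[2];

    /* one packed add: 4 single-precision results */
    __m128 vs = _mm_add_ps(_mm_loadu_ps(a), _mm_loadu_ps(b));
    _mm_storeu_ps(c, vs);

    /* one packed add: only 2 double-precision results */
    __m128d vd = _mm_add_pd(_mm_loadu_pd(x), _mm_loadu_pd(y));
    _mm_storeu_pd(z, vd);

    printf("SP: %g %g %g %g\n", c[0], c[1], c[2], c[3]);
    printf("DP: %g %g\n", z[0], z[1]);
    return 0;
}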

For the PowerPC with Altivec the difference might be even greater. I
believe some PPC chips can do 4 times as many theoretical
single-precision flops using Altivec as they can double-precision
flops using their standard floating point unit (Altivec doesn't do
double-precision last I checked).
 