Scott said:
The reverse. The GPU can't do loop unrolling, since it controls the
entire iteration through the matrix being processed (it's implied
looping, to be precise). It was the AMD64 for which I had to do the
manual unrolling.
gcc is not your friend.
More than 10 years ago, when I was still a student, one of the PhD
students in the numerics department ran a matrix-multiply competition for
the PA-RISC CPUs we had on our workstations. He estimated that 30 MFLOPs
would be possible, even though a naive C loop got less than 1 MFLOP, and
the HP Fortran compiler with a built-in "extremely fast" matrix
multiplication got no more than 10 MFLOPs.
After doing some experiments, I did indeed get 30 MFLOPs out of the
thing, by doing several levels of blocking. The inner loop kept a small
submatrix accumulator (as much as would fit; I think I got 5x5 into the
registers), so that several rows and columns could be multiplied together
in one go (saving a lot of loads and stores). The next blocking level was
the (quite large) cache of the PA-RISC machine, i.e. subareas of both
matrices were multiplied together.
I never got around to making the matrix multiplication routine
general-purpose (the benchmark version could only multiply 512x512
matrices), but today, this sort of blocking is state of the art in high
performance numerical libraries. GCC isn't your friend, because loop
unrolling here is really the wrong approach. The inner loop I used just
did all the multiplications for the 5x5 submatrix, and no further
unrolling was necessary.