Del said:
Is bisection bandwidth really a valid metric for very large clusters?
Yes, if you want to do FFTs, or, indeed, any kind of non-local
differencing.
It seems to me that it can be made arbitrarily small by configuring a
large enough group of processors, since each processor has a finite
number of links. For example, a 2D mesh with nearest-neighbor
connectivity has a bisection bandwidth that grows as the square root of
the number of processors. But the flops grow as the number of
processors. So the bandwidth per flop decreases with the square root
of the number of processors.
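To put rough numbers on that argument, here is a minimal sketch; the per-link bandwidth and per-node flop rate are assumed round figures, not measurements of any real machine:

    import math

    # Assumed round numbers for illustration, not measurements of any machine.
    LINK_BW = 1.0e9      # bytes/s per link
    NODE_FLOPS = 1.0e9   # flop/s per processor

    def bisection_bw(p):
        # Cutting a sqrt(p) x sqrt(p) mesh down the middle severs sqrt(p) links.
        return math.sqrt(p) * LINK_BW

    def bytes_per_flop(p):
        # Bisection bandwidth divided by aggregate flop rate.
        return bisection_bw(p) / (p * NODE_FLOPS)

    for p in (64, 1024, 16384, 262144):
        print(f"P={p:6d}  bisection={bisection_bw(p):.2e} B/s  "
              f"bytes/flop={bytes_per_flop(p):.2e}")

The bytes-per-flop column drops by a factor of two every time the processor count quadruples, i.e. as 1/sqrt(P).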
That's the problem with the architecture and why I howled so loudly
when it came out. Naturally, I was ridiculed by people whose entire
knowledge of computer architecture is nearest neighbor clusters.
Someone in New Mexico (LANL or Sandia, I don't want to dredge up the
presentation again) understands the numbers as well as I do. The
bisection bandwidth is a problem for a place like NCAR, which uses
pseudospectral techniques, as do most global atmospheric simulations.
The projected efficiency of Red Storm for FFTs was 25%. The
efficiency of Japan's Earth Simulator is at least several times that
for FFTs. No big deal: it was designed for geophysical simulations.
Blue Gene at Livermore was bought to produce the plots the Lab needed
to justify its own existence (and not to do science). As you have
correctly inferred, the more processors you hang off the
nearest-neighbor network, the worse the situation becomes.
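To see why FFTs in particular lean on the bisection: a distributed pseudospectral solve has to transpose the whole 3-D array between axes, and roughly half of it crosses the machine's midline each time. A back-of-the-envelope sketch, with an assumed grid size and link bandwidth (not figures for Red Storm, Blue Gene, or the Earth Simulator):

    import math

    # All numbers are assumptions for illustration, not figures for any
    # real machine.
    P = 10_000            # processors
    N = 2048              # grid points per dimension
    LINK_BW = 1.0e9       # bytes/s per link

    def transpose_time(bisection_bw):
        bytes_total = 8 * N**3           # one 8-byte value per grid point
        bytes_across = bytes_total / 2   # roughly half the array changes sides
        return bytes_across / bisection_bw

    mesh_bisection = math.sqrt(P) * LINK_BW   # grows as sqrt(P)
    flat_bisection = P * LINK_BW              # grows as P (e.g. a full fat-tree)
    print(f"mesh transpose:      {transpose_time(mesh_bisection):8.3f} s")
    print(f"full-bandwidth case: {transpose_time(flat_bisection):8.3f} s")

At fixed problem size, the transpose shrinks only as 1/sqrt(P) while the per-processor arithmetic shrinks as 1/P, so communication takes an ever larger share of the runtime as the machine grows.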
I can't think of why this wouldn't apply in general, but I don't claim
that it is true. It just seems that way to me (although the rate of
decrease wouldn't necessarily be the square root).
Unless you increase the aggregate bandwidth, you reach a point of
diminishing returns. The special nature of Linpack has allowed
unimaginative bureaucrats to make a career out of buying and touting
very limited machines that are the very opposite of being scalable.
"Scalability" does not mean more processors or real estate. It means
the ability to use the millionth processor as effectively as you use
the 65th. Genuine scalability is hard, which is why no one is really
bothering with it.
Apparently no one with money is interested in solving these special
problems for which clusters are not good enough. See SSI and Steve
Chen, history of.
The problems aren't as special as you think. In fact, the glaring
problem that I've pointed out with machines that rely on local
differencing isn't agenda or marketing driven, it's an unavoidable
mathematical fact. As things stand now, we will have ever more
transistors chuffing away on generating ever less reliable results.
The problem is this: if you use a sufficiently low-order differencing
scheme, you can do most of the problems of mathematical physics on a
box like Blue Gene. Low order schemes are easy to code, undemanding
with regard to non-local bandwidth, and usually much more stable than
very high-order schemes. If you want to figure out how to place an
air-conditioner, they're just fine. If you're trying to do physics,
the plots you produce will be plausible and beautiful, but very often
wrong.
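As a toy illustration of the accuracy gap (a sketch for illustration only, and a single smooth test function flatters the spectral side): differentiate a periodic function with a second-order centered difference and with an FFT-based spectral derivative, and compare the errors:

    import numpy as np

    # Differentiate a smooth periodic function two ways and compare errors.
    def centered_diff(f, dx):
        # 2nd-order centered difference on a periodic grid.
        return (np.roll(f, -1) - np.roll(f, 1)) / (2 * dx)

    def spectral_diff(f, length):
        # FFT-based derivative: multiply each mode by i*k.
        k = 2 * np.pi * np.fft.fftfreq(f.size, d=length / f.size)
        return np.real(np.fft.ifft(1j * k * np.fft.fft(f)))

    length = 2 * np.pi
    for n in (16, 32, 64):
        x = np.linspace(0, length, n, endpoint=False)
        f, exact = np.sin(x), np.cos(x)
        err_fd = np.max(np.abs(centered_diff(f, length / n) - exact))
        err_sp = np.max(np.abs(spectral_diff(f, length) - exact))
        print(f"n={n:3d}  2nd-order error={err_fd:.2e}  spectral error={err_sp:.2e}")

The low-order error only falls off as 1/n^2, while the spectral derivative is accurate to roundoff on a smooth periodic function.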
There is an out that, in fairness, I should mention. If you have
processors to burn, you can always overresolve the problem to the point
where the renormalization problem I've mentioned, while still there,
becomes unimportant. Early results by the biggest ego in the field at
the time suggested that it takes about ten times the resolution to do
fluid mechanics with local differencing as accurately as you can do it
with a pseudospectral scheme. In 3-D, that's a thousand times more
processors. For a fair comparison, the number of processors in the Livermore
box would be divided by 1000 to get equivalent performance to a box
that could do a decent FFT.
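Spelling out that arithmetic (the factor of ten per dimension is the figure quoted above; the processor count is a placeholder, not the actual Livermore machine):

    # The factor of ten per dimension is the figure quoted above; the
    # processor count is a placeholder, not the actual Livermore machine.
    ratio_per_dim = 10
    dims = 3
    grid_factor = ratio_per_dim ** dims     # 10**3 = 1000x the grid points
    assumed_procs = 100_000                 # hypothetical machine size
    print(grid_factor)                      # 1000
    print(assumed_procs // grid_factor)     # ~100 "FFT-equivalent" processors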
Should be posting to comp.arch so people there can switch from being
experts on computer architecture to being experts on numerical analysis
and mathematical physics.
Robert.