As Robert correctly points out, there are problems associated with
using lotsa COTS microprocessors as the basis for a supercomputer
(e.g. Red Storm). These problems are real.
All of us would rather have one processor and memory system that is
10,000 times faster than an Opteron and its DRAM memory. Alas, no such
device exists. In the real world, the only alternative on offer is the
vector processor a la Japan's Earth Simulator, at ~$600M per copy.
The available evidence suggests that most folks with a checkbook
believe the 10K+ COTS approach provides a better tradeoff than a
vector machine. Neither my opinion nor Robert's counts since neither
of us owns a large enough checkbook.
I respectfully disagree with Robert about ccNUMA being a marketing
gimmick. Red Storm **is** cache coherent. This is a fact, not an
opinion. Robert is free to suggest that ccNUMA is not a panacea -
nobody claims it is - but IMHO it's more than a gimmick.
For cache coherency to make sense as a useful concept (world according
to RM, obviously), remote latencies have to meet the requirement that
another poster imposed on them in comp.arch: they have to be comparable
to local latencies. That requirement is, in general, unreasonable and
unattainable for an MPP, which means that cache coherency for a NUMA
supercomputer is not a useful concept. That's in contrast to AMD's
original concept of a small (up to eight-way) cluster, where remote
latencies are a small multiple of local latencies and ccNUMA is
genuinely useful. All of this with one possible caveat.
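The latency arithmetic behind that distinction can be sketched in a few
lines. The numbers below are assumptions picked for illustration, not
measurements of any real machine:

```python
# Illustrative latency arithmetic (assumed numbers, not measured ones).
local_ns = 100        # assumed local DRAM access on an Opteron
ht_hop_ns = 50        # assumed added cost per HyperTransport hop
mpp_remote_ns = 5000  # assumed remote reference across an MPP mesh (5 us)

# Up-to-eight-way ccNUMA box: remote memory is at most a couple of hops.
ccnuma_ratio = (local_ns + 2 * ht_hop_ns) / local_ns
# MPP: a remote reference crosses a network measured in microseconds.
mpp_ratio = mpp_remote_ns / local_ns

print(ccnuma_ratio)  # 2.0 -- a small multiple of local latency
print(mpp_ratio)     # 50.0 -- coherence traffic at this ratio stalls you
```

At a 2x ratio, letting the hardware keep caches coherent is a win; at
50x, every coherence transaction is a stall you would rather have
managed explicitly.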
The caveat has to do with the actual mechanics of message passing
and/or RDMA. I don't think that remote memory reads and writes on Red
Storm are necessarily limited to MPI, and even if they were, writing to
a (remote) memory location is surely a lower-overhead operation than
writing to an I/O socket. At this level of detail, I am more than happy
to admit that I don't really know what I'm talking about.
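The overhead gap is easy to demonstrate on a single machine, using a
shared-memory buffer as a stand-in for an RDMA-style remote write and a
local socket pair as the I/O path. This is only a sketch of the general
point (a store into mapped memory avoids the per-operation syscall that
socket I/O pays), not a model of Red Storm's actual network:

```python
import socket
import time
from multiprocessing import shared_memory

N = 50_000
payload = b"x" * 64

# Path 1: a plain store into a mapped buffer, standing in for an
# RDMA-style put. No syscall per operation.
shm = shared_memory.SharedMemory(create=True, size=64)
t0 = time.perf_counter()
for _ in range(N):
    shm.buf[:64] = payload
t_mem = time.perf_counter() - t0

# Path 2: writing to an I/O socket -- every send/recv crosses the kernel.
a, b = socket.socketpair()
t0 = time.perf_counter()
for _ in range(N):
    a.send(payload)
    b.recv(64)
t_sock = time.perf_counter() - t0

a.close(); b.close()
shm.close(); shm.unlink()
print(f"memory write: {t_mem:.4f}s  socket write: {t_sock:.4f}s")
```

On any ordinary system the memory path comes out well ahead, which is
the whole argument for exposing remote memory directly rather than
funneling everything through a socket-like interface.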
Were I posting to newsgroups in Japanese, I probably would have been
jumping up and down, hooting and hollering about the economics of the
Earth Simulator. We don't know what the economics of the Cray SV2, aka
X-1, would be if it ever achieved significant market volume, but that's
all speculation: such a machine is probably never going to achieve
significant market volume.
A dense mesh network with one router and one garden-variety processor
per compute node (the architecture of both Blue Gene and Red Storm)
and an Earth Simulator style vector processor are not the only
possibilities. The Cray X-1 (aka SV2) is significantly more cost
effective than the ES. NSA special-order machines like the X-1 probably
won't make much of a dent in HPC even if a place like ORNL
occasionally breaks down and buys one, but that doesn't mean that
streaming architectures won't. Whether the DoE (which always has the
biggest checkbook) picks up on streaming architectures or not,
somebody else will.
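One cost the dense-mesh design pays is hop count: with one router per
compute node, worst-case latency grows with the machine's linear
dimensions. A toy calculation makes the scaling concrete (the mesh
dimensions here are made up for illustration, not Red Storm's or Blue
Gene's actual topology):

```python
# Worst-case hop count (diameter) of a 3D mesh with one router per node:
# a packet may need (k - 1) hops in each dimension to cross the machine.
def mesh_diameter(dims):
    return sum(d - 1 for d in dims)

# A hypothetical 10 x 10 x 10 mesh: 1000 nodes, 27 hops corner to corner.
print(mesh_diameter((10, 10, 10)))  # 27
```

Doubling the node count in every dimension only doubles the diameter,
which is why a mesh of cheap routers scales to 10K+ nodes at all.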
Always, of course, with the greatest of respect.
RM