First, an obligatory disclaimer: I'm still learning about Red Storm.
The following statements are based on the current state of my
knowledge. I'll try real hard to make some good guesses. ;-)
I don't post on Usenet for the pleasure of catching other people in
mistakes, and I don't think you do, either, so we can just have a
conversation.
The fact that DMA message passing into the local DRAM is used
necessitates that the data being overwritten be inconsequential.
Let me try a translation. If the programmer has not done something
dangerous, it should not, indeed cannot, matter that data are being
overwritten in local memory by DMA. Otherwise, running the code with
different, unrelated activity on the system could produce different
results. <end attempted translation>
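One standard way for the programmer to guarantee that the overwritten
data are inconsequential is double-buffering: the DMA engine fills one
receive buffer while the application drains the other, so no one is ever
reading a buffer that is being overwritten. A minimal sketch of the idea
(the names and the single-threaded simulation are mine, not anything
specific to Red Storm):

```python
# Double-buffering sketch: the "DMA" writer fills the idle buffer while
# the reader consumes the other, so overwritten data are never in use.

def run(messages, size=4):
    # Two receive buffers; their initial contents are inconsequential.
    buffers = [bytearray(size), bytearray(size)]
    consumed = []
    fill, drain = 0, 1
    for msg in messages:
        buffers[fill][:] = msg      # "DMA" overwrites the idle buffer
        fill, drain = drain, fill   # swap roles once the write completes
        consumed.append(bytes(buffers[drain]))  # safe: no DMA targets it
    return consumed
```

Every message lands in a buffer nobody is reading, which is exactly the
"inconsequential data" discipline being described.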
The only thing that is accomplished by "cache-coherency" in Red Storm
is that, should the processor by some chance be holding in cache a
piece of data being overwritten in main memory by DMA, the cache will
be updated at the same time.
Therefore, if any of the data being overwritten is in cache, replacing
it via snooping is also inconsequential.
Red alert! Whoop! Whoop! Translator banks drawing inconsistent
conclusions from external stimuli. I thought the whole point is that
the processor does snoop the DMA write and does update the cache.
The data being overwritten _must_ be inconsequential. It is the
programmers' task to make certain this is the case. Nobody ever said
programming a message-passing 10K+ CPU MPU was easy.
Well, not if you go about it the way most people do these days.
(What, he says, you think you know a better way? Yes, I think I do).
This is not my understanding, Robert. I'll try to keep this on-topic
about ccNUMA and not pursue this further.
Processor A and B have a copy of the same memory location. Processor
A and B by an unfortunate coincidence (and through the incompetence of
the programmer who allowed the situation to arise) decide to use the
data at the same time. Processor A uses the value to produce some
other result. Processor B changes the value. Processor A snoops the
change and corrects the value it has in cache, but a result from
Processor A is on its way elsewhere that would have been different had
the timing been just a little different. No different from what
happens in the ccNUMA case, as far as I can see.
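Coherence guarantees every cache eventually sees the new value, but it
cannot retract a value that has already been consumed. The interleaving
described above can be written out deterministically (the variable names
are mine; this is a simulation of the timing, not hardware behavior):

```python
# Simulate the interleaving: A reads, B writes, A's derived result is
# based on the value A read before the write, coherence notwithstanding.
shared = {"x": 10}

# Step 1: Processor A reads x into a "register" (its consumed copy).
a_copy = shared["x"]

# Step 2: Processor B overwrites x; coherence updates A's cache, but
# the copy A already consumed is beyond reach.
shared["x"] = 99

# Step 3: A produces a result from the value it read earlier.
result_from_a = a_copy * 2   # stale: based on 10, not 99

assert result_from_a == 20   # the "wrong" result is already on its way
assert shared["x"] == 99     # a re-read would see the coherent value
```

Had step 1 happened after step 2, the result would have been 198 — the
outcome depends on timing, which is the point being made.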
Snooping the passed message *does in fact* invalidate the
(inconsequential) data in your cache and updates it with a fresh copy.
So?
This conversation may not be about much more than:
ccNUMA for Red Storm is almost free.
It also doesn't really buy you much of anything, but since it's almost
free, it doesn't matter very much that it's almost worthless.
I read the above several times, Robert, and I still don't understand
what you're saying. This is probably my limitation.
Megacorp International has a credit line of x. Processor A and
processor B are each handling transactions for Megacorp International.
Processor B gets there first and puts a lock on memory location x.
Processor B changes the value of x and releases the lock
on that memory location. Processor A learns that the data can be
used, and does so without worrying about having to go out to memory to
fetch a fresh copy because the change was snooped and the value in its
cache updated in less time than it would take to do a memory fetch.
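The scenario above is an ordinary lock-protected read-modify-write; the
snoop-versus-refetch distinction is invisible at the program level. A
sketch of the pattern (names are mine, and Python threads stand in for
processors A and B):

```python
import threading

# Shared credit line for "Megacorp International" (hypothetical value).
credit_line = {"megacorp": 100_000}
lock = threading.Lock()

def charge(amount):
    # Take the lock, then read-modify-write the shared value. Without
    # the lock, two processors could both read the old balance.
    with lock:
        credit_line["megacorp"] -= amount

threads = [threading.Thread(target=charge, args=(1,)) for _ in range(100)]
for t in threads:
    t.start()
for t in threads:
    t.join()
```

After all 100 charges the balance is exactly 99,900: the lock serializes
the updates, whether the waiting processor snoops the new value or
fetches it fresh.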
On Red Storm, the minimum time for a lock that has to go onto the
network is 4 µs, compared to a memory fetch round-trip of under 200
ns. Whether you snoop the DMA or just fetch a fresh copy makes an
insignificant difference in the amount of time you can't use that data
on Red Storm. The one real payoff that I can see is that the lock
itself is a data item, and having the processor snoop the changed lock
as it arrives saves the processor from having to poll the lock to see
if the data can be used.
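The arithmetic behind that claim, using the figures quoted above:

```python
# Back-of-the-envelope comparison using the latencies quoted above.
lock_ns = 4_000   # 4 µs: minimum network round-trip for a lock
fetch_ns = 200    # under 200 ns: local memory fetch round-trip

ratio = lock_ns / fetch_ns
print(ratio)  # 20.0: the lock dominates; snoop vs. re-fetch is noise
```

The network lock costs at least 20 memory fetches, so saving one fetch
by snooping shaves at most a few percent off the total wait.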
You have just opened a brand-new can of worms. There are several
forms of NUMA. One is the Red Storm version, where each CPU has a
totally independent memory, accessible by other CPUs only via message
passing. The 4-way Opteron system is a completely different type of
NUMA since each CPU can address the other CPUs' memory. However, it
addresses the other memory at a different address. This means each
CPU's cache must snoop the other 3's memory, as well as its own.
Thus, the largish number of high-speed links.
Right, but the fact that the processors snoop one another's cache
makes a big difference _proportionally_ in how long data that have
been locked by one processor are unavailable for use by another
processor.
Wrong. Red Storm is absolutely perfectly cache coherent. There are
no corners or special cases where this is not true. 100% perfection.
The penalty is that only one CPU gets to send messages at a time. And
the programmer must avoid overwriting valid data when passing a
message.
Let's put it this way: I think Red Storm's cache-coherence has about as
much value as P4 detractors think hyperthreading has.
Whoa, Nellie! You mean, when the partition is moved, a swarm of
technicians physically removes or installs wiring? Huh??
When I was in the business (and I no longer am) a computer doing
classified work could have no network connections to any unclassified
environment. It isn't necessary for me to guess at how the
corresponding requirement can be met under current regulations, but
you may safely count on the impossibility of any message getting from
Red to Black or vice versa.
Absolutely correct. This is the problem with a message-passing MPU.
The unfortunate fact is, there is no practical way around the problem
of interconnecting 10K+ CPUs. Otherwise, everybody would use that
practical way, hmm?
There are these things called _switches_. The cost of _just_ the
switch for a Beowulf cluster, even one with fairly high-end compute
nodes, changes the total significantly when you go from a switchless
mesh to a switched network using a low-latency, high-bandwidth
interconnect.
Now the problem with _switches_ is that they significantly raise the
cost of the installation without significantly raising your Top 500
ranking.
Want to experience sticker shock? Price out an SGI Altix 3300 (ring)
vs Altix 3700 (switched).
Best Top 500 per dollar? Leave out the switches. Also has the neat
effect of maximizing IT staff at National Laboratories, because they
are working with an RSA (Really Stupid Architecture). You've heard me
talk about this before in a group larded with DoE vassals and
retainers. Boy do I get an unfriendly reception. Smarter
architecture, faster development, less money to vassals and retainers.
By swarms of technicians physically installing/removing wiring?
(Sorry, Robert, that was a cheap shot that I just couldn't resist. I
ain't perfect.)
If they could call them IT professionals and use it to inflate their
budgets and their staffs, they probably would. One cheap shot
deserves another, although this cheap shot was definitely not aimed at
you. ;-).
They're the only game in town. We'd all love to have equivalent
performance in a really fast one-CPU supercomputer, but that just
ain't possible. Alas.
_Not_ the only game in town. NASA doesn't buy boxes like that. NRL
doesn't buy boxes like that. NSA doesn't buy boxes like that. Only
the DoE, with its heavy thumb on national policy and tons of vassals
and retainers to justify big salaries for half-wit muckity-mucks buys
boxes like that. Oh, yes, and the DoE has an unseemly relationship
with IBM. How did Cray get into it? They had to do _something_ to
show apparent support for supposedly real HPC, as opposed to high
school shop projects and national subsidies to IBM.
SGI is working on boxes that have both vector and scalar processors.
Now _there's_ a thought. SGI, in all likelihood, will go safely out
of business before they can interfere with the favorites chosen by the
DoE. Wonder what SGI did wrong?
For one specific algorithm, it is sometimes *possible* (in principle)
to design the algorithm flow into hardware. There isn't enough money
in the world to pay for this for lotsa algorithms. Double alas. ;-)
No, but you can do one helluva lot better than racking up as many COTS
processors with as much cable as you can afford to buy.
RM