As Robert correctly points out, there are problems associated with
using lotsa COTS microprocessors as the basis for a supercomputer
(e.g. Red Storm). These problems are real.
All of us would rather have one processor and memory system that is
10,000 times faster than an Opteron and its DRAM memory. Alas, no such
device exists. In the real world, the only alternative on offer is the
vector processor a la Japan's Earth Simulator, at ~$600M per copy.
The available evidence suggests that most folks with a checkbook
believe the 10K+ COTS approach provides a better tradeoff than a
vector machine. Neither my opinion nor Robert's counts since neither
of us owns a large enough checkbook.
I respectfully disagree with Robert about ccNUMA being a marketing
gimmick. Red Storm **is** cache coherent. This is a fact, not an
opinion. Robert is free to suggest that ccNUMA is not a panacea -
nobody claims it is - but IMHO it's more than a gimmick.
For cache coherency to make sense as a useful concept (world according
to RM, obviously), remote latencies have to meet the requirement that
another poster imposed on them in comp.arch: they have to be comparable
to local latencies. That requirement is, in general, unreasonable and
unattainable for an MPP, which means that cache coherency for a NUMA
supercomputer is not a useful concept. That's in contrast to AMD's
original concept of a small (up to eight-way) cluster, where remote
latencies are a small multiple of local latencies and ccNUMA is
genuinely useful. All of this with one possible caveat.
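The latency arithmetic behind that distinction can be sketched in a few
lines. The numbers below are assumptions picked for illustration, not
measurements of any real machine:

```python
# Illustrative latency arithmetic (assumed numbers, not measured ones).
local_ns = 100        # assumed local DRAM access on an Opteron
ht_hop_ns = 50        # assumed added cost per HyperTransport hop
mpp_remote_ns = 5000  # assumed remote reference across an MPP mesh (5 us)

# Up-to-eight-way ccNUMA box: remote memory is at most a couple of hops.
ccnuma_ratio = (local_ns + 2 * ht_hop_ns) / local_ns
# MPP: a remote reference crosses a network measured in microseconds.
mpp_ratio = mpp_remote_ns / local_ns

print(ccnuma_ratio)  # 2.0 -- a small multiple of local latency
print(mpp_ratio)     # 50.0 -- coherence traffic at this ratio stalls you
```

At a 2x ratio, letting the hardware keep caches coherent is a win; at
50x, every coherence transaction is a stall you would rather have
managed explicitly.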
The caveat has to do with the actual mechanics of message passing
and/or RDMA. I don't think that remote memory reads and writes on Red
Storm are necessarily limited to MPI, and even if they were, writing to
a (remote) memory location is surely a lower-overhead operation than
writing to an I/O socket. At this level of detail, I am more than happy
to admit that I don't really know what I'm talking about.
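The overhead gap is easy to demonstrate on a single machine, using a
shared-memory buffer as a stand-in for an RDMA-style remote write and a
local socket pair as the I/O path. This is only a sketch of the general
point (a store into mapped memory avoids the per-operation syscall that
socket I/O pays), not a model of Red Storm's actual network:

```python
import socket
import time
from multiprocessing import shared_memory

N = 50_000
payload = b"x" * 64

# Path 1: a plain store into a mapped buffer, standing in for an
# RDMA-style put. No syscall per operation.
shm = shared_memory.SharedMemory(create=True, size=64)
t0 = time.perf_counter()
for _ in range(N):
    shm.buf[:64] = payload
t_mem = time.perf_counter() - t0

# Path 2: writing to an I/O socket -- every send/recv crosses the kernel.
a, b = socket.socketpair()
t0 = time.perf_counter()
for _ in range(N):
    a.send(payload)
    b.recv(64)
t_sock = time.perf_counter() - t0

a.close(); b.close()
shm.close(); shm.unlink()
print(f"memory write: {t_mem:.4f}s  socket write: {t_sock:.4f}s")
```

On any ordinary system the memory path comes out well ahead, which is
the whole argument for exposing remote memory directly rather than
funneling everything through a socket-like interface.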
Were I posting to newsgroups in Japanese, I probably would have been
jumping up and down, hooting and hollering about the economics of the
Earth Simulator. We don't know what the economics of the Cray SV2, aka
X-1, would be if it ever achieved significant market volume, but that's
all speculation: such a machine is probably never going to achieve
significant market volume.
A dense mesh network with one router and one garden-variety processor
per compute node (the architecture of both Blue Gene and Red Storm)
and an Earth Simulator style vector processor are not the only
possibilities. The Cray X-1 (aka SV2) is significantly more cost
effective than the ES. NSA special-order machines like the X-1 probably
won't make much of a dent in HPC even if a place like ORNL
occasionally breaks down and buys one, but that doesn't mean that
streaming architectures won't. Whether the DoE (which always has the
biggest checkbook) picks up on streaming architectures or not,
somebody else will.
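One cost the dense-mesh design pays is hop count: with one router per
compute node, worst-case latency grows with the machine's linear
dimensions. A toy calculation makes the scaling concrete (the mesh
dimensions here are made up for illustration, not Red Storm's or Blue
Gene's actual topology):

```python
# Worst-case hop count (diameter) of a 3D mesh with one router per node:
# a packet may need (k - 1) hops in each dimension to cross the machine.
def mesh_diameter(dims):
    return sum(d - 1 for d in dims)

# A hypothetical 10 x 10 x 10 mesh: 1000 nodes, 27 hops corner to corner.
print(mesh_diameter((10, 10, 10)))  # 27
```

Doubling the node count in every dimension only doubles the diameter,
which is why a mesh of cheap routers scales to 10K+ nodes at all.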
Always, of course, with the greatest of respect.
RM