SGI takes Itanium & Linux to 1024-way

Yousuf said:
A single Linux image running across a 1024 Itanium processor machine.

http://www.computerworld.com/hardwaretopics/hardware/story/0,10801,94564,00.html

"The users get one memory image they have to deal with," he [Pennington,
the interim director of NCSA] said. "This makes programming much easier,
and we expect it to give better performance as well."

Too early to call it a trend, but I'm encouraged to see the godfather of
the "Top" 500 list talking some sense as well:

callysto.hpcc.unical.it/hpc2004/talks/dongarra-survey.ppt

slides 37 and 38.

A single system image is no simple cure. It may not be a cure at all.
But it's encouraging that somebody is taking it seriously enough to
build a kilonode machine with a single address space.

"Scalability" being a challenge for such installations (you can't just
order more boxes and more cable and take another rural county out of
agricultural production to move "up" the "Top" 500 list) the premium is
on processors with high single-thread throughput.
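
To make Pennington's "one memory image" point concrete, here's a minimal
sketch (not from the article; it assumes OpenMP purely for illustration)
of what the programmer sees on a shared-memory machine: one array,
visible to every processor, with no explicit distribution or message
traffic.

    /* Minimal shared-memory sketch (illustration only; assumes OpenMP).
       Every processor sees the same arrays, so there is no explicit
       data distribution.  Compile with: cc -std=c99 -fopenmp relax.c */
    #include <stdio.h>
    #define N 1000000
    static double a[N], b[N];

    int main(void)
    {
        for (int i = 0; i < N; i++)
            a[i] = (double)i;

        /* one logical address space: each thread just indexes into it */
        #pragma omp parallel for
        for (int i = 1; i < N - 1; i++)
            b[i] = 0.5 * (a[i - 1] + a[i + 1]);

        printf("b[N/2] = %f\n", b[N / 2]);
        return 0;
    }

The cluster version of this same loop has to carve the arrays into
per-node pieces and trade boundary values by hand.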

RM
 
Robert Myers wrote:

[SNIP]
A single system image is no simple cure. It may not be a cure at all.
But it's encouraging that somebody is taking it seriously enough to
build a kilonode machine with a single address space.

Hats off to SGI, kilonode ssi is a neat trick. :)

Let's say you write code that makes use of a large single system
image machine. Let's say SGI fall behind the curve and you need
answers faster : Where can you go for another large single system
image machine ?

I see that kind of awfully clever machine as vendor lock-in waiting
to happen. If you want to avoid lock-in you end up writing your
code to the lowest common denominator, and in this case that will
probably remove any advantage gained by SSI (application depending
of course).

Cheers,
Rupert
 
Rupert said:
Let's say you write code that makes use of a large single system
image machine. Let's say SGI fall behind the curve and you need
answers faster : Where can you go for another large single system
image machine ?

What curve are we keeping up with these days?

The difference in scalability between the Altix and Blue Gene is
interesting mostly if you're trying to hit arbitrarily defined
milestones in a Gantt chart.

For hydro, a factor of ten in machine size is a 78% increase in number
of grid points available to resolve a given scale: whoop-de-ding. Maybe
there's something different about actinide-lanthanide decay series
that's worth understanding. I'll get around to it some time--even
though I strongly suspect I'm being led on a wild goose chase. The real
justification for the milestones on the Gantt chart of the last of the
big spenders is that a petaflop is a nice big round number for a goal.
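
For anyone checking the arithmetic behind that 78%: in time-dependent
3-D hydro the work scales roughly as the fourth power of the
per-dimension grid count (three space dimensions plus a CFL-limited
time step), so a machine W times bigger buys resolution as

    \mathrm{work} \propto n^{4} \;\Rightarrow\; n \propto W^{1/4},
    \qquad 10^{1/4} \approx 1.78

i.e. 78% more grid points along each dimension.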
I see that kind of awfully clever machine as vendor lock-in waiting
to happen. If you want to avoid lock-in you end up writing your
code to the lowest common denominator, and in this case that will
probably remove any advantage gained by SSI (application depending
of course).

Blue Gene is now not awfully clever? :-).

Commodity chip, flat address space. That sounds pretty vanilla to me.
How do you get more common than that? You can get an Itanium box with a
flat address space to your own personal work area much more readily than
you can get a Blue Gene.

There is no way not to leave you with the idea that I think single-image
machines are the way to go. I don't know that, and I'm not even certain
what course of investigation I would undertake to decide whether it's
the way to go or not. What I like about the single address space is that it
would appear to make the minimum architectural imposition on problem
formulation.

RM
 
Robert said:
How do you get more common than that? You can get an Itanium box with a
flat address space to your own personal work area much more readily than
you can get a Blue Gene.

Extend that argument further and you are buying Xeons.

The point is 1000 node machines with shared address spaces don't
fall out of trees. Who said anything about BlueGene anyways ?
There is no way not to leave you with the idea that I think single-image
machines are the way to go. I don't know that, and I'm not even certain

Over the long run I think it will be very hard to justify the
extra engineering and purchase cost over message passing gear.
what course of investigation I would undertake to decide whether it's
the way to go or not. What I like about the single address space is that it
would appear to make the minimum architectural imposition on problem
formulation.

People made a similar argument for CISC machines too. VAX
polynomial instructions come to mind. :)

Cheers,
Rupert
 
Rupert said:
Extend that argument further and you are buying Xeons.

There is a fair question that could be asked for almost any application
these days: why not ia-32 (probably with 64-bit extensions). When
you've got superlinear interconnect costs, you want each node to be as
capable as possible. The application of that argument to Itanium in
this particular case is wobbly, since the actual usefulness of
Itanium may be just as theoretical as the usefulness of the clusters
I've been worrying about.
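
As a toy illustration of the superlinear point (the quadratic cost law
here is assumed purely for argument's sake; a full crossbar scales that
way, and even a fat tree grows faster than linearly):

    C_{\mathrm{net}} \propto N^{2}: \quad N \to N/2
    \;\Rightarrow\; C_{\mathrm{net}} \to C_{\mathrm{net}}/4

Nodes twice as capable halve the node count and quarter that term,
which is the case for fat nodes.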
The point is 1000 node machines with shared address spaces don't
fall out of trees. Who said anything about BlueGene anyways ?

I did. Blue Gene was the best contrast I could think of to a single
image Itanium machine in terms of cost, energy efficiency, and
scalability. There is no fundamental reason why BlueGene couldn't
become widely used and accepted, but it probably won't be because it
won't show up in the workspace of your average graduate student or
postdoc.

Your question is what do we do when we need more than 1000 nodes. It's
a fair question, but not the only one you could ask. My questions are:
where does the software that runs on the big machine come from, in what
environment was it developed, at what cost, and with what opportunities
for continued development.
Over the long run I think it will be very hard to justify the
extra engineering and purchase cost over message passing gear.

Hardware is cheap, software is expensive. If we've run out of
interesting things to do with making processors astonishingly powerful
and inexpensive, we certainly haven't run out of interesting things to
do in making interconnect astonishingly powerful and inexpensive.
People made a similar argument for CISC machines too. VAX
polynomial instructions come to mind. :)

The RISC/CISC argument went away when microprocessors were developed
that could hide RISC execution behind a CISC programming model. The
neat hardware insight (RISC) did not, in the end, impose itself on
applications. No more should a particular hardware reality about
multi-processor machines impose itself on applications.

RM
 
Robert said:
Rupert Pigott wrote:
[SNIP]

I did. Blue Gene was the best contrast I could think of to a single
image Itanium machine in terms of cost, energy efficiency, and
scalability. There is no fundamental reason why BlueGene couldn't
become widely used and accepted, but it probably won't be because it
won't show up in the workspace of your average graduate student or postdoc.

Cluster style systems should be fairly easy to come by at that level.

Apps written for clusters should port to another cluster system more
easily than apps written for a shared memory system to a cluster. It's a
matter of choice over the long run... If you use the unique features
of a kilonode Itanium box then you're basically locked-in. Clearly
this is not an issue for some establishments, Cray customers are a
good example. :P
Your question is what do we do when we need more than 1000 nodes. It's
a fair question, but not the only one you could ask. My questions are:
where does the software that runs on the big machine come from, in what
environment was it developed, at what cost, and with what opportunities
for continued development.

Of course. Like I said I don't see 1000 node ssi machines falling
out of trees. I do see depts assembling a few hundred beige boxes
and a nightmare hodgepodge of switches though. :)
Hardware is cheap, software is expensive. If we've run out of
interesting things to do with making processors astonishingly powerful
and inexpensive, we certainly haven't run out of interesting things to
do in making interconnect astonishingly powerful and inexpensive.

I think that is well in hand to be honest. Plenty of options there
but everyone ends up going Ethernet anyways. :P
The RISC/CISC argument went away when microprocessors were developed
that could hide RISC execution behind a CISC programming model. The

I doubt this would have gone very far without the highly visible
captive market (ie: WINTEL desktop).

I have indirectly acknowledged that NUMA machines can in principle do
the same job as clusters (it would be silly not to). They implement a
superset of the comms functionality required by clusters.

HOWEVER... There are some differences in the market place. The captive
market is rather small, there is less money to develop whizz bang
solutions and amortize the cost than there was with x86. What we have
seen are folks who can't afford to splash a few $1m on a box building
clusters that are "good enough" and that is how the market has been
broadened. The SSI machines don't have the stranglehold on the market
that x86 did.

Opteron is interesting because it is sort of halfway there, but still
constrained to small processor count machines. HT is not a spec that
you can DL over the web and peruse at your leisure, but I note that
folks who have are not particularly happy with its error handling.
Apparently it just crashes and burns so you have to reset the sucker,
which is UNACCEPTABLE on a SSI machine. Think of the fun^Wchallenge
in handling that in the OS and applications.

If they fix HT or someone proves to me that it recovers fine, then
maybe I'll change my opinion. In the meanwhile I think interconnect for beige
boxes will get even better, as icky as that might be... And yes, I
*do* understand that some apps really don't fit clusters well, fair
play, go tug your forelock at SGI's door. :)

Cheers,
Rupert
 
Rupert said:
Robert said:
Rupert Pigott wrote:

[SNIP]

I did. Blue Gene was the best contrast I could think of to a single
image Itanium machine in terms of cost, energy efficiency, and
scalability. There is no fundamental reason why BlueGene couldn't
become widely used and accepted, but it probably won't be because it
won't show up in the workspace of your average graduate student or
postdoc.


Cluster style systems should be fairly easy to come by at that level.

They are, indeed, and they are widely used.
Apps written for clusters should port to another cluster system more
easily than apps written for a shared memory system to a cluster.

You are apparently arguing for the desirability of folding the
artificial computational boundaries of clusters into software. If
that's a necessity of life, I can learn to live with it, but I'm having
a hard time seeing it as desirable. We are so fortunate as to live in a
universe that presents itself to us in midtower-sized chunks? I'm
worried. ;-).
It's a
matter of choice over the long run... If you use the unique features
of a kilonode Itanium box then you're basically locked-in. Clearly
this is not an issue for some establishments, Cray customers are a
good example. :P

Can you give an example of something that you think would happen?
Of course. Like I said I don't see 1000 node ssi machines falling
out of trees. I do see depts assembling a few hundred beige boxes
and a nightmare hodgepodge of switches though. :)

Well, you said it; I didn't. If you have an environment where more
flops are an end in themselves--and we do have such an environment--then
you don't have to worry about how much productivity your nightmare
produces as long as the photo in the alumni newsletter looks convincing.
Even more depressing, if your goal is to crank out papers and Ph.D.
theses, you may do pretty well with beige boxes and cheap labor and have
very little impact on applied science and technology, because people
trying to solve real world problems can't wait for a grad student and a
post doc to spend a semester getting the cluster shaken down, and even
if they could it wouldn't make any economic sense because the labor
costs are too high.
I think that is well in hand to be honest. Plenty of options there
but everyone ends up going Ethernet anyways. :P

I really do think, now that PCI Express is here, that the day of
infiniband, at least for this particular space, is finally at hand.

I was actually imagining that there is really nothing to keep the
prerequisites for a single image box from becoming more of a commodity.

HOWEVER... There are some differences in the market place. The captive
market is rather small, there is less money to develop whizz bang
solutions and amortize the cost than there was with x86. What we have
seen are folks who can't afford to splash a few $1m on a box building
clusters that are "good enough" and that is how the market has been
broadened. The SSI machines don't have the stranglehold on the market
that x86 did.
I take the current market fragmentation as confirmation of my world view
that none of the tools we currently possess are really all that good. ;-).

There is a national lab presentation that argues rather touchingly that
supercomputers really can produce results that are qualitatively better
than workstations. You think that successful bureaucrat would even have
brought it up if he hadn't been challenged on the matter?

The optimistic view is that the chaos we currently see is the HPC
equivalent of the pre-Cambrian explosion and that natural selection will
eventually give us a mature and widely-adopted architecture. My purpose
in starting this discussion was simply to opine that single image
architectures have some features that make them seem promising as a
survivor--not a widely-held view, I think.
Opteron is interesting because it is sort of halfway there, but still
constrained to small processor count machines. HT is not a spec that
you can DL over the web and peruse at your leisure, but I note that
folks who have are not particularly happy with its error handling.
Apparently it just crashes and burns so you have to reset the sucker,
which is UNACCEPTABLE on a SSI machine. Think of the fun^Wchallenge
in handling that in the OS and applications.

If they fix HT or someone proves to me that it recovers fine, then
maybe I'll change my opinion. In the meanwhile I think interconnect for beige
boxes will get even better, as icky as that might be... And yes, I
*do* understand that some apps really don't fit clusters well, fair
play, go tug your forelock at SGI's door. :)

Geez, Rupert, they couldn't possibly be as bad as IBM used to be. :-).
I can live with clusters. It may be that living with clusters is an
inevitable necessity. I'm not yet ready to give up on a single address
space, though.

RM
 
Robert said:
Rupert said:
Robert said:
Rupert Pigott wrote:


[SNIP]

I did. Blue Gene was the best contrast I could think of to a single
image Itanium machine in terms of cost, energy efficiency, and
scalability. There is no fundamental reason why BlueGene couldn't
become widely used and accepted, but it probably won't be because it
won't show up in the workspace of your average graduate student or
postdoc.



Cluster style systems should be fairly easy to come by at that level.


They are, indeed, and they are widely used.
Apps written for clusters should port to another cluster system more
easily than apps written for a shared memory system to a cluster.


You are apparently arguing for the desirability of folding the
artificial computational boundaries of clusters into software. If

That happens with SSI systems too. There is a load of information that
has been published about scaling on SGI's Origin machines over the
years. IIRC Altix is based on the same Origin 3000 design. You may
remember that I quizzed Rob Warnock on this, he said that there were
in practice little gotchas that tend to crop up at particular #'s of
procs. He even noted that the gotcha processor counts tended to change
with the particular generation of Origin.
that's a necessity of life, I can learn to live with it, but I'm having
a hard time seeing it as desirable. We are so fortunate as to live in a
universe that presents itself to us in midtower-sized chunks? I'm
worried. ;-).

In my mind it's a question of fitting our computing effort to reality
as opposed to living in an Ivory Tower. Some goals, while worthy,
desirable, or even partially achievable, are basically impossible to
achieve in reality. A genuinely *flat* address space is impossible
right here and now. That SSI Altix box will *not* have a *flat* address
space in terms of time. It is a NUMA machine. :)
Can you give an example of something that you think would happen?

Depends on the app. Stuff like memory mapping one large file for read
and occasional write could cause some fantastic locking + latency
issues when it comes to porting. :)
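
A minimal sketch of the idiom I mean (the file name and access pattern
are invented; plain POSIX): the whole file sits in one address space,
everyone reads it, somebody occasionally writes. On a NUMA/SSI box that
write is just a store plus whatever coherence traffic follows; a
cluster port has to reinvent it with explicit locks and messages.

    /* Illustrative sketch of the mmap idiom under discussion (POSIX).
       "data.bin" and the write pattern are invented for the example. */
    #include <fcntl.h>
    #include <stdio.h>
    #include <sys/mman.h>
    #include <sys/stat.h>
    #include <unistd.h>

    int main(void)
    {
        int fd = open("data.bin", O_RDWR);
        if (fd < 0) { perror("open"); return 1; }

        struct stat st;
        if (fstat(fd, &st) < 0) { perror("fstat"); return 1; }

        /* map the whole file; every process sharing the mapping sees it */
        double *p = mmap(NULL, st.st_size, PROT_READ | PROT_WRITE,
                         MAP_SHARED, fd, 0);
        if (p == MAP_FAILED) { perror("mmap"); return 1; }

        size_t n = st.st_size / sizeof(double);
        double sum = 0.0;
        for (size_t i = 0; i < n; i++)   /* mostly reads... */
            sum += p[i];
        if (n > 0)
            p[0] = sum;                  /* ...the occasional write: this
                                            is what gets expensive across
                                            a NUMA fabric, and what a
                                            cluster port must re-plumb */
        msync(p, st.st_size, MS_SYNC);
        munmap(p, st.st_size);
        close(fd);
        printf("sum = %f\n", sum);
        return 0;
    }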

[SNIP]
Even more depressing, if your goal is to crank out papers and Ph.D.
theses, you may do pretty well with beige boxes and cheap labor and have
very little impact on applied science and technology, because people
trying to solve real world problems can't wait for a grad student and a
post doc to spend a semester getting the cluster shaken down, and even
if they could it wouldn't make any economic sense because the labor
costs are too high.

Shaking down large + fast machines has traditionally been a costly
and risky business. Look at all those machines that spent hours
with grads all over them and didn't really make an impact; I'm thinking
of stuff like the bigger ETAs, and the TM-5s didn't seem to do much either.

Shaking down Crays took some time too, although to be fair they do
have a good rep for reliability once set up. However Crays are toys
by comparison to contemporary big systems (component count etc)...

In terms of sorting out clusters and stuff there is obviously a
niche there, from what I read it appears to be getting filled too.
I really do think, now that PCI Express is here, that the day of
infiniband, at least for this particular space, is finally at hand.

Yeah, interconnect is catching up at bloody long last. You will always
have latency problems while we're communicating at less than c, though,
regardless of whether you present your network to the application
as a single address space or not.
I was actually imagining that there is really nothing to keep the
prerequisites for a single image box from becoming more of a commodity.

I mentioned Opteron; if HT really does suffer from crash+burn on
comms failure then it is holding itself back. If that ain't the
case I'd have figured that a tiny form factor Opteron + DRAM +
router cards would be a reasonable component for high-density
clusters and beige SSI machines. You'd need some facility for
driving some links for longer distances than HT currently allows
too ($$$). The next thing holding you back is tuning the OS + Apps
to a myriad of possible configurations... :(

[SNIP]
The optimistic view is that the chaos we currently see is the HPC
equivalent of the pre-Cambrian explosion and that natural selection will
eventually give us a mature and widely-adopted architecture. My purpose
in starting this discussion was simply to opine that single image
architectures have some features that make them seem promising as a
survivor--not a widely-held view, I think.

I'm sure they'll have their place. But in the long run I think that
PetaFLOP pressure will tend to push people towards message passing
style machines. Consider this though: the Internet is becoming more and
more prominent in daily life. The Spooks must have a fair old time
keeping up with the sheer volume of data flowing around the globe.
Distributed processing is a natural fit here, SSI machines just would
not make sense. More and more governments and their civil servants
will want to make use of this surveillance resource too, check out
the rate at which legislation is legitimising their intrusion on the
individual's privacy. The War on Terror has added more fuel to that
growth market too. :)
Geez, Rupert, they couldn't possibly be as bad as IBM used to be. :-).

Probably not, because they are a niche player beholden to a few very
powerful customers.
I can live with clusters. It may be that living with clusters is an
inevitable necessity. I'm not yet ready to give up on a single address
space, though.

Fair enough. Just don't hold your breath waiting for a kilonode SSI
machine to fall into your lap. :)

Cheers,
Rupert
 
Rupert said:
That happens with SSI systems too. There is a load of information that
has been published about scaling on SGI's Origin machines over the
years. IIRC Altix is based on the same Origin 3000 design. You may
remember that I quizzed Rob Warnock on this, he said that there were
in practice little gotchas that tend to crop up at particular #'s of
procs. He even noted that the gotcha processor counts tended to change
with the particular generation of Origin.



In my mind it's a question of fitting our computing effort to reality
as opposed to living in an Ivory Tower. Some goals, while worthy,
desirable, or even partially achievable, are basically impossible to
achieve in reality. A genuinely *flat* address space is impossible
right here and now. That SSI Altix box will *not* have a *flat* address
space in terms of time. It is a NUMA machine. :)

Well, yes, it is. The spread in latencies is more like half a
microsecond, as opposed to five microseconds for the latest and greatest
of the DoE build-to-order specials.

On the question of Ivory Towers vs. reality, I believe that I am on the
side of the angels, naturally. If you believe the right question really
is: "What's the least expensive way we can get a high Linpack score?",
then clusters are a slam dunk, but I don't think that anybody worth
talking to on the subject really thinks that's the right question to be
asking.

As to access to 1000-node and even bigger machines, I don't need them.
What I need is to know what kind of machine a code is likely to run on
when somebody decides an NCSA-type installation is required.

How you will _ever_ scale _anything_ to the kinds of memory and
compute requirements required to do even some very pedestrian problems
properly is my real concern, and, from that point of view, no
architecture currently on the table, short of specialized hardware, is
even in the right universe.

Given that _nothing_ currently available can really do the physics
right--with the possible exception of things like the Cell-like chips
the Columbia QCD people are using--and that nothing currently available
really scales in a way that I can imagine, I'm inclined to give heavy
emphasis to usability.
Depends on the app. Stuff like memory mapping one large file for read
and occasional write could cause some fantastic locking + latency
issues when it comes to porting. :)

I understand just enough about operating systems to know that building a
1000-node image that runs on realizable hardware is a real
tour-de-force. I also understand that you can take off-the-shelf copies
of, say, RedHat Linux, and some easily-obtainable clustering software
and (probably) get a thousand beige boxes to run like a kilonode
cluster. Someone else (Linus, SGI, et al) wrote the Altix OS. Someone
else (Linus, RedHat, et al) wrote the OS for the cluster nodes. I don't
want to fiddle with either one. You want me to believe that I am better
off synchronizing processes and exchanging data across infiniband stacks
and through trips in and out of kernel and user space and with heaven
only knows how many control handoffs for each exchange than I am reading
and writing to my own user space under the control of a single OS, and I
just don't.
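
For contrast, here is roughly what the path I'm objecting to looks like
in code (a sketch, assuming MPI; nothing about it is specific to any
one interconnect): every step, each rank must stop and trade boundary
values with its neighbours before it can touch its own array.

    /* Sketch of the explicit exchange under discussion (assumes MPI;
       compile with mpicc).  Each rank owns a slab of the grid and must
       trade ghost cells with its neighbours every step. */
    #include <mpi.h>
    #include <stdio.h>
    #define N 1000

    int main(int argc, char **argv)
    {
        MPI_Init(&argc, &argv);
        int rank, size;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);

        double a[N + 2];              /* local slab plus two ghost cells */
        for (int i = 1; i <= N; i++)
            a[i] = rank + i * 1e-6;
        int up   = (rank + 1) % size;
        int down = (rank - 1 + size) % size;

        /* the part a single address space would have made implicit:
           swap ghost cells with both neighbours */
        MPI_Sendrecv(&a[N], 1, MPI_DOUBLE, up,   0,
                     &a[0], 1, MPI_DOUBLE, down, 0,
                     MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        MPI_Sendrecv(&a[1],     1, MPI_DOUBLE, down, 1,
                     &a[N + 1], 1, MPI_DOUBLE, up,   1,
                     MPI_COMM_WORLD, MPI_STATUS_IGNORE);

        for (int i = 1; i <= N; i++)
            a[i] = 0.5 * (a[i - 1] + a[i + 1]);

        if (rank == 0)
            printf("a[1] = %f\n", a[1]);
        MPI_Finalize();
        return 0;
    }

On the single-image machine the same update is a plain loop over a
shared array; the exchange above is what all those stack traversals and
kernel transitions get invoked for.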

I mentioned Opteron; if HT really does suffer from crash+burn on
comms failure then it is holding itself back. If that ain't the
case I'd have figured that a tiny form factor Opteron + DRAM +
router cards would be a reasonable component for high-density
clusters and beige SSI machines. You'd need some facility for
driving some links for longer distances than HT currently allows
too ($$$). The next thing holding you back is tuning the OS + Apps
to a myriad of possible configurations... :(

I'm guessing that, the promise of Opteron for HPC notwithstanding, HT is
going to be marginalized by PCI Express/Infiniband.
[SNIP]
The optimistic view is that the chaos we currently see is the HPC
equivalent of the pre-Cambrian explosion and that natural selection
will eventually give us a mature and widely-adopted architecture. My
purpose in starting this discussion was simply to opine that single
image architectures have some features that make them seem promising
as a survivor--not a widely-held view, I think.


I'm sure they'll have their place. But in the long run I think that
PetaFLOP pressure will tend to push people towards message passing
style machines. Consider this though: the Internet is becoming more and
more prominent in daily life. The Spooks must have a fair old time
keeping up with the sheer volume of data flowing around the globe.
Distributed processing is a natural fit here, SSI machines just would
not make sense. More and more governments and their civil servants
will want to make use of this surveillance resource too, check out
the rate at which legislation is legitimising their intrusion on the
individual's privacy. The War on Terror has added more fuel to that
growth market too. :)
Nothing that _I_ say about distributed processing is going to slow it
down, that's for sure, and that isn't my intent. If you've got a
google-type task, you should use google-type hardware. Computational
physics is not a google-type task.

RM
 
Robert said:
Rupert Pigott wrote:
[SNIP]

Well, yes, it is. The spread in latencies is more like half a
microsecond, as opposed to five microseconds for the latest and greatest
of the DoE build-to-order specials.

I find a claim of 500ns very hard to believe given the physical size of
the machine... I suppose they could cheat and slow down all accesses to
within 500ns of the worst case, but I don't believe SGI would compromise
in that way. The DoE build-to-order specials were considerably larger
when I looked at them last and that would make a significant difference
even before you took interconnect into account.
On the question of Ivory Towers vs. reality, I believe that I am on the
side of the angels, naturally. If you believe the right question really
is: "What's the least expensive way we can get a high Linpack score?",
then clusters are a slam dunk, but I don't think that anybody worth
talking to on the subject really thinks that's the right question to be
asking.

It's a question of which route is going to provide the solutions over
the long haul. NUMA/SSI has to solve the exact same problems as Message
Passing, just that it hides them from the programmer (in theory). As a
programmer I hate stuff that's swept under the carpet, as it usually
trips me up sometime later.

I had this debate with a friend who was convinced that threads were the
way of the future... He ran into a wall pretty quickly and decided that
they were OK up to a point because he ended up having to go coding in a
message passing style despite using a thread mechanism. Performance and
malleability were the key issues for his relatively modest problem.
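
Presumably he converged on something like this (a sketch, assuming
POSIX threads; the details are invented): shared memory underneath, but
all communication funnelled through an explicit queue, which is message
passing in all but name.

    /* Message-passing style on top of threads (sketch; assumes POSIX
       threads).  The threads share an address space but communicate
       only through this explicit bounded queue. */
    #include <pthread.h>
    #include <stdio.h>
    #define QSIZE 16

    struct queue {
        int buf[QSIZE];
        int head, tail, count;
        pthread_mutex_t lock;
        pthread_cond_t nonempty, nonfull;
    };

    static void q_send(struct queue *q, int msg)
    {
        pthread_mutex_lock(&q->lock);
        while (q->count == QSIZE)
            pthread_cond_wait(&q->nonfull, &q->lock);
        q->buf[q->tail] = msg;
        q->tail = (q->tail + 1) % QSIZE;
        q->count++;
        pthread_cond_signal(&q->nonempty);
        pthread_mutex_unlock(&q->lock);
    }

    static int q_recv(struct queue *q)
    {
        pthread_mutex_lock(&q->lock);
        while (q->count == 0)
            pthread_cond_wait(&q->nonempty, &q->lock);
        int msg = q->buf[q->head];
        q->head = (q->head + 1) % QSIZE;
        q->count--;
        pthread_cond_signal(&q->nonfull);
        pthread_mutex_unlock(&q->lock);
        return msg;
    }

    static struct queue q = {
        .lock     = PTHREAD_MUTEX_INITIALIZER,
        .nonempty = PTHREAD_COND_INITIALIZER,
        .nonfull  = PTHREAD_COND_INITIALIZER,
    };

    static void *worker(void *arg)
    {
        (void)arg;
        for (int i = 0; i < 5; i++)
            q_send(&q, i);
        q_send(&q, -1);              /* "done" message */
        return NULL;
    }

    int main(void)
    {
        pthread_t t;
        pthread_create(&t, NULL, worker, NULL);
        for (int msg; (msg = q_recv(&q)) != -1; )
            printf("got %d\n", msg);
        pthread_join(t, NULL);
        return 0;
    }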
As to access to 1000-node and even bigger machines, I don't need them.
What I need is to know what kind of machine a code is likely to run on
when somebody decides an NCSA-type installation is required.

How you will _ever_ scale _anything_ to the kinds of memory and
compute requirements required to do even some very pedestrian problems
properly is my real concern, and, from that point of view, no
architecture currently on the table, short of specialized hardware, is
even in the right universe.

Given that _nothing_ currently available can really do the physics
right--with the possible exception of things like the Cell-like chips
the Columbia QCD people are using--and that nothing currently available
really scales in a way that I can imagine, I'm inclined to give heavy
emphasis to usability.

Last time I checked BlueGene/L and QCD shared people, design and
expertise. No surprise they are similar to Cell in your estimation.
I don't really believe in silver bullets, I have come to accept that
there is no one true way to build MPP machines. Another way of putting
it is that General Purpose machinery usually sucks for pushing the
limits of a particular field.

[SNIP]
want to fiddle with either one. You want me to believe that I am better
off synchronizing processes and exchanging data across infiniband stacks

Hell yeah. Programmer nearly always has more domain knowledge than
the Compiler, OS, Interconnect and Processor. Why not use it ?
and through trips in and out of kernel and user space and with heaven
only knows how many control handoffs for each exchange than I am reading
and writing to my own user space under the control of a single OS, and I
just don't.

I don't think that is necessary. In fact I know it is not necessary,
I had 100+ processes per processor in a 300 node grid back in the 90s
and it was old hat then. No OS, no Ethernet, no TCP/IP, no Infiniband
was necessary.

TCP/IP & Ethernet (insert world+dog problem solving interconnect de
jour) uber alles is not helping anyone.

[SNIP]
I'm guessing that, the promise of Opteron for HPC notwithstanding, HT is
going to be marginalized by PCI Express/Infiniband.

Sigh... Just more guff in the way of sanity and lightweight comms.
If I was working at an outfit mucking with this kind of gear I'd wear
a T-Shirt with "CUT THE CRAP" on it. :)

SGI/Alpha 21364 get their latency figures by not trying to solve world
+ dog's problems with their interconnect. The interconnect is purpose
built for the job. The performance is *not* a function of NUMA/SSI, it
is a pre-requisite for NUMA/SSI. That is *precisely* where QCD/BlueGene
are coming from too. Think about it...


Cheers,
Rupert
 
Rupert said:
Robert Myers wrote:


[SNIP]
Well, yes, it is. The spread in latencies is more like half a
microsecond, as opposed to five microseconds for the latest and
greatest of the DoE build-to-order specials.


I find a claim of 500ns very hard to believe given the physical size of
the machine... I suppose they could cheat and slow down all accesses to
within 500ns of the worst case, but I don't believe SGI would compromise
in that way. The DoE build-to-order specials were considerably larger
when I looked at them last and that would make a significant difference
even before you took interconnect into account.
A NASA press release from last November
http://www.arc.nasa.gov/aboutames-pressrelease.cfm?id=10000087 states
the worst-case communication latency to be "less than a microsecond" for
a 512-processor Altix. I've had my hands on sharper numbers, but I
can't find them on the instant. The physical size of the machine can't
be _that_ much of an issue: 3x10^8 m/s x 10^-6 s/us = 300 m/us.
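
Worked through with invented but plausible numbers: even at half light
speed in cable, a 30 m run costs only about

    t = d / v \approx \frac{30\ \mathrm{m}}{1.5 \times 10^{8}\ \mathrm{m/s}}
      = 200\ \mathrm{ns}

one way, so a sub-microsecond worst case across 512 processors is not
physically absurd; the router hops, not the wire, set the floor.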

It's a question of which route is going to provide the solutions over
the long haul. NUMA/SSI has to solve the exact same problems as Message
Passing, just that it hides them from the programmer (in theory). As a
programmer I hate stuff that's swept under the carpet, as it usually
trips me up sometime later.

You have to lay the carpet somewhere. The question is: which details to
hide and which details to force the user to worry over. We may not
agree about what to hide, but you have to hide something, and people
have plenty enough to think about without adding details that really
have nothing to do with the actual calculation...
I had this debate with a friend who was convinced that threads were the
way of the future... He ran into a wall pretty quickly and decided that
they were OK up to a point because he ended up having to go coding in a
message passing style despite using a thread mechanism. Performance and
malleability were the key issues for his relatively modest problem.

...and I find a featureless computational space attractive, even if the
featurelessness is factitious. If the programming models were settled
and the software tools mature, I might not think that making unnecessary
details visible was such an imposition, but the programming models
aren't settled and the software tools aren't mature.

I'd rather have someone lay a computational model on a plain background,
rather than justify the computational model as appropriate because it's
what the hardware dictates--and that _is_ what has happened with
BlueGene and RedStorm.

Hell yeah. Programmer nearly always has more domain knowledge than
the Compiler, OS, Interconnect and Processor. Why not use it ?

Because there are so many other things to think about.
and through trips in and out of kernel and user space and with heaven
only knows how many control handoffs for each exchange than I am
reading and writing to my own user space under the control of a single
OS, and I just don't.


I don't think that is necessary. In fact I know it is not necessary,
I had 100+ processes per processor in a 300 node grid back in the 90s
and it was old hat then. No OS, no Ethernet, no TCP/IP, no Infiniband
was necessary.

TCP/IP & Ethernet (insert world+dog problem solving interconnect de
jour) uber alles is not helping anyone.

[SNIP]
I'm guessing that, the promise of Opteron for HPC notwithstanding, HT
is going to be marginalized by PCI Express/Infiniband.


Sigh... Just more guff in the way of sanity and lightweight comms.
If I was working at an outfit mucking with this kind of gear I'd wear
a T-Shirt with "CUT THE CRAP" on it. :)

Lightweight threads, lightweight comms. All possible, I guess. The
people who have the resources to provide the leadership don't seem to
find the enterprise interesting. Look at the fight for survival
infiniband has had.
SGI/Alpha 21364 get their latency figures by not trying to solve world
+ dog's problems with their interconnect.
The interconnect is purpose
built for the job. The performance is *not* a function of NUMA/SSI, it
is a pre-requisite for NUMA/SSI. That is *precisely* where QCD/BlueGene
are coming from too. Think about it...

Exactly so. If you want to get a single image to run on a kilonode,
your comms have to be pretty slick. As a result, you won't have to pay
too much attention to lame stories about "nearest neighbor" comms. I
like that.

RM
 
Robert said:
Rupert Pigott wrote:
[SNIP]

A NASA press release from last November
http://www.arc.nasa.gov/aboutames-pressrelease.cfm?id=10000087 states
the worst-case communication latency to be "less than a microsecond" for
a 512-processor Altix. I've had my hands on sharper numbers, but I
can't find them on the instant. The physical size of the machine can't
be _that_ much of an issue: 3x10^8 m/s x 10^-6 s/us = 300 m/us.

Still sounds unlikely to me. There aren't any 512 node Altixen
falling out of the trees round here so I am unable to independently
confirm or deny their results. :)
...and I find a featureless computational space attractive, even if the

Who doesn't ?
featurelessness is factitious. If the programming models were settled

Ah, now there we differ. I have been bitten by too many corner
cases, too much erroneous behaviour and far too often by insufficiently
spec'd systems. It's just not funny anymore, and the root cause 9/10
times is the vendor trying to do too much.
and the software tools mature, I might not think that making unnecessary
details visible was such an imposition, but the programming models
aren't settled and the software tools aren't mature.

Why should hardware solve that for software ? It's a software problem,
not a hardware one.
I'd rather have someone lay a computational model on a plain background,
rather than justify the computational model as appropriate because it's
what the hardware dictates--and that _is_ what has happened with
BlueGene and RedStorm.

I believe you have been pointed at papers that detail specific
applications that BlueGene was designed to solve. If the DoE dudes
want to do something different with it, so be it. It wasn't designed
in a vacuum.

[SNIP]
Because there are so many other things to think about.

Eh ? The whole point of a programmer is to fit the domain knowledge
to the tool (as far as that is possible).

The argument for simple hardware is that the programmer has *fewer*
corner cases to worry about and can spend less time fighting the
hardware. Simplicity has other benefits that the customer does not
see : It helps the Vendor do a more thorough validation of the
platform.

The amount of time I have spent working around errant HW and SW
and wished for something closer to the metal you would not believe.

This is from a guy who thinks C sucks for app programming too. :P
Lightweight threads, lightweight comms. All possible, I guess. The
people who have the resources to provide the leadership don't seem to
find the enterprise interesting. Look at the fight for survival
infiniband has had.

Infiniband is about 10,000,000 miles away from lightweight comms,
compare and contrast with IEEE-1355 for example.
Exactly so. If you want to get a single image to run on a kilonode,
your comms have to be pretty slick. As a result, you won't have to pay
too much attention to lame stories about "nearest neighbor" comms. I
like that.

Point is though that HW and SW layered on top to present an illusion
of a single address space isn't for free and you still trip over the
stuff it hides under the red carpet.

Cheers,
Rupert
 
Rupert said:
Who doesn't ?



Ah, now there we differ. I have been bitten by too many corner
cases, too much erroneous behaviour and far too often by insufficiently
spec'd systems. It's just not funny anymore, and the root cause 9/10
times is the vendor trying to do too much.

Now _that_ is discouraging. If we don't know how to put a large number
of processors together so that the environment presented to the
application is reliable, we are in trouble.

But are you sure you want to go down this road? We were talking about a
specific vendor here.
Why should hardware solve that for software ? It's a software problem,
not a hardware one.

Oh, I'm turning into a casual and careless human factors engineer. ;-).

If you give people a hardware environment that invites confusion between
the hardware and software model, people will, perforce, be confused.

I do think that you have unrealistic expectations for the capabilities
and instincts of the average practitioner of the computational arts. If
you give people the opportunity to obsess about hardware details, that's
what they will obsess about.
I believe you have been pointed at papers that detail specific
applications that BlueGene was designed to solve. If the DoE dudes
want to do something different with it, so be it. It wasn't designed
in a vacuum.
Not to worry. I've actually had people say more intelligent and
insightful things about the logic of the packet-switched architecture of
BlueGene and RedStorm and there are more intelligent things written
down, but I've heard and seen the nearest-neighbor argument often enough
to believe that that's how too many people are thinking, no matter how
wrong the logic may be. The argument is actually considerably more
complicated, and the matter is far from settled or even clear in my own
mind.
[SNIP]
Because there are so many other things to think about.


Eh ? The whole point of a programmer is to fit the domain knowledge
to the tool (as far as that is possible).

The argument for simple hardware is that the programmer has *fewer*
corner cases to worry about and can spend less time fighting the
hardware. Simplicity has other benefits that the customer does not
see : It helps the Vendor do a more thorough validation of the
platform.

The amount of time I have spent working around errant HW and SW
and wished for something closer to the metal you would not believe.

This is an issue that might warrant the attentions of a careful
anthropologist; viz., the difference between what people claim they would
insist on in terms of reliability, what they actually accept, and the
practical consequences of self-delusion.

If what you are saying is an accurate reflection of reality, then I
would say: essay less.

Someone who really does essay less, of course, risks losing a
competitive advantage, possibly even to the extent of losing the
opportunity to compete entirely.

In the world of "good enough" commodity hardware, maybe "good enough"
isn't good enough at all.
This is from a guy who thinks C sucks for app programming too. :P

Even your _own_ C code? ;-).
Infiniband is about 10,000,000 miles away from lightweight comms,
compare and contrast with IEEE-1355 for example.
I was putting infiniband forward only as an example of what happens to
anything that isn't ethernet. :-). A quick perusal of IEEE-1355 reveals
that it has the same problem everything else has: bandwidth requirements
are increasing faster than people can even write specs.
Point is though that HW and SW layered on top to present an illusion
of a single address space isn't for free and you still trip over the
stuff it hides under the red carpet.

Well, I do at least take your point.

RM
 
Robert said:
Rupert Pigott wrote:
[SNIP]

Now _that_ is discouraging. If we don't know how to put a large number
of processors together so that the environment presented to the
application is reliable, we are in trouble.

I maintain that this is best left to the application and OS to sort
out. The HW can do a lot to assist by providing features to help with
fault detection and isolation.
But are you sure you want to go down this road? We were talking about a
specific vendor here.

Most big systems already have. What do you think checkpointing is
about ?
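
For the record, a minimal sketch of what checkpointing amounts to at
application level (file name and state layout invented for the
example): dump the state every so often, and on restart resume from the
last dump instead of from step zero.

    /* Illustrative application-level checkpoint/restart (names and
       layout invented for the example). */
    #include <stdio.h>
    #define N 1024
    #define STEPS 100000
    #define CKPT_EVERY 1000

    static double state[N];

    static void checkpoint(int step)
    {
        FILE *f = fopen("ckpt.bin", "wb");
        if (!f) return;
        fwrite(&step, sizeof step, 1, f);
        fwrite(state, sizeof state, 1, f);
        fclose(f);
    }

    static int restore(void)
    {
        FILE *f = fopen("ckpt.bin", "rb");
        int step = 0;
        if (f) {
            if (fread(&step, sizeof step, 1, f) != 1 ||
                fread(state, sizeof state, 1, f) != 1)
                step = 0;            /* bad dump: start from scratch */
            fclose(f);
        }
        return step;
    }

    int main(void)
    {
        for (int step = restore(); step < STEPS; step++) {
            for (int i = 0; i < N; i++)   /* stand-in for real work */
                state[i] += 1e-9 * i;
            if (step % CKPT_EVERY == 0)
                checkpoint(step);
        }
        return 0;
    }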

FWIW I think SGI are one of the better outfits around, I like their
NUMAFlex (hopefully remembered the right name) stuff, looks neat.

[SNIP]
If you give people a hardware environment that invites confusion between
the hardware and software model, people will, perforce, be confused.

If anything I think eschewing complexity in hardware would help clear
this up somewhat. In case you haven't noticed CISC machines have a habit
of being treated as RISC machines. The confusion is "WTF are all these
clever instructions around for ? Why don't I have enough registers ?".
I do think that you have unrealistic expectations for the capabilities
and instincts of the average practitioner of the computational arts. If

The gold diggers are getting down-sized and their jobs exported. The
guys who care about their work will fight tooth and nail to keep their
jobs and so I think the net result will be an improvement in skill
levels.
you give people the opportunity to obsess about hardware details, that's
what they will obsess about.

With MPP my feeling is that you *have* to obsess about hardware details
at the moment.

[SNIP]
In the world of "good enough" commodity hardware, maybe "good enough"
isn't good enough at all.

"Good enough" hardware has a habit of pushing the cost elsewhere,
ie: Onto the developer.
Even your _own_ C code? ;-).

Hell yeah. I hate using C for stuff like string bashing for example,
way too fiddly. Compare and contrast with something like Python.

[SNIP]
I was putting infiniband forward only as an example of what happens to
anything that isn't ethernet. :-). A quick perusal of IEEE-1355 reveals
that it has the same problem everything else has: bandwidth requirements
are increasing faster than people can even write specs.

IEEE-1355 has its origins some fifteen years back. Switch to a
diff PHY layer but keep the switching etc.

The point about 1355 is : At the logical level it specs pretty
much **** all. That mandates that vendors implement **** all,
which means you have **** all to go wrong or get in the
programmer's way. :)

Cheers,
Rupert
 
Rupert said:
The gold diggers are getting down-sized and their jobs exported. The
guys who care about their work will fight tooth and nail to keep their
jobs and so I think the net result will be an improvement in skill
levels.



With MPP my feeling is that you *have* to obsess about hardware details
at the moment.

Possibly so.

I would summarize our positions as "giving people the illusion of a
featureless computational space frees users to think about other things"
and "you're only kidding yourself; in the end, it won't help, because
the complexity is there and you'll have to deal with it, anyway."

As to gold-diggers and competence and whatnot, I believe the problem is
harder than you seem to. The seductive trap of computation is that you
can almost always do _something_. The practitioner has no choice but to
deal with issues, like instabilities that lead to floating point errors,
that keep the computation from proceeding. Hardware and software issues
that keep the computation from proceeding must similarly be dealt with.
Once you've dealt with those issues, how much time is left for
mathematics, science, and engineering? Often, not enough, and, if
you've got a product, survival demands a declaration of victory and
moving on.

The subtext of the current push for more flops is that it will all get
better when the computers get bigger. There are problems that you just
cannot do without more muscle. To the extent that we acquire the
ability to address larger classes of problems, things will, indeed, be
getting better. As to the credibility and usefulness of computation,
I'm not entirely certain that things are getting better.

"But a single system image won't help," I'm sure you will say. Fair enough.

The point about 1355 is : At the logical level it specs pretty
much **** all. That mandates that vendors implement **** all,
which means you have **** all to go wrong or get in the
programmer's way. :)

As always, though, the complexity has to go somewhere. What I can see
of IEEE 1355 looks like an open source project to me. With open source,
you don't have critical information hidden behind NDA's and much of the
decision making and discussion is out in the open and can easily be
accessed. Better than vendor-driven committees? I certainly think so.
You still wind up with many of the same problems, though: software
encrusted with everybody's favorite feature and interfaces that get
broken by changes that are made at a level you have no control over
(like the kernel) and that ripple through everything, for example.

A fair number of people who get involved in these discussions are people
with a Physics/EE background who are fairly confident do-it-yourselfers,
and a fair bit of the puttering comes from places where there are people
wandering around with screwdrivers who also know C and a little physics.
I wonder if part of what you object to with systems like Altix is that
it seems like movement away from open systems and back to the bad old
days. Could a bunch of geeks with a little money from, say, DARPA, do
better? Maybe. I think it's been tried at least once. ;-).

RM
 
Robert said:
Rupert Pigott wrote:
[SNIP]

As always, though, the complexity has to go somewhere. What I can see

Yes. I am painfully aware of Mashey's concerns about pushing
complexity from one place to another.
of IEEE 1355 looks like an open source project to me. With open source,

LOL, not at all. It was a write-up of the T9000's VCP. Bits and pieces
of that technology have made their way into proprietary solutions.

[SNIP]
You still wind up with many of the same problems, though: software
encrusted with everybody's favorite feature and interfaces that get
broken by changes that are made at a level you have no control over
(like the kernel) and that ripple through everything, for example.

Of course, but it's easier to change a kernel than it is to respin
silicon, or replace several thousand busted boards, right ? A lot of
MPP machines seem to give the customer access to the kernel source
which makes it easier for the desperados to fix the problems. :)

[SNIP]
I wonder if part of what you object to with systems like Altix is that
it seems like movement away from open systems and back to the bad old
days. Could a bunch of geeks with a little money from, say, DARPA, do
better? Maybe. I think it's been tried at least once. ;-).

I don't have a problem with Altix at all. I have a *concern* that
the SSI feature is rather like putting Chrome on a Porsche 917K if
you are really interested in getting good perf out of it on an
arbitrary problem + dataset. Data locality is still a key issue.

I don't deny that it will make some apps easier, but in those cases
you are wide open to vendor lock-in IMO. There are worse vendors
than SGI of course, and I don't think they would be quite as evil
as IBM were reputed to be.

For those two reasons I question the long term viability of SSI
MPP machines.

Cheers,
Rupert
 
Rupert Pigott said:
Robert Myers wrote:

[SNIP]
A single system image is no simple cure. It may not be a cure at all.
But it's encouraging that somebody is taking it seriously enough to
build a kilonode machine with a single address space.

Hats off to SGI, kilonode ssi is a neat trick. :)

Let's say you write code that makes use of a large single system
image machine. Let's say SGI fall behind the curve and you need
answers faster : Where can you go for another large single system
image machine ?

I see that kind of awfully clever machine as vendor lock-in waiting
to happen. If you want to avoid lock-in you end up writing your
code to the lowest common denominator, and in this case that will
probably remove any advantage gained by SSI (application depending
of course).

Let's say, instead, that one has an application that seems to require a
256 node machine, but that need might grow in the next couple of years.
SGI's announcement takes the risk out of choosing SGI for that
application.

And after a few more years, a then-current 256 node machine will be able
to take the place of a current 1024 node monster, if the application
doesn't grow too much and one is only worried about the machine or SGI
wearing out.
____________________________________________________________________
TonyN.
 
Tony Nelson wrote:

[SNIP]
Let's say, instead, that one has an application that seems to require a
256 node machine, but that need might grow in the next couple of years.
SGI's announcement takes the risk out of choosing SGI for that
application.

Regardless, you are still effectively locked in if you become dependent
on the SSI feature.

There are also some other factors to take into account... Such as does
your application scale to 1024 on that mythical machine ? If it does
not, who do you turn to if you are committed to SSI ?
And after a few more years, a then-current 256 node machine will be able
to take the place of a current 1024 node monster, if the application
doesn't grow too much and one is only worried about the machine or SGI
wearing out.

Assuming clock rate cranking continues to pay off and the compilers
improve significantly. I figure it'll come down to how much cache Intel
can cram onto an IA-64 die, and that is a diminishing returns game.

BTW : If you read through the immense amount of opinionated stuff I
posted you will see that I actually give SGI some credit. The question
I raise though is : Is SSI really that useful given the lock-in factor ?

Cheers,
Rupert
 