Robert said:
> On the other hand, the suggestion was recently made here that maybe we
> should just banish SMP as an unacceptable programming style (meaning, I
> think, that multiprocessor programming should not be done in a
> globally-shared memory space, or at least that the shared space should
> be hidden behind something like MPI).
I wonder how much SMP style, and the uniform address spaces that
go with it, can be hidden under VM, pointer swizzling and layers
of software-based caching. Probably not much, really.
> The situation is _so_ bad that it doesn't seem embarrassing, apparently,
> for Orion Multisystems to take a lame processor, to hobble it further
> with a lame interconnect, and to call it a workstation. If the future
> of computing really is slices of Wonder Bread in a plastic bag and not a
> properly cooked meal, then the Orion box makes some sense. Might as
> well get used to it and start programming on an architecture that at
> least has the right topology and instruction set, as I believe Andrew
> Reilly is suggesting.
Well, I think that the specific instruction set is probably a red
herring. I reckon that an object code specifically designed to be
a target for JIT compilation to a register-to-register VLIW engine
of indeterminate dimensions will turn out to be better ultimately.
There are projects moving in that direction:
http://llvm.cs.uiuc.edu/, and, from long, long ago, the TAO Group's
VM. Stack-based VMs like the JVM and MSIL might or might not be
the right answer. I guess we'll find out soon enough.
Code portability and density are important, of course, but the main
thing is winning back, with dynamic recompilation, some of the
unknowables that a plain in-order VLIW or RISC design visits on code.
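
To make that concrete, here's a toy sketch in Python (purely
illustrative -- not real JVM bytecode or LLVM IR): the same
expression written for a stack machine and for a register-to-register
form. The register form leaves the two independent multiplies
explicit for a JIT to pack into whatever VLIW width it finds; the
stack form serialises everything through the top of the stack and
the JIT has to rediscover the parallelism.

def run_stack(code, env):
    """Minimal stack-machine interpreter."""
    stack = []
    for op, *args in code:
        if op == "push":
            stack.append(env[args[0]])
        elif op == "mul":
            b, a = stack.pop(), stack.pop()
            stack.append(a * b)
        elif op == "add":
            b, a = stack.pop(), stack.pop()
            stack.append(a + b)
    return stack.pop()

def run_regs(code, env):
    """Minimal register-machine interpreter over three-address code."""
    regs = dict(env)
    for dst, op, x, y in code:
        regs[dst] = regs[x] * regs[y] if op == "mul" else regs[x] + regs[y]
    return regs[dst]

# d = a*b + e*f, in both encodings.
env = {"a": 2.0, "b": 3.0, "e": 4.0, "f": 5.0}

stack_code = [("push", "a"), ("push", "b"), ("mul",),
              ("push", "e"), ("push", "f"), ("mul",), ("add",)]

reg_code = [("t1", "mul", "a", "b"),   # independent of t2: could share a VLIW bundle
            ("t2", "mul", "e", "f"),
            ("t3", "add", "t1", "t2")]

assert run_stack(stack_code, env) == run_regs(reg_code, env) == 26.0
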
The Transmeta Efficeon is just the first widely available
processor with embedded levels of integration (a memory
controller, some peripheral interfaces, and HyperTransport for
other peripherals) and power consumption that can do pipelined
double-precision floating-point multiply/adds at two flops/clock
at an interesting clock rate. 1.5 GHz is significantly faster than
the DSP competitors: the TI C6700 tops out at 300 MHz and only does single
precision at the core rate. PowerPC+AltiVec doesn't have the
memory controller or the peripheral interconnect to drive up the
areal density. The BlueGene core is about the right shape, but I
haven't seen any industrial/embedded boxes with a few dozen of
them inside, yet. The MIPS and ARM processors that have the
integration don't have the floating-point chops. Modern versions
of the VIA C3 might be getting interesting (or not: I haven't
looked at their double-precision performance), but they have
neither the memory controller nor the HyperTransport link, nor
quite the MHz. Of course, Opterons fit that description too, and
clock much faster, but I thought that they consumed considerably
more power, too. Maybe their MIPS/watt is closer than I've given
them credit for.
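
For what it's worth, the peak arithmetic implied by those numbers
is easy to run (a back-of-the-envelope sketch only: the clock rates
and the two flops/clock come from above, sustained rates will be
lower, and I deliberately don't guess the C6700's flops/clock):

def peak_gflops(clock_ghz, flops_per_clock):
    """Peak rate = clock rate x floating-point operations retired per clock."""
    return clock_ghz * flops_per_clock

# Efficeon as described above: 2 DP flops/clock at 1.5 GHz.
print("Efficeon DP peak:", peak_gflops(1.5, 2.0), "GFLOPS")   # 3.0

# TI C6700 at 300 MHz: whatever its flops/clock, the clock alone is
# 5x slower, and it only reaches its core rate in single precision.
print("clock ratio, Efficeon vs C6700:", 1.5 / 0.3)           # 5.0
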
> If big computers are to be used to solve problems, they are inevitably
> going to fall into the hands of people who are more interested in
> solving problems than they are in the computers...as should be. If we
> really can't conjure tools for programming them that are reliable in the
> hands of relative amateurs, I see it as a more pressing issue than not
> being able to do hot fusion (the prospects for wind and solar having
> come along very nicely).
For such people, I suspect that the appropriate level of
programming is that of science fiction starship bridge computers:
"here's what I want: make it so". I wonder if anyone has looked
at something like simulated annealing or genetic optimisation to
drive memory access patterns revealed by problems expressed at an
APL or Matlab (or higher) level. For most of the "big science"
problems, I suspect that the "what I want" is not terribly
difficult to express (once you've done the science-level thinking,
of course). The tricky part, at the moment, is having a human
understand the redundancies, the dataflows, and the
numerical-stability issues well enough to map the direct form of
the solution onto something efficient (on one processor, or on a
bunch of them). I think that from a sufficient altitude, that looks
like an annealing problem, with dynamic recompilation being the
lower-tier mechanism of the optimisation target. The lucky thing
about "big science" problems is that by definition they have big
data, and run for a long time. That time and that amount of data
might as well be used by the machine itself to try to speed the
process up, rather than by a bunch of humans attempting the same
thing without such intimate access to the actual values in the
data sets and computations.
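
As a toy version of that idea, here's a little simulated-annealing
sketch in Python. Everything in it is hypothetical: the "program"
is just a permutation of data-block accesses and the cost function
is a crude stand-in for cache behaviour, where in the scheme above
the cost would instead be the measured runtime of dynamically
recompiled code on the real data:

import math, random

def cost(order):
    """Crude locality metric: total distance between consecutive block accesses."""
    return sum(abs(a - b) for a, b in zip(order, order[1:]))

def anneal(n_blocks=32, steps=20000, t0=10.0):
    """Anneal an access order to minimise cost(); the ideal is n_blocks - 1."""
    order = list(range(n_blocks))
    random.shuffle(order)
    current = cost(order)
    best, best_cost = order[:], current
    for step in range(steps):
        temp = t0 * (1.0 - step / steps) + 1e-9      # simple linear cooling
        i, j = random.sample(range(n_blocks), 2)     # propose: swap two accesses
        order[i], order[j] = order[j], order[i]
        proposed = cost(order)
        if proposed < current or random.random() < math.exp((current - proposed) / temp):
            current = proposed                       # accept (sometimes uphill, early on)
            if current < best_cost:
                best, best_cost = order[:], current
        else:
            order[i], order[j] = order[j], order[i]  # reject: undo the swap
    return best, best_cost

best_order, best_cost = anneal()
print("annealed access-order cost:", best_cost, "(ideal is 31)")
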
It's late, I've had a few glasses of a nice red and I'm rambling.
Sorry about that. Hope the ramble sparks some other ideas.