I guess it depends on how high you want to go, but getting to 4GB with
512Mb chips already required some fairly fancy footwork, with different
width memory chips, if I'm remembering things right.
1. Is this really a problem? If we're talking about blades and compute
farms, 4 GB per CPU, so 8 GB and 2 CPUs per node, really should be enough
for anything. If we're not talking about blades, and you really want
2. FB-DIMM controllers can be easily integrated into the Northbridge.
The pincounts are relatively low, the FlexIO interfaces can certainly
support the additional BW, and the longer latencies can be amortized
in the type of applications (large memory capacity, FP-dominant
number crunching codes) that we're presumably discussing here. The
drawback is that a couple of channels of fully populated FBDs will
quickly eat a lot of power, and you can't do blades, but certainly
1U or 2U is doable.
I don't have time to look up the details right now but ISTR that the DP
performance was not even in the same ball-park as the SP - it just wasn't
good enough.
The CELL processor has > 200 GFlops of SP compute power. Saying that
the CELL processor is inadequate because the DP performance is not in
the same ball park as SP performance is entirely silly. There is
no device that has DP performance that's in the same ball park as 200+
GFlops.
To see if the CELL processor is "good enough", you'll have to define
what is "good enough". Opteron {SC,DC}, Itanium {Madison,Montecito},
Pentium {4,D,M}, etc. Are any of these devices "good enough" in terms
of the DP FP performance? What kind of DP Flops can each of these devices
produce per cycle, and how many? DP FADD ops, DP FMUL ops, DP FMADD ops?
FWIW, each SPE in the CELL processor can produce 2 DP FMADD ops every
7 cycles, and the PPE can sustain the throughput of 1 DP FMADD op per
cycle. Pick your favorite device and compare.
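
To make that comparison concrete, here is a rough back-of-the-envelope sketch in Python that turns the per-cycle figures above into DP flops per cycle and a peak GFlops number. The issue rates (2 DP FMADD per 7 cycles per SPE, 1 DP FMADD per cycle on the PPE) are the ones quoted above; the count of 8 SPEs, the 3.2 GHz clock, and the convention that one FMADD counts as 2 flops are my assumptions for illustration, so plug in whatever applies to the part you care about. The helper name dp_flops_per_cycle is made up for this sketch.

```python
from fractions import Fraction

# Convention (assumption): one fused multiply-add counts as 2 flops.
FLOPS_PER_FMADD = 2


def dp_flops_per_cycle(fmadd_ops, cycles):
    """Peak DP flops per cycle for a unit issuing `fmadd_ops` FMADDs every `cycles` cycles."""
    return Fraction(fmadd_ops * FLOPS_PER_FMADD, cycles)


# Issue rates as stated in the post.
spe = dp_flops_per_cycle(2, 7)  # each SPE: 2 DP FMADD every 7 cycles
ppe = dp_flops_per_cycle(1, 1)  # PPE: 1 DP FMADD per cycle

# Assumptions, not from the post: 8 SPEs and a 3.2 GHz clock.
NUM_SPES = 8
CLOCK_GHZ = Fraction(32, 10)

chip_per_cycle = NUM_SPES * spe + ppe
chip_gflops = float(chip_per_cycle * CLOCK_GHZ)

print(f"SPE : {float(spe):.3f} DP flops/cycle")
print(f"PPE : {float(ppe):.3f} DP flops/cycle")
print(f"Chip: {float(chip_per_cycle):.3f} DP flops/cycle, "
      f"about {chip_gflops:.1f} peak DP GFlops at {float(CLOCK_GHZ)} GHz")
```

Do the same for the other device: work out how many DP FADD, FMUL, or FMADD ops it can issue per cycle, multiply by its clock, and put the two peak numbers side by side.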