Cell is Itanium (and some of its problems) redux?


Robert Myers

Greetings,

The answer to the rhetorical question is plainly "no," but this
Anandtech article suggests some similarities:

http://www.anandtech.com/cpuchipsets/showdoc.aspx?i=2379&p=8

I could have been just as provocative by suggesting that Cell is
NetBurst redux: giving up execution agility on the assumption that it
isn't needed for certain kinds of applications with good compiler
support:

<quote>

With the major benefit of out-of-order being a decrease in
susceptibility to memory latencies, the Cell architects proposed
another option - what about an in-order core with controllable (read:
predictable) memory latencies?

In-order microprocessors suffer because as soon as you introduce a
cache into the equation, you no longer have control over memory
latencies. Most of the time, a well-designed cache is going to give
you low latency access to the data that you need. But look at the
type of applications that Cell is targeted at (at least initially) -
3D rendering, games, physics, media encoding etc. - all applications
that aren’t dependent on massive caches. Look at any one of Intel’s
numerous increased-cache CPUs and note that 3D rendering, gaming and
encoding performance usually don’t benefit much beyond a certain
amount of cache. For example, the Pentium 4 660 (3.60GHz - 2MB L2)
offered a 13% increase in Business Winstone 2004 over the Pentium 4
560 (3.60GHz - 1MB L2), but less than a 2% average performance
increase in 3D games. In 3dsmax, there was absolutely no performance
gain due to the extra cache. A similar lack of performance
improvement can be seen in our media encoding tests. The usage model
of the Playstation 3 isn’t going to be running Microsoft Office; it’s
going to be a lot of these “media rich” types of applications like 3D
gaming and media encoding. For these types of applications, a large
cache isn’t totally necessary - low latency memory access is
necessary, and lots of memory bandwidth is important, but you can get
both of those things without a cache. How? Cell shows you how.

Each SPE features 256KB of local memory - more specifically, it is not a cache.
The local memory doesn’t work on its own. If you want to put
something in it, you need to send the SPE a store instruction. Cache
works automatically; it uses hard-wired algorithms to make good
guesses at what it should store. The SPE’s local memory is the size
of a cache, but works just like a main memory. The other important
thing is that the local memory is SRAM based, not DRAM based, so you
get cache-like access times (6 cycles for the SPE) instead of main
memory access times (e.g. 100s of cycles).

What’s the big deal then? With the absence of cache, but the
presence of a very low latency memory, each SPE effectively has
controllable, predictable memory latencies. This means that a smart
developer, or smart compiler, could schedule instructions for each SPE
extremely granularly. The compiler would know exactly when data
would be ready from the local memory, and thus, could schedule
instructions and work around memory latencies just as well as an
out-of-order microprocessor, but without the additional hardware
complexity. If the SPE needs data that’s stored in the main memory
attached to the Cell, the latencies are just as predictable, since
once again, there’s no cache to worry about mucking things up.

</quote>
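
To make the quoted model concrete: with no cache, keeping an SPE fed
becomes the programmer's (or compiler's) problem, and the obvious trick
is double buffering--DMA the next chunk in while computing on the
current one. A minimal sketch in C, with dma_get()/dma_wait() as
stand-ins for whatever asynchronous DMA primitives the real toolchain
exposes (nothing official is public yet):

/* A minimal double-buffering sketch (hypothetical dma_get/dma_wait
 * primitives; the real toolchain's names are unknown to me).  Software,
 * not a cache, decides what sits in the 256KB local store and when. */

#include <stddef.h>

#define CHUNK 4096                     /* floats per transfer */

static float buf[2][CHUNK];            /* two buffers in local store */

extern void dma_get(void *ls, unsigned long ea, size_t n, int tag);
extern void dma_wait(int tag);         /* block until tagged DMA completes */
extern void process(float *data, size_t n);

/* ea: main-memory address of the input; total: float count, assumed
   a multiple of CHUNK for brevity */
void stream(unsigned long ea, size_t total)
{
    int cur = 0;
    dma_get(buf[cur], ea, sizeof buf[cur], cur);      /* prefetch chunk 0 */
    for (size_t done = 0; done < total; done += CHUNK) {
        int nxt = cur ^ 1;
        if (done + CHUNK < total)                     /* start next fetch */
            dma_get(buf[nxt], ea + (done + CHUNK) * sizeof(float),
                    sizeof buf[nxt], nxt);
        dma_wait(cur);                 /* latency is known and predictable */
        process(buf[cur], CHUNK);      /* compute overlaps the DMA */
        cur = nxt;
    }
}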

Reality? Wishful thinking? The decisive explanation as to why Cell
is a media (and maybe a physics) chip and nothing else?

My apologies to the comp.arch readership for bringing such a low-rent
article on such a downscale chip into such an august forum, but, if
you can get past those issues, there must be lots of interesting
thoughts out there.

If nothing else, one could have all the scheduling threads (Itanium,
NetBurst, OoO, on-die controller) all over again, and this is a
different mix: in-order, no cache, on-die controller, local
low-latency memory, multi-threaded, multiple coprocessors--each with
its own hardware prefetch--designed for stream processing. Who's
building the compiler? Does anybody know?

RM
 
Robert Myers wrote:
[snip]
Reality? Wishful thinking? The decisive explanation as to why Cell
is a media (and maybe a physics) chip and nothing else?

My apologies to the comp.arch readership for bringing such a low-rent
article on such a downscale chip into such an august forum, but, if
you can get past those issues, there must be lots of interesting
thoughts out there.

If nothing else, one could have all the scheduling threads (Itanium,
NetBurst, OoO, on-die controller) all over again, and this is a
different mix: in-order, no cache, on-die controller, local
low-latency memory, multi-threaded, multiple coprocessors--each with
its own hardware prefetch--designed for stream processing. Who's
building the compiler? Does anybody know?

RM
Since I believe the first use will be as a video game console,
it would seem appropriate to look at the current and historical
channel for PlayStation 2 development systems and software. Sony
is in control of that part of it.

What does a PlayStation 2 game development system look like?

del
 
With the major benefit of out-of-order being a decrease in
susceptibility to memory latencies, the Cell architects proposed
another option - what about an in-order core with controllable (read:
predictable) memory latencies?

I.e., just like a DSP.

[snip]
Reality? Wishful thinking? The decisive explanation as to why Cell
is a media (and maybe a physics) chip and nothing else?

Just like a DSP, you only get to run one application on the beast, because
you can't afford to context-share 8x256kBytes in an interrupt-driven,
time-share arrangement.

That's the "what's new here", really: an abandonment of multi-tasking.

Of course, with lots and lots of cores, perhaps that doesn't matter to
you. Maybe your video player and background HPC job are happy to share
your resources vertically, using assigned dedicated processors, rather
than horizontally, using time slices. The more cores you have, the more
plausible that approach sounds.

It does hint at some interesting OS/control/scheduling software. Maybe
that's where the talk of JVMs comes in? If the "one application" that
knows where things are in the SRAM is a (J)VM, then presumably you can
also make multiple threads appear to be sharing individual SPEs too.
If nothing else, one could have all the scheduling threads (Itanium,
NetBurst, OoO, on-die controller) all over again, and this is a
different mix: in-order, no cache, on-die controller, local low-latency
memory, multi-threaded, multiple coprocessors--each with its own
hardware prefetch--designed for stream processing. Who's building the
compiler? Does anybody know?

Re scheduling pre-fetch, I'm sure it's possible. TI's C6000 series of
VLIW DSPs has (from memory) a fixed five cycles to on-chip memory, from
each of a pair of load/store units, and six other FUs to keep busy each
cycle. The compiler does an excellent job, from what I've seen. The
pipeline is strictly in-order, and the whole thing locks up if you dare to
code a stall, so you (and the compiler) DON'T DO THAT, as a general rule.
(DMA engines get a work-out in the background.)
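
For flavour, the kind of loop that arrangement handles beautifully--a
sketch only, and TI's MUST_ITERATE pragma is quoted from memory, so
check the compiler manual:

/* Sketch only: restrict rules out aliasing, and TI's MUST_ITERATE
 * pragma (syntax from memory -- check the manual) promises a trip
 * count that lets the compiler software-pipeline every cycle. */

float dot(const float * restrict a, const float * restrict b, int n)
{
    float sum = 0.0f;
    #pragma MUST_ITERATE(8, , 8)    /* n >= 8 and a multiple of 8 */
    for (int i = 0; i < n; i++)
        sum += a[i] * b[i];
    return sum;
}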

Cheers,
 
With the major benefit of out-of-order being a decrease in
susceptibility to memory latencies, the Cell architects proposed
another option - what about an in-order core with controllable (read:
predictable) memory latencies?

I.e., just like a DSP.

[snip]
Reality? Wishful thinking? The decisive explanation as to why Cell
is a media (and maybe a physics) chip and nothing else?

Just like a DSP, you only get to run one application on the beast, because
you can't afford to context-share 8x256kBytes in an interrupt-driven,
time-share arrangement.

That's the "what's new here", really: an abandonment of multi-tasking.

Of course, with lots and lots of cores, perhaps that doesn't matter to
you. Maybe your video player and background HPC job are happy to share
your resources vertically, using assigned dedicated processors, rather
than horizontally, using time slices. The more cores you have, the more
plausible that approach sounds.
The time-slicing, I would have thought, is in the streaming. The
easiest stream to deal with is homogeneous, but it wouldn't have to
be. That's not multi-tasking in the usual sense, but it can be data
driven; i.e., the processor processes whatever is available. Nobody
has said it that I can remember, but I see no fundamental reason why
instructions can't be streamed.
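
A crude illustration of what I mean by instructions riding along in the
stream--entirely hypothetical, with next_record() standing in for
whatever the delivery mechanism turns out to be:

/* Entirely hypothetical: next_record() stands in for however data
 * arrives (DMA completion, mailbox, etc.).  The "instruction" is just
 * an opcode riding in the record header. */

#include <stddef.h>

enum op { OP_FILTER, OP_TRANSFORM, OP_HALT };

struct record {
    enum op op;                 /* what to do ...          */
    size_t  len;                /* ... to how many samples */
    float   data[1024];
};

extern struct record *next_record(void);   /* blocks until data arrives */
extern void filter(float *samples, size_t n);
extern void transform(float *samples, size_t n);

void consume(void)
{
    for (;;) {
        struct record *r = next_record();
        switch (r->op) {
        case OP_FILTER:    filter(r->data, r->len);    break;
        case OP_TRANSFORM: transform(r->data, r->len); break;
        case OP_HALT:      return;    /* end of stream */
        }
    }
}
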
It does hint at some interesting OS/control/scheduling software. Maybe
that's where the talk of JVMs comes in? If the "one application" that
knows where things are in the SRAM is a (J)VM, then presumably you can
also make multiple threads appear to be sharing individual SPEs too.
I don't understand. If the SPEs are hidden behind a VM, why would
you know, or care, what resources the VM is using to carry out your
bidding? And that style of coding is so foreign to DSPs, isn't it?
Re scheduling pre-fetch, I'm sure it's possible. TI's C6000 series of
VLIW DSPs has (from memory) a fixed five cycles to on-chip memory, from
each of a pair of load/store units, and six other FUs to keep busy each
cycle. The compiler does an excellent job, from what I've seen. The
pipeline is strictly in-order, and the whole thing locks up if you dare to
code a stall, so you (and the compiler) DON'T DO THAT, as a general rule.
(DMA engines get a work-out in the background.)

So, aside from existing tool chains and body of experience for
graphics processors, there is a relevant set of tool chains and body
of experience from the world of DSPs.

It's off the comparison you wanted to make, but, when I looked at the
block diagram of the C6000, I immediately wanted to map the PowerPC
processing element of Cell onto the C6000 front end (fetch, dispatch,
and decode), with Cell having up to eight data paths to the execution
units (SPEs) versus the C6000's two data paths to groups of four bundled
execution units.

RM
 
The time-slicing, I would have thought, is in the streaming. The
easiest stream to deal with is homogeneous, but it wouldn't have to be.
That's not multi-tasking in the usual sense, but it can be data driven;
i.e., the processor processes whatever is available. Nobody has said
it that I can remember, but I see no fundamental reason why instructions
can't be streamed.

I expect that may be possible, but it depends heavily on how much
persistent state each of the stream processes requires. It's unlikely
to be none, but it will be strongly algorithm-dependent. The choices come
down to allowing the stream programs to believe that they "own" the
processor, which requires the persistent state to be swapped in and out as
different streams come in, or requiring them to cooperatively share the
resource through some sort of run-time linking or relocation. It's
certainly a different model from current workstation and HPC processors,
though.
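
A rough picture of the first option, with made-up sizes and DMA
helpers; the point is that the cost of a switch scales with the
persistent state, not with the full 256KB:

/* Made-up sizes and DMA helpers; the point is that a switch costs
 * as much as the persistent state you must move, not the full 256KB. */

#include <stddef.h>

struct stream_ctx {
    float coeffs[64];           /* algorithm-dependent persistent state */
    float history[256];         /* e.g. a filter's delay line */
};

static struct stream_ctx ctx;   /* lives in the SPE's local store */

/* stand-ins for synchronous DMA out of / into local store */
extern void dma_put_sync(unsigned long ea, const void *ls, size_t n);
extern void dma_get_sync(void *ls, unsigned long ea, size_t n);

void switch_stream(unsigned long save_ea, unsigned long load_ea)
{
    dma_put_sync(save_ea, &ctx, sizeof ctx);   /* park outgoing state */
    dma_get_sync(&ctx, load_ea, sizeof ctx);   /* pull in the next one */
}
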
I don't understand. If the SPEs are hidden behind a VM, why would you
know, or care, what resources the VM is using to carry out your bidding?

Oh, the coder wouldn't care. I'm suggesting that that might be a way to
achieve a "conventional" level of abstraction and resource sharing, given
the absence of the usual hardware resources (caches, memory mapping).
And that style of coding is so foreign to DSPs, isn't it?

True, but most DSPs don't have to worry about time sharing[1]: they only
run one application from boot until power-down. The developer gets to lay
out the use of the fixed on-chip memory resource at link-time. Sometimes
this involves overlays and tricky multiple use, when the DSP program can
operate in several different ways or modes, but it's still all arranged at
compile/link time. Not like a general-purpose workstation. (The
crudest form of the overlay trick is sketched below, after the
footnote.)

[1] That's as in interactive multi-programming, rather than within-program
multi-threading. The latter is used very frequently in many DSP
environments, of course, even if only at the level of interrupt handlers.
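
The crudest form of that overlay trick, as promised above--a union over
one region of on-chip RAM, with the mutually exclusive uses resolved
entirely at compile time:

/* Two workspaces that are never live at the same time share one region
 * of on-chip RAM; the "allocation" is fixed entirely at compile time. */

static union {
    float fft_work[2048];       /* mode A's workspace */
    short decode_buf[4096];     /* mode B's workspace */
} scratch;

void run_mode_a(void) { /* ... fills and uses scratch.fft_work ...   */ }
void run_mode_b(void) { /* ... fills and uses scratch.decode_buf ... */ }
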
It's off the comparison you wanted to make, but, when I looked at the
block diagram of the C6000, I immediately wanted to map the PowerPC
processing element of Cell onto the C6000 front end (fetch, dispatch,
and decode), with Cell having up to eight data paths to the execution
units (SPEs) versus the C6000's two data paths to groups of four bundled
execution units.

Hmm. I thought that the SPEs were a bit more autonomous than that. More
like SIMD DSPs all on their own. They do have their own program
store, with conditional branches and subroutine calls, don't they? Do
they take interrupts (on DMA finish, or inter-processor message
receipt, for example), or are they purely synchronous?

Cheers,
 