Robert Myers
Greetings,
The answer to the rhetorical question is plainly "no," but this
Anandtech article suggests some similarities:
http://www.anandtech.com/cpuchipsets/showdoc.aspx?i=2379&p=8
I could have been just as provocative by suggesting that Cell is
Netburst redux: giving up execution agility on the assumption that it
isn't needed for certain kinds of applications with good compiler
support:
<quote>
With the major benefit of out-of-order being a decrease in
susceptibility to memory latencies, the Cell architects proposed
another option - what about an in-order core with controllable (read:
predictable) memory latencies?
In-order microprocessors suffer because as soon as you introduce a
cache into the equation, you no longer have control over memory
latencies. Most of the time, a well-designed cache is going to give
you low latency access to the data that you need. But look at the
type of applications that Cell is targeted at (at least initially) -
3D rendering, games, physics, media encoding etc. - all applications
that aren’t dependent on massive caches. Look at any one of Intel’s
numerous increased-cache CPUs and note that 3D rendering, gaming and
encoding performance usually don’t benefit much beyond a certain
amount of cache. For example, the Pentium 4 660 (3.60GHz - 2MB L2)
offered a 13% increase in Business Winstone 2004 over the Pentium 4
560 (3.60GHz - 1MB L2), but less than a 2% average performance
increase in 3D games. In 3dsmax, there was absolutely no performance
gain due to the extra cache. A similar lack of performance
improvement can be seen in our media encoding tests. The usage model
of the Playstation 3 isn’t going to be running Microsoft Office; it’s
going to be a lot of these “media rich” types of applications like 3D
gaming and media encoding. For these types of applications, a large
cache isn’t totally necessary - low latency memory access is
necessary, and lots of memory bandwidth is important, but you can get
both of those things without a cache. How? Cell shows you how.
Each SPE features 256KB of local memory; more specifically, it is local memory, not cache.
The local memory doesn’t work on its own. If you want to put
something in it, you need to send the SPE a store instruction. Cache
works automatically; it uses hard-wired algorithms to make good
guesses at what it should store. The SPE’s local memory is the size
of a cache, but works just like a main memory. The other important
thing is that the local memory is SRAM based, not DRAM based, so you
get cache-like access times (6 cycles for the SPE) instead of main
memory access times (e.g. 100s of cycles).
What’s the big deal then? With the absence of cache, but the
presence of a very low latency memory, each SPE effectively has
controllable, predictable memory latencies. This means that a smart
developer, or smart compiler, could schedule instructions for each SPE
extremely granularly. The compiler would know exactly when data
would be ready from the local memory, and thus, could schedule
instructions and work around memory latencies just as well as an
out-of-order microprocessor, but without the additional hardware
complexity. If the SPE needs data that’s stored in the main memory
attached to the Cell, the latencies are just as predictable, since
once again, there’s no cache to worry about mucking things up.
</quote>
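To make the quoted idea concrete, here is a minimal C sketch of software-managed local memory, the technique the article describes. This is not the actual Cell SDK API; `local_fetch` and `sum_chunks` are illustrative names, and a plain `memcpy` stands in for what on an SPE would be an asynchronous DMA transfer with a known completion latency. The point is only that residency in the fast buffer is decided by the program, not by cache hardware:

```c
/* Sketch: software-managed local store.  The program, not a cache,
   decides what is resident in fast memory.  Names are illustrative;
   this is not the Cell SDK's MFC interface. */
#include <stddef.h>
#include <string.h>

#define LOCAL_WORDS 256          /* stand-in for the SPE's 256KB local store */

static float local_store[LOCAL_WORDS];

/* Explicit transfer: main memory -> local store.  On Cell this would be
   an asynchronous DMA with a predictable latency; here, a memcpy. */
static void local_fetch(const float *main_mem, size_t offset, size_t n)
{
    memcpy(local_store, main_mem + offset, n * sizeof(float));
}

/* Process one chunk at a time entirely out of local memory: every load
   has the same short, known latency, so a compiler (or programmer) can
   schedule instructions around it statically, as the article claims. */
static float sum_chunks(const float *main_mem, size_t total)
{
    float acc = 0.0f;
    for (size_t off = 0; off < total; off += LOCAL_WORDS) {
        size_t n = (total - off < LOCAL_WORDS) ? total - off : LOCAL_WORDS;
        local_fetch(main_mem, off, n);   /* software decides residency */
        for (size_t i = 0; i < n; i++)
            acc += local_store[i];
    }
    return acc;
}
```

A real SPE program would double-buffer (fetch chunk k+1 while computing on chunk k) to hide the transfer entirely, which is exactly the kind of static scheduling the article says becomes possible once latencies are predictable.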
Reality? Wishful thinking? The decisive explanation of why Cell
is a media (and maybe a physics) chip and nothing else?
My apologies to the comp.arch readership for bringing such a low-rent
article on such a downscale chip into such an august forum, but, if
you can get past those issues, there must be lots of interesting
thoughts out there.
If nothing else, one could have all the scheduling threads (Itanium,
NetBurst, OoO, on-die controller) all over again, and this is a
different mix: in-order, no cache, on-die controller, local
low-latency memory, multi-threaded, multiple coprocessors--each with
its own hardware prefetch--designed for stream processing. Who's
building the compiler? Does anybody know?
RM