First Picture of a Cell Processor - Smaller Than a Pushpin, More Powerful Than a PC

NEXT said:
Daniel said:
On 2005-02-10 01:07:53 -0500, "NEXT BOX" <[email protected]> said:


http://macdailynews.com/index.php/weblog/comments/4967/

Intel has no answer to the 'Cell' processor; will Apple use it in
Macs?


Not too likely. The Cell is not a general-purpose CPU; most programs
have a working set larger than the 256K memory the Cell supports, and
changing them to work in 256K increments would be rather hard.

The Cell seems to be intended for use as a sort of co-processor, and
maybe they could be used to build a sort of GPU. But the benefits will
be in any event pretty limited.

not -

the Cell is a G5 processor with *ADDITIONAL* CPUs on the same chip

also if you had bothered to read -
the Cell's G5 carries a 32K level 1 cache and a 512K level 2 cache,
while the 'satellite' CPUs each carry a 256K cache

excepting die size (giant at 221 mm²) there is no practical reason that
these could not be put into Macs, or any other general use PC (including
in a dual [Mac tower] or quad [IBM Server] processor configuration).


Cell's master CPU is not a G5 processor. it's much more streamlined than a
G5.
My bad - I read PPC processor, and inferred G5

however, from the article I read, there is no (more) specific info on
the core processor.

This is the article I read:
<http://www.computerworld.com/hardwaretopics/hardware/story/0,10801,99607,00.html>
 
Fetch said:
Daniel said:
On 2005-02-10 01:07:53 -0500, "NEXT BOX" <[email protected]> said:

http://macdailynews.com/index.php/weblog/comments/4967/

Intel has no answer to the 'Cell' processor; will Apple use it in
Macs?


Not too likely. The Cell is not a general-purpose CPU; most programs
have a working set larger than the 256K memory the Cell supports, and
changing them to work in 256K increments would be rather hard.

The Cell seems to be intended for use as a sort of co-processor, and
maybe they could be used to build a sort of GPU. But the benefits will
be in any event pretty limited.
not -

the Cell is a G5 processor with *ADDITIONAL* CPUs on the same chip

also if you had bothered to read -
the Cell's G5 carries a 32K level 1 cache and a 512K level 2 cache,
while the 'satellite' CPUs each carry a 256K cache

excepting die size (giant at 221 mm²) there is no practical reason that
these could not be put into Macs, or any other general use PC (including
in a dual [Mac tower] or quad [IBM Server] processor configuration).

Cell's master CPU is not a G5 processor. it's much more streamlined than a
G5.

No, at least the way I "read" things the cell's cpu is pretty much
a stripped-down in-order PowerPC. The 'G5' is very much out-of-order and
hardly stripped.
 
Take a look at Alpha's Piranha; that's a perfect example of how people are
trying to employ Rambus for server CMP technology.

And a fat load of good it's done the Alpha. To be fair though, I
don't think that its collapse had anything to do with using Rambus
memory one way or the other.

Besides, even there it's not exactly showing tremendous performance.
Even in SPEC CFP2000, kind of the ideal benchmark for this
super-bandwidth setup, the EV7 is still being beaten by current x86
processors. When looking at other tests it's even less impressive.

Now again those results are not necessarily an indication that Rambus
is a failure because there are MANY other issues holding the EV7's
performance back (only some of which are technical). Unfortunately
it's impossible to say just how well/poorly a theoretical EV7 with a
DDR-SDRAM interface might have compared to the RDRAM one. However I'm
not sure that one can really hold up EV7 as a success story for Rambus
in any way.
I guess my expectations for Cell were pretty high. But I guess most of
the size of the die is for accommodating the greater number of pins
required to feed the behemoth. Eight vector processors on a die is close to
a Cray X1 node, and they had to cool that using Freon. I guess the
cache is way too small (2.5 MB is nothing for a high-performance
vector processor)

Keep in mind that this chip and the Cray X1 are designed for VERY
different workloads. The Cell is likely to be rather weak
(comparatively speaking) when running scientific computing as compared
to how it works in a gaming console.
 
I think the concept is to make everything part of a distributed
computing network by using the same processing unit. Sure your TV
might not need that much processing power to handle HDTV decryption
but you can link it up to your Cell PC to offload some of that wedding
video encoding you're doing... along with your radio, refrigerator and
PDA.

Just think of how much $$$ we're talking about here if everything runs
on Cell so I'm sure they will figure out a way to use Cell in TV,
toilet flush and trash bin too :pPpPP

Riiiiigghtttt.... And just WHO is going to write the software for my
networked toilet? Please tell me it's not Microsoft, because that's
one place I do NOT want a blue-screen! :>
 
not -

the cell is a G5 processor with *ADDTIONALLY* cpus on the same chip

The Cell's PowerPC core is most definitely NOT a G5 processor, nor is
it really in any way related to it except that they both use the PPC
ISA.
also if you had bothered to read -
the cell's G5 carries a 32K level cache, and a 512K level 2 cache
while the 'satellite' cpus carry the 256K cache

excepting die size (giant at 221 mm) there is no practical reason that
these could not be put into Macs, or any other general use PC (including
in a dual [Mac tower] or quad [IBM Server] processor configuration).

Sure, they could, but they wouldn't be anywhere near as fast as any
current processors unless you can make use of the vector engine. And
doing that means changing ALL of the software, and that is
*EXPENSIVE*.


We will definitely NOT see the Cell processor (or its descendants)
showing up in any meaningful way (i.e. not just a couple of special-purpose
workstations from IBM) as the main processor in desktop computers for
AT LEAST 5 years, probably more like 10+ years, if ever. And yes, you
can quote me on that.
 
Given all of that, Rambus would also make sense for graphics cards
(where all of the same things hold). Yet both Nvidia and ATI go with
DDR-SDRAM. Why? Is the savings by reducing pins less than the
premium for Rambus RAM? If so, wouldn't it also make sense for PS3 to
use DDR(2)-SDRAM?

I've wondered the very same thing myself. To me, from the outside at
least, it seems like it would make sense. Rambus memory has been used
in video cards before, but only in some very rare situations. I don't
think there would be much of a cost difference for the memory
chips either, given that video cards use very high-end/high-speed GDDR3
memory, quite a bit more expensive than the DDR memory used in desktop
PCs.

However nVidia has commented before that they have evaluated Rambus
memory on more than one occasion and found it to be unsuitable for
their application. It's always made me wonder if maybe they know
something that the rest of us don't? Or maybe their decision was only
partly based on technical reasons and partly on more political/legal
related ones? Or maybe it has to do with Rambus licensing fees for
the memory controller rather than for the memory itself?

In short, I really don't know what the answer is here.
In essence, what's so different between the PS3 and graphics cards
that one goes with Rambus whereas the others go with DDR(2)?

Not much from where I'm standing.
 
Back to Basics
The fundamental task of a processor is to manage the flow of data through
its computational units. However in the past two decades, each successive
generation of processors for personal computers has added more transistors
dedicated to increasing the performance of spaghetti-like integer code. For
example, it is well known that typical integer codes are branchy and that
branch mispredict penalties are expensive; in an effort to minimize the
impact of branch instructions, transistors were used to develop highly
accurate branch predictors. Aside from branch predictors, sophisticated
cache hierarchies with large tag arrays and predictive cache prefetch units
attempt to hide the complexity of data movement from the software, and
further increase the performance of single threaded applications. The
pursuit of single threaded performance can be observed in recent years in
the proposal of extraordinarily deeply pipelined processors designed
primarily to increase the performance of single threaded applications, at
the cost of higher power consumption and larger transistor budgets.

The fundamental idea of the CELL processor project is to reverse this trend
and give up the pursuit of single threaded performance, in favor of
allocating additional hardware resources to perform parallel computations.
That is, minimal resources are devoted toward the execution of single
threaded workloads, so that multiple DSP-like processing elements can be
added to perform more parallelizable multimedia-type computations. In the
examination of the first implementation of the CELL processor, the theme of
the shift in focus from the pursuit of single threaded integer performance
to the pursuit of multiply threaded, easily parallelizable multimedia-type
performance is repeated throughout.

CELL Basics
The CELL processor is a collaboration between IBM, Sony and Toshiba. The
CELL processor is expected by this consortium to provide computing power an
order of magnitude above and beyond what is currently available to its
competitors. The International Solid-State Circuits Conference (ISSCC) 2005
was chosen by the group as the location to describe the basic hardware
architecture of the processor and announce the first incarnation of the CELL
processor family.

Members of the CELL processor family share basic building blocks, and
depending on the requirements of the application, specific versions of the
CELL processor can be quickly configured and manufactured to meet that need.
The basic building blocks shared by members of the CELL family of processors
are the following:

a. The PowerPC Processing Element (PPE)
b. The Synergistic Processing Element (SPE)
c. The L2 Cache
d. The internal Element Interconnect Bus (EIB)
e. The shared Memory Interface Controller (MIC), and
f. The FlexIO interface
Each SPE is in essence a private system-on-chip (SoC), with the processing
unit connected directly to 256KB of private Load Store (LS) memory. The PPE
is a dual threaded (SMT) PowerPC processor connected to the SPE's through
the EIB. The PPE and SPE processing elements access system memory through
the MIC, which is connected to two independent channels of Rambus XDR
memory, providing 25.6 GB/s of memory bandwidth. The connection to I/O is done
through the FlexIO interface, also provided by Rambus, providing 44.8 GB/s
of raw outbound BW and 32 GB/s of raw inbound bandwidth for total I/O
bandwidth of 76.8 GB/s. At ISSCC 2005, IBM announced that the first
implementation of the CELL processor has been tested to operate at
frequencies above 4 GHz. In the CELL processor, each SPE is capable of
sustaining 4 FMADD operations per cycle. At an operating frequency of 4 GHz,
the CELL processor is thus capable of achieving a peak throughput rate of
256 GFlops from the 8 SPE's. Moreover, the PPE can contribute some amount of
additional compute power with its own FP and VMX units.
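The 256 GFlops figure follows directly from the numbers above, since each FMADD (fused multiply-add) counts as two floating point operations. A quick sanity check, using only the figures quoted in the text:

```python
# Peak single-precision throughput of the 8-SPE CELL at 4 GHz.
# Each FMADD counts as 2 floating point operations.
FMADD_PER_CYCLE = 4          # per SPE, per the ISSCC 2005 disclosure
FLOPS_PER_FMADD = 2
SPES = 8
FREQ_HZ = 4e9

peak_flops = FMADD_PER_CYCLE * FLOPS_PER_FMADD * SPES * FREQ_HZ
print(peak_flops / 1e9)      # 256.0 GFlops
```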



http://www.realworldtech.com/includes/images/articles/cell-1.gif

Figure 1 - Die photo of CELL processor with block diagram overlay



Figure 1 shows the die photo of the first CELL processor implementation with
8 SPE's. The sample processor tested was able to operate at a frequency of 4
GHz with Vdd of 1.1V. The power consumption characteristics of the processor
were not disclosed by IBM. However, estimates in the range of 50 to 80 Watts
@ 4 GHz and 1.1 V were given. One unconfirmed report claims that at the
extreme end of the frequency/voltage/power spectrum, one sample CELL
processor was observed to operate at 5.6 GHz with 1.4 V Vdd and consumed 180
W of power.

As described previously, the CELL processor with 8 SPE's operating at 4 GHz
has a peak throughput rate of over 256 GFlops. To provide the proper balance
between processing power and data bandwidth, an enormously capable system
interconnect and memory system interface is required for the CELL
processor. For that task, the CELL processor was designed as a Rambus
Sandwich, with Redwood Rambus Asic Cell (RRAC) acting as the system
interface on one end of the CELL processor, and the XDR (formerly
Yellowstone) high bandwidth DRAM memory system interface on the other end of
the CELL processor. Finally, the CELL processor has 2954 C4 contacts to the
3-2-3 organic package, and the BGA package is 42.5 mm by 42.5 mm in size.
The BGA package contains 1236 contacts, 506 of which are signal
interconnects and the remainder are devoted to power and ground
interconnects.



http://www.realworldtech.com/includes/images/articles/cell-2.gif

Figure 2 - A per stage circuit delay depth of 11 FO4 often leaves only 5~8
FO4 for logic

The first incarnation of the CELL processor is implemented in a 90nm SOI
process. IBM claims that while the logic complexity of each pipeline stage
is roughly comparable to other processors with a per stage logic depth of 20
FO4, aggressive circuit design, efficient layout and logic simplification
enabled the circuit designers of the CELL processor to reduce the per stage
circuit delay to 11 FO4 throughout the entire design. The design methodology
deployed for the CELL processor project provides an interesting contrast to
that of other IBM processor projects in that the first incarnation of the
CELL processor makes use of fully custom design. Moreover, the full custom
design includes the use of dynamic logic circuits in critical data paths. In
the first implementation of the CELL processor, dynamic logic was deployed
for both area minimization as well as performance enhancement to reach the
aggressive goal of 11 FO4 circuit delay per stage. Figure 2 shows that with
the circuit delay depth of 11 FO4, oftentimes only 5~8 FO4 are left for
inter-latch logic flow.
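The gap between the 11 FO4 cycle budget and the 5~8 FO4 available for logic gives a rough sense of the per-stage latch and clocking overhead. This is an inference from the figures above, not a number IBM disclosed:

```python
# Implied per-stage overhead: an 11 FO4 cycle budget leaving 5~8 FO4
# of usable logic means 3~6 FO4 go to latches, setup time, and clock
# skew (inferred from the article's figures, not stated by IBM).
CYCLE_FO4 = 11
LOGIC_FO4 = (5, 8)           # usable logic depth, best to worst case
overhead = tuple(CYCLE_FO4 - x for x in reversed(LOGIC_FO4))
print(overhead)              # (3, 6) FO4 of latch/clocking overhead
```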

The use of dynamic logic presents itself as an interesting issue in that
dynamic logic circuits rely on the capability of logic transistors to retain
a capacitive load as temporary storage. The decreasing capacitance and
increasing leakage of each successive process generation means that dynamic
logic design becomes more challenging with each successive process
generation. In addition, dynamic circuits are reportedly even more
challenging on SOI based process technologies. However, circuit design
engineers from IBM believe that the use of dynamic logic will not present
itself as an issue in the scalability of the CELL processor down to 65 nm
and below. The argument was put forth that since the CELL processor is a
full custom design, the task of process porting with dynamic circuits is no
more and no less challenging than the task of process porting on a design
without dynamic circuits. That is, since the full custom design requires the
re-examination and re-optimization of transistor and circuit characteristics
for each process generation, if a given set of dynamic logic circuits become
impractical for specific functions at a given process node, that set of
circuits can be replaced with static circuits as needed.

The process portability of the CELL processor design is an interesting topic
due to the fact that the prototype CELL processor is a large device that
occupies 221 mm² of silicon area on the 90 nm process. Comparatively, the
IBM PPC970FX processor has a die size of 62 mm² on the 90 nm process. The
natural question then arises as to whether Sony will choose to reduce the
number of SPE's to 4 for the version of the CELL processor to appear in the
next generation Playstation, or keep the 8 SPE's and wait for the 65 nm
process before it ramps up the production of the next generation
Playstation. Although no announcements or hints have been given, IBM's
belief in regards to the process portability of the CELL processor design
does bode well for the 8 SPE path since process shrinks can be relied on to
bring down the cost of the CELL processor at the 65 nm node and further at
the 45 nm node.



Floating Point Capability
As described previously, the prototype CELL processor's claim to fame is its
ability to sustain a high throughput rate of floating point operations. The
peak rating of 256 GFlops for the prototype CELL processor is unmatched by
any other device announced to date. However, the SPE's are designed for
speed rather than accuracy, and the 8 floating point operations per cycle
are single precision (SP) operations. Moreover, these SP operations are not
fully IEEE754 compliant in terms of rounding modes. In particular, the SP
FPU in the SPE rounds to zero. In this manner, the CELL processor reveals
its roots in Sony's Emotion Engine. Similar to the Emotion Engine, the SPE's
single precision FPU eschews rounding mode trivialities for speed.
Unlike the Emotion Engine, the SPE contains a double precision (DP) unit.
According to IBM, the SPE's double precision unit is fully IEEE854
compliant. This improvement represents a significant capability, as it
allows the SPE to handle applications that require DP arithmetic, which was
not possible for the Emotion Engine.

Naturally, nothing comes for free and the cost of computation using the DP
FPU is performance. Since multiple iterations of the same FPU resources are
needed for each DP computation, peak throughput of DP FP computation is
substantially lower than the peak throughput of SP FP computation. The
estimate given by IBM at ISSCC 2005 was that the DP FP computation in the
SPE has an approximate 10:1 disadvantage in terms of throughput compared to
SP FP computation. Given this estimate, the peak DP FP throughput of an 8
SPE CELL processor is approximately 25~30 GFlops when the DP FP capability
of the PPE is also taken into consideration. In comparison, Earth Simulator,
the machine that previously held the honor as the world's fastest
supercomputer, uses a variant of NEC's SX-5 CPU (0.15um, 500 MHz) and
achieves a rating of 8 GFlops per CPU. Clearly, the CELL processor contains
enough compute power to present itself as a serious competitor not only in
the multimedia-entertainment industry, but also in the scientific community
that covets DP FP performance. That is, if the non-trivial challenges of the
programming model and memory capacity of the CELL processor can be overcome,
the CELL processor may be a serious competitor in applications that its
predecessor, the Emotion Engine, could not cover.
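IBM's ballpark can be checked against the SP figure; the PPE's own DP contribution is not part of this arithmetic, and the text's 25~30 GFlops range presumably folds it in:

```python
# Rough check of the DP estimate: IBM's ~10:1 SP:DP throughput ratio
# applied to the 256 GFlops single-precision peak.
SP_PEAK_GFLOPS = 256.0       # 8 SPEs at 4 GHz
DP_RATIO = 10                # approximate disadvantage cited at ISSCC 2005

spe_dp_gflops = SP_PEAK_GFLOPS / DP_RATIO
print(spe_dp_gflops)         # ~25.6 GFlops from the SPEs alone
```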



SPE Overview

http://www.realworldtech.com/includes/images/articles/cell-3.gif

Figure 3 - SPE die photo with functional unit overlay

Figure 3 shows the die photo of the Synergistic (or just plain SIMD)
Processing Element (SPE). The SPE is a specialized processing element
dedicated to the computation of SIMD-type data streams. The SPE has 256KB of
private memory, referred to as the Load Store (LS) unit, implemented as four
separate arrays of 64 KB each. The LS is a private, non-coherent address
space that is separate from the system address space. The LS is implemented
using ECC protected arrays of single ported SRAM. The LS has been optimized
to sustain high bandwidth and small cell size. The cell size is 0.99 µm² on
the 90nm SOI process, and the access latency to the LS is 6 cycles.

SPE Architecture

To minimize usage of non-computational hardware, the SPE does not have
hardware for data fetch and branch prediction. These tasks are instead
relegated to software. The SPE implements an improper subset of the VMX
instruction set, and all instructions are 32 bits in length. The SPE
instructions operate on a unified register file with 128 registers. The
registers are 128 bits in width and most instructions operate on the 128 bit
operands by treating them as four separate 32 bit operands. Due to the 18
cycle branch misprediction penalty and the lack of a branch predictor,
tremendous effort will have to be devoted to avoiding branches. The
inclusion of the large register file is thus a necessary element in
eliminating unnecessary branches via loop unrolling.
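As an illustration of the technique (generic code, not SPE assembly), unrolling a reduction loop by four replaces three of every four loop branches with straight-line code, at the cost of keeping four accumulators live, which is exactly the kind of register pressure a 128-entry register file is sized to absorb:

```python
def dot_rolled(a, b):
    """Baseline: one loop branch per element."""
    s = 0.0
    for i in range(len(a)):
        s += a[i] * b[i]
    return s

def dot_unrolled4(a, b):
    """Unrolled by 4: one loop branch per four elements, four
    independent accumulators kept live in 'registers'."""
    n = len(a) - len(a) % 4
    s0 = s1 = s2 = s3 = 0.0
    for i in range(0, n, 4):             # 1/4 as many loop branches
        s0 += a[i]     * b[i]
        s1 += a[i + 1] * b[i + 1]
        s2 += a[i + 2] * b[i + 2]
        s3 += a[i + 3] * b[i + 3]
    for i in range(n, len(a)):           # remainder elements
        s0 += a[i] * b[i]
    return s0 + s1 + s2 + s3
```

The four accumulators are independent, so the unrolled body also exposes instruction-level parallelism to an in-order pipeline like the SPE's.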



http://www.realworldtech.com/includes/images/articles/cell-4.gif



Figure 4 - SPE Organization

The SPE is an in-order processor that can issue two instructions per cycle
to seven execution units in two different pipelines. Typically, each
instruction makes use of 3 source operands to produce 1 result. The operands
are fetched from either the register file or the forward network. Due to the
in-order nature of the pipeline and the strict issue rules, the processor
makes use of the forwarding network to minimize execution bubbles. To
support the dual issue pipelines, each of which can source 3 operands and
produce one result per cycle, the register file has 6 read ports and 2 write
ports. Register file access takes 2 cycles.
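The port count follows from the issue width, as a quick check of the figures above:

```python
# Register file ports implied by dual issue: each pipeline sources 3
# operands and writes back 1 result per cycle.
ISSUE_WIDTH = 2
SRC_OPERANDS_PER_INSN = 3
RESULTS_PER_INSN = 1

read_ports = ISSUE_WIDTH * SRC_OPERANDS_PER_INSN   # 6
write_ports = ISSUE_WIDTH * RESULTS_PER_INSN       # 2
print(read_ports, write_ports)
```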





Load Store Unit


The Load Store unit is a privately addressed, non-coherent address space for
the SPE. Data is moved in and out of the Load Store unit in 128 Byte lines
by the DMA engine. Due to the fact that the LS must simultaneously support
DMA transfers into the SPE, DMA transfers out of the SPE as well as local
accesses by the execution units, IBM expects that the LS unit would have a
utilization rate as high as 80~90% when the SPE is running optimally. As a
result, the DMA engine must schedule data transfers to avoid contentions on
the system bus and LS. While the use of the software controlled data
movement mechanism and the lack of a cache increases the difficulty of
programming the SPE, the explicit software management aspect of the SPE
means that it is well suited to support real time applications.

http://www.realworldtech.com/includes/images/articles/cell-5.gif



Figure 5 - Software scheduled threads overlapping computation and data
streaming

In the CELL processor, the software manages the DMA and reserves channels to
move data to and from the LS. The DMA is programmed and resources allocated
for the movement of data in response to requests. The request queue in the
SPE supports up to 16 outstanding requests. Each request can transfer up to
16 KB of data. Once the data is moved into the LS, the SPE then performs the
computation by accessing the private LS in isolation. Ideally, each SPE
would overlap computation with data streaming, and two or more software
managed threads can operate concurrently on an SPE at a given instant in
time. In such a scenario, while one thread is moving data in and out of the
LS via the DMA engine, a second thread can occupy the computing resources of
the SPE. Figure 5 illustrates the basic idea of using software managed
threads to explicitly overlap computation and data movement.
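The double-buffering pattern of Figure 5 can be sketched as below. Note this is a sequential simulation of the idea, not SPE code: the `dma_fetch` helper is a synchronous stand-in for the real asynchronous DMA request, and all names here are illustrative:

```python
def dma_fetch(source, index, chunk):
    """Stand-in for a DMA transfer into local store (synchronous here;
    the real engine runs the transfer concurrently with compute)."""
    return source[index * chunk:(index + 1) * chunk]

def process(buf):
    """Stand-in for the SPE's computation on one resident buffer."""
    return sum(buf)

def stream_compute(data, chunk=4):
    """Double buffering: while one buffer is computed on, the other
    is being filled for the next iteration."""
    n_chunks = (len(data) + chunk - 1) // chunk
    buffers = [dma_fetch(data, 0, chunk), None]    # prefetch chunk 0
    total = 0
    for i in range(n_chunks):
        cur, nxt = i % 2, (i + 1) % 2
        if i + 1 < n_chunks:
            buffers[nxt] = dma_fetch(data, i + 1, chunk)  # next transfer
        total += process(buffers[cur])                    # compute current
    return total
```

With a real asynchronous DMA engine, the fetch on line one of the loop body and the `process` call would overlap in time, which is the whole point of the pattern.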



SPE Pipeline

http://www.realworldtech.com/includes/images/articles/cell-6.gif

Figure 6 - SPE pipeline diagram



http://www.realworldtech.com/includes/images/articles/cell-7.gif

Table 1 - Unit latencies for SPE instructions.

Figure 6 shows the pipeline diagram of the SPE and Table 1 shows the unit
latency of the SPE. Figure 6 shows that the SPE pipeline makes heavy use of
the forward-and-delay concept to avoid the access latency of a register file
access in the case of dependent instructions that flow through the pipeline
in rapid succession.

One interesting aspect of the floating point pipeline is that the same
arrays are used for floating point computation as well as integer
multiplication. As a result, integer multiplies are sent to the floating
point pipeline, and the floating point pipeline bypasses the FP handling and
computes the integer multiply.



SPE Schmoo Plot

http://www.realworldtech.com/includes/images/articles/cell-8.gif



Figure 7 - Schmoo plot for the SPE

Figure 7 shows the schmoo plot for the SPE. The schmoo plot shows that the
SPE can comfortably operate at a frequency of 4 GHz with Vdd of 1.1 V,
consuming approximately 4 W. The schmoo plot also reveals that due to the
careful segmentation of signal path lengths, the design is far from being
wire delay limited. Frequency scaling relative to voltage continues past 1.3
V. This schmoo plot also contributes to the plausibility of the unconfirmed
report that the CELL processor could operate at upwards of 5.6 GHz.

"Unknown" Functional Units: ATO and RTB


Oftentimes when a paper relating to a complex project is written
collaboratively by a group of people, details are lost. Still, it appeared
rather humorous that of the six design engineers and architects from the
CELL processor project present at Tuesday evening's chat session, no one
could recall what the acronyms ATO and RTB stood for. ATO and RTB are
functional blocks labeled in the floorplan of the SPE. However, the
functionality of these functional blocks or the meaning of the acronym were
neither noted on the floorplan, nor explained in the paper, nor mentioned in
the technical presentation. In an effort to cover all the corners, this
author placed the question on a list of questions to be asked of the CELL
project team members. Hilarity thus ensued as slightly embarrassed CELL
project members stared blankly at each other in an attempt to recall the
functionality or definition of the acronyms.

In all fairness, since the SPE was presented on Monday and the CELL
processor itself was presented on Tuesday, CELL project members responsible
for the SPE were not present for Tuesday evening's chat sessions. As a
result, the team members responsible for the overall CELL processor and
internal system interconnects were asked to recall the meaning of acronyms
of internal functional units within the SPE. Hence, the task was
unnecessarily complicated by the absence of key personnel that would have
been able to provide the answer faster than the CELL processor can rotate a
million triangles by 12 degrees about the Z axis.

After some discussion (and more wine), it was determined that the ATO unit
is most likely the Atomic (memory) unit responsible for coherency
observation/interaction with dataflow on the EIB. Then, after the injection
of more liquid refreshments (CH3CH2OH), it was theorized that the RTB most
likely stood for some sort of Register Translation Block whose precise
functionality was unknown to those outside of the SPE. However, this theory
would turn out to be incorrect.

Finally, after a sufficient number of hydrocarbon bonds had been broken down
into H-OH on Wednesday, a member of the CELL processor team tracked
down the relevant information and writes:

The R in RTB is an internal 1 character identifier that denotes that the RTB
block is a unit in the SPE. The TB in RTB stands for "Test Block". It
contains the ABIST (Array Built In Self Test) engines for the Local Store
and other arrays in the SPE, as well as other test related control functions
for the SPE.



Element Interconnect Bus


The element interconnect bus is the on chip interconnect that ties together
all of the processing, memory, and I/O elements on the CELL processor. The
EIB is implemented as a set of four concentric rings that is routed through
portions of the SPE, where each ring is a 128 bit wide interconnect. To
reduce coupling noises, the wires are arranged in groups of four and
interleaved with ground and power shields. To further reduce coupling
noises, the direction of data flow alternates between each adjacent ring
pair. Data travels on the EIB through staged buffer/repeaters at the
boundaries of each SPE. That is, data is driven by one set of staged buffers
and latched by the buffers at the next stage every clock cycle. Data moving
from one SPE through other SPE's requires the use of repeaters in the
intermediary SPE's for the duration of the transfer. Independently from the
buffer/repeater elements, separate data on/off ramps exist in the BIU of the
SPE, as data targeted for the LS unit of a given SPE can be off-loaded at
the BIU. Similarly, outgoing data can be placed onto the EIB by the BIU.



http://www.realworldtech.com/includes/images/articles/cell-9.gif



Figure 8 - Counter rotational rings of the EIB - 4 SPE's shown

The design of the EIB is specifically geared toward the scalability of the
CELL processor. That is, signal path lengths on the EIB do not change
regardless of the number of SPE's in a given CELL processor configuration.
Since the data travels no more than the width of one SPE, more SPE's on a
given CELL processor simply means that the data transport latency increases
by the number of additional hops through those SPE's. Data transfer through
the EIB is controlled by the EIB controller, which works
with the DMA engine and the channel controllers to reserve the buffer
drivers for a certain number of cycles for each data transfer request. The
data transfer algorithm works by reserving channel capacity for each data
transfer, thus providing support for real time applications. Finally, the
design and implementation of the EIB has a curious side effect in that it
limits the current version of the CELL processor to expand only along the
horizontal axis. Thus, the EIB enables the CELL processor to be highly
configurable and SPE's can be quickly and easily added or removed along the
horizontal axis, and the maximum number of SPE's that can be added is set by
the maximum chip width allowed by the reticle size of the
fabrication equipment.
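A minimal sketch of the latency property described above, assuming uniform one-element-per-hop transport and that a transfer may use whichever ring direction yields fewer hops (the rings run in both directions):

```python
def eib_hops(src, dst, n_elements):
    """Hop count between two elements on counter-rotating rings:
    take the shorter of the clockwise and counter-clockwise paths."""
    d = (dst - src) % n_elements
    return min(d, n_elements - d)
```

Because each hop crosses exactly one element regardless of how many SPE's the chip carries, adding SPE's lengthens worst-case transport latency by a fixed amount per SPE without changing any signal path length, which is the scalability argument made above.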



The POWERPC Processing Element
Neither microarchitectural details nor the performance characteristics of
the POWERPC Processing Element were disclosed by IBM during ISSCC 2005.
However, what is known is that the PPE processor core is a new core that is
fully compliant with the POWERPC instruction set, the VMX instruction set
extension inclusive. Additionally, the PPE core is described as a two issue,
in-order, 64 bit processor that supports 2 way SMT. The L1 caches of the
PPE are reported to be 32KB each, and the unified L2 cache is 512 KB in
size. Furthermore, the lineage of the PPE can be traced to a research
project commissioned by IBM to examine high speed processor design with
aggressive circuit implementations. The results of this research project
were published by IBM first in the Journal of Solid State Circuits (JSSC) in
1998, then again in ISSCC 2000.

The paper published in JSSC in 1998 described a processor implementation
that supported a subset of the POWERPC instruction set, and the paper
published in ISSCC 2000 described a processor that supported the complete
POWERPC instruction set and operated at 1 GHz on a 0.25µm process
technology. The microarchitecture of the research processor was disclosed in
some detail in the ISSCC 2000 paper. However, that processor was a single
issue processor whose design goal was to reach high operating frequency by
limiting pipestage delay to 13 FO4, and power consumption limitations were
not considered. For the PPE, several major changes in the design goal
dictated changes in the microarchitecture from the research processor
disclosed at ISSCC in 2000. Firstly, to further increase frequency, the per
stage circuit delay design target was lowered from 13 FO4 to 11 FO4.
Secondly, limiting power consumption and minimizing leakage current were
added as high priority design goals for the PPE. Collectively, these changes
limited the per stage logic depth, and the pipeline was lengthened as a
result. The addition of SMT and the two issue design goal completed the
metamorphosis of the research processor to the PPE. The result is a
processing core that operates at a high frequency with relatively low power
consumption, and perhaps relatively poorer scalar performance compared to
the beefy POWER5 processor core.



Rambus XDR Memory System

http://www.realworldtech.com/includes/images/articles/cell-10.gif



Figure 9 - The two channel XDR Memory System

To provide machine balance and support the peak rating of more than 256 SP
GFlops (or 25-30 DP GFlops), the CELL processor requires an enormously
capable memory system. For that reason, two channels of Rambus XDR memory
are used to obtain 25.6 GB/s of memory bandwidth. In the XDR memory system,
each channel can support a maximum of thirty-six devices connected to the
same command and address bus. The data bus for each device connects to the
memory controller through a set of bi-directional point-to-point
connections. In the XDR memory system, addresses and commands are sent on
the address and command bus at a rate of 800 Mbits per second (Mbps), and
the point-to-point interface operates at a data rate of 3.2 Gbps. Using DRAM
devices with 16-bit wide data busses, each channel of XDR memory can sustain
a maximum bandwidth of 102.4 Gbps (2 x 16 x 3.2), or 12.8 GB/s. The CELL
processor can thus achieve a maximum bandwidth of 25.6 GB/s with a
two-channel, four-device configuration.
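The bandwidth arithmetic above can be checked directly. The figures are taken from the text; this computes raw signaling rates only, not sustained throughput:

```python
# XDR memory bandwidth, per the configuration described above:
# 2 channels, 2 DRAM devices per channel, 16-bit data bus per device,
# 3.2 Gbps per data pin.
channels = 2
devices_per_channel = 2
bits_per_device = 16
gbps_per_pin = 3.2

per_channel_gbps = devices_per_channel * bits_per_device * gbps_per_pin
per_channel_gbs = per_channel_gbps / 8          # bits -> bytes
total_gbs = channels * per_channel_gbs

print(per_channel_gbps)  # 102.4 Gbps per channel
print(per_channel_gbs)   # 12.8 GB/s per channel
print(total_gbs)         # 25.6 GB/s for the two-channel system
```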

The obvious advantage of the XDR memory system is the bandwidth that it
provides to the CELL processor. However, in the configuration illustrated in
figure 9, the maximum of 4 DRAM devices means that the CELL processor is
limited to 256 MB of memory, given that the highest-capacity XDR DRAM device
currently holds 512 Mbits. Fortunately, XDR DRAM devices could in theory be
reconfigured so that more than 36 XDR devices connect to the same 36-bit
wide channel, each contributing a 1-bit wide data bus to the 36-bit wide
point-to-point interconnect. In such a configuration, a two-channel XDR
memory system could support upwards of 16 GB of ECC-protected memory with
256 Mbit DRAM devices, or 32 GB of ECC-protected memory with 512 Mbit DRAM
devices. As a result, the CELL processor could in theory address a large
amount of memory if the price premium of XDR DRAM devices could be
minimized. IBM did not release detailed information about the configuration
of the XDR memory system. One feature to watch for in the future is ECC
support in the DRAM memory system. Since ECC support is clearly not a
requirement of a processor to be used in a game machine, the presence of ECC
support would likely indicate IBM's ambition to promote the use of CELL
processors in applications that require superior reliability, availability
and serviceability, such as HPC, workstation or server systems.

Incidentally, Toshiba is a manufacturer of XDR DRAM devices. Presumably it
brought the XDR memory controller and memory system design expertise to the
table, and could ramp up production of XDR DRAM devices as needed.

FlexIO System Interface

At ISSCC 2005, Rambus presented a paper on the FlexIO interface used on the
CELL processor. However, the presentation was limited to describing the
physical layer interconnect. Specifically, the difficulties of implementing
the Redwood Rambus ASIC Cell on IBM's 90nm SOI process were examined in some
detail. While circuit level issues regarding the challenges of designing
high speed I/O interfaces on an SOI based process are in their own right
extremely intriguing topics, the focus of this article is geared toward the
architectural implications of the high bandwidth interface. As a result, the
circuit level details will not be covered here. Interested readers are
encouraged to seek out details on Rambus's Redwood technology separately.


What is known about the system interface of the CELL processor is that the
FlexIO consists of 12 byte lanes. Each byte lane is an 8-bit wide,
source-synchronous, unidirectional, point-to-point interconnect. The FlexIO
makes use of differential signaling to achieve a data rate of 6.4 Gb per
second per signal pair, which in turn translates to 6.4 GB/s per byte lane.
The 12 byte lanes are asymmetric in configuration: 7 byte lanes are outbound
from the CELL processor, while 5 byte lanes are inbound. The 12 byte lanes
thus provide 44.8 GB/s of raw outbound bandwidth and 32 GB/s of raw inbound
bandwidth, for a total I/O bandwidth of 76.8 GB/s. Furthermore, the byte
lanes are arranged into two groups of ports: one group is dedicated to
non-coherent off-chip traffic, while the other is usable for coherent
off-chip traffic. It seems clear that Sony itself is unlikely to make use
of a coherent, multiple-CELL-processor configuration for the Playstation 3.
However,
the fact that the PPE and the SPE's can snoop traffic transported through
the EIB, and that coherency traffic can be sent to other CELL processors via
a coherent interface, means that the CELL processor can indeed be an
interesting processor. If nothing else, the CELL processor should enable
startups that propose to build FlexIO based coherency switches to garner
immediate interest from venture capitalists.
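As a quick sanity check, the FlexIO figures above are internally consistent. The sketch below recomputes them from the stated per-lane rate; protocol overhead is not modeled:

```python
# FlexIO raw bandwidth: 12 unidirectional byte lanes, each an 8-bit
# lane signaling at 6.4 Gbps per pair, i.e. 6.4 GB/s per lane.
gbs_per_lane = 6.4
outbound_lanes = 7   # lanes leaving the CELL processor
inbound_lanes = 5    # lanes entering the CELL processor

outbound = outbound_lanes * gbs_per_lane  # 44.8 GB/s raw outbound
inbound = inbound_lanes * gbs_per_lane    # 32 GB/s raw inbound
total = outbound + inbound                # 76.8 GB/s aggregate I/O
print(f"{outbound:.1f} {inbound:.1f} {total:.1f}")
```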



Summary
The CELL processor presents an intriguing alternative in its pursuit of
performance. It seems to be a foregone conclusion that the CELL processor
will be an enormously successful product, and that millions of CELL
processors will be sold as the processors that power the next-generation
Sony Playstation. However, IBM has designed some features into the CELL
processor that clearly reveal its ambition in seeking new applications for
the CELL processor. At ISSCC 2005, much fanfare was generated by the
rating of 256 GFlops @ 4 GHz for the CELL processor. However, it is the
little-mentioned double precision capability and the as-yet-undisclosed
system level coherency mechanism that appear to be the most intriguing
aspects that could enable the CELL processor to find success not just
inside the Playstation, but outside of it as well.

References
[1] J. Silberman et al., "A 1.0-GHz Single-Issue 64-Bit PowerPC Integer
Processor", IEEE Journal of Solid-State Circuits, Vol. 33, No. 11, Nov. 1998.
[2] P. Hofstee et al., "A 1 GHz Single-Issue 64b PowerPC Processor",
International Solid-State Circuits Conference Technical Digest, Feb. 2000.
[3] N. Rohrer et al., "PowerPC in 130nm and 90nm Technologies",
International Solid-State Circuits Conference Technical Digest, Feb. 2004.
[4] B. Flachs et al., "A Streaming Processing Unit for a CELL Processor",
International Solid-State Circuits Conference Technical Digest, Feb. 2005.
[5] D. Pham et al., "The Design and Implementation of a First-Generation
CELL Processor", International Solid-State Circuits Conference Technical
Digest, Feb. 2005.
[6] J. Kuang et al., "A Double-Precision Multiplier with Fine-Grained
Clock-Gating Support for a First-Generation CELL Processor", International
Solid-State Circuits Conference Technical Digest, Feb. 2005.
[7] S. Dhong et al., "A 4.8 GHz Fully Pipelined Embedded SRAM in the
Streaming Processor of a CELL Processor", International Solid-State Circuits
Conference Technical Digest, Feb. 2005.
[8] K. Chang et al., "Clocking and Circuit Design for a Parallel I/O on a
First-Generation CELL Processor", International Solid-State Circuits
Conference Technical Digest, Feb. 2005.
 
begin NEXT BOX wrote:

< snip idiotic large post >

Why do you think you have to clog the groups with this stuff? Anyone who is
interested in it could easily get it without posting this inane drivel to
offtopic groups

Could it be that you do it because you are a windows user, that is, by
default an incredibly stupid dimwit?
 
Tony Hill said:
On Thu, 10 Feb 2005 07:48:06 -0500, "Fetch, Rover, Fetch"

I'm not sure it's fair to call it "cache" -- the term used in the
press releases were, IIRC, "local memory", which I suspect implies
that they must be managed directly (probably) by the main core.

(I guess this was the real origin of all this nonsense about
distributing packets of instructions and data all over the internet
for computation. Or whatever the hype was made to sound like)
Sure, they could, but they wouldn't be anywhere near as fast as any
current processors unless you can make use of the vector engine.
And doing that means changing ALL of the software, and that is
*EXPENSIVE*.

Only Photoshop -- this is Apple, you know :-) (Photoshop filters could
probably be made to scream on Cell.)
We will definitely NOT see the Cell processor (or its descendants)
showing up in any meaningful way (ie not just a couple special-purpose
workstations from IBM) as the main processor in desktop computers for
AT LEAST 5 years, probably more like 10+ years if ever. And yes, you
can quote me on that.

While I'd like to have virtualization on my PC, I'm not convinced of
its uses in PS3. Was it simply too difficult to remove from the POWER
core, or what? :-)

-kzm
 
also if you had bothered to read -
the cell's G5 carries a 32K level cache, and a 512K level 2 cache
while the 'satellite' cpus carry the 256K cache

I'm pretty sure the co-processor cores have 256K of fast memory,
*but it's not a cache*. It's addressable memory; the software has to
manage moving things between it and main memory.
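The distinction being drawn here, a software-managed local store versus a transparent hardware cache, can be sketched abstractly. This is an illustrative model only, not the Cell's actual DMA interface:

```python
# Illustrative model of a software-managed local store (as on the Cell's
# coprocessor cores) vs. a hardware cache: with a local store, software
# must explicitly stage data in before computing on it.
LOCAL_STORE_SIZE = 256 * 1024  # 256 KB, per the thread

main_memory = bytearray(1024 * 1024)   # stand-in for system DRAM
local_store = bytearray(LOCAL_STORE_SIZE)

def dma_in(src_offset, length):
    """Explicitly copy a block from main memory into the local store.
    A cache would do this transparently on access; here the software
    itself must schedule the transfer and chunk its working set."""
    assert length <= LOCAL_STORE_SIZE, "working set must fit in 256 KB"
    local_store[:length] = main_memory[src_offset:src_offset + length]

# Compute only ever touches the local store; anything larger than
# 256 KB has to be processed in explicitly managed chunks.
dma_in(0, 16 * 1024)
checksum = sum(local_store[:16 * 1024])
print(checksum)  # 0, since the stand-in memory is zero-initialized
```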

followups to RGV.sony
 
Alex said:
Actually, intel also has a streamlined processor with *16* additional
processors on the same chip. and a RDRAM interface. and honking I/O
bandwidth.

http://www.intel.com/design/network/products/npfamily/docs/ixp2800_docs.htm#Datasheets

Similar concept. Very different purpose. The ixp2800 uses an ARM core
with no FPU, whereas the Cell uses a POWER core with the full
instruction set and all the necessary execution units. The ixp2800
microengines are specialized units for data transfer: they move things
around. The Cell coprocessors are specialized units for FP vector
processing: they calculate things. The ideas behind the two chips are
very similar, but they are not at all competitors...totally different
specializations.

Alex
 
Tony Hill wrote:

....
Even in SPEC CFP2000, kind of the ideal benchmark for this
super-bandwidth setup, the EV7 is still being beaten by current x86
processors. When looking at other tests it's even less impressive.

Yeah - really pathetic performance from a 1998 core two full process
generations out of date.

- bill
 
Tony Hill poked his little head through the XP firewall and said:
Riiiiigghtttt.... And just WHO is going to write the software for my
networked toilet? Please tell me it's not Microsoft, because that's
one place I do NOT want a blue-screen! :>

You mean a blue-bowl, don't you? According to the teevee, many American
housewives have orgasms over the sight of a blue toilet bowl.
 
In comp.arch NEXT BOX said:
Back to Basics
The fundamental task of a processor is to manage the flow of data through
its computational units. However in the past two decades, each successive
generation of processors for personal computers has added more transistors

[The rest of my article snipped]

Dear "NEXT BOX",

You managed to copy everything except the following copyright
statement. You are expressly forbidden to do what you did. Please
cease and desist. I will be contacting your ISP shortly.

Thank you for your attention to this matter.
 
Riiiiigghtttt.... And just WHO is going to write the software for my
networked toilet? Please tell me it's not Microsoft, because that's
one place I do NOT want a blue-screen! :>

I'm working on this top secret OS. I'm stuck choosing between the
names Cancer or Prison for this Cell OS. And don't worry, I'll be
using pink instead of blue :ppPPPp

--
L.Angel: I'm looking for web design work.
If you need basic to med complexity webpages at affordable rates, email me :)
Standard HTML, SHTML, MySQL + PHP or ASP, Javascript.
If you really want, FrontPage & DreamWeaver too.
But keep in mind you pay extra bandwidth for their bloated code
 
a?n?g?e? said:
I'm working on this top secret OS. I'm stuck choosing between the
names Cancer or Prison for this Cell OS. And don't worry, I'll be
using pink instead of blue :ppPPPp

Pink? ..fOr a Cell? It *must* be blue. The pink company went pork and
was bought out by the 'Q', which...
 
Bill Todd said:
Tony Hill wrote:
Yeah - really pathetic performance from a 1998 core two full process
generations out of date.


Uh, released in 2003, not 1998. If you want to talk about when the core
design was started, I could make a case that the Pentium-M is a 1992 core :)
 
Soni tempori elseu romani yeof helsforo nisson ol sefini ill des Fri, 11 Feb
2005 16:52:21 +0000 (UTC), sefini jorgo geanyet des mani yeof do
[The rest of my article snipped]

Dear "NEXT BOX",

You managed to copy everything except the following copyright
statement. You are expressly forbidden to do what you did. Please
cease and desist. I will be contacting your ISP shortly.

Thank you for your attention to this matter.

You will have a small army of supporters, Mr Wang. May I suggest you take it
further than just contacting his ISP? You'd be doing the whole of Usenet a
great service if you got him banged up for this :)


deKay
 
Uh, released in 2003, not 1998. If you want to talk about when the core
design was started, I could make a case that the Pentium-M is a 1992 core :)

Or Itanic a '94 core. ;-)
 