The EIB on Cell?

Robert Myers · Mar 8, 2005

Greetings!

http://arstechnica.com/articles/paedia/cpu/cell-2.ars

has a nice, if puzzling, layout of the Cell processor architecture.
What puzzles me is the "EIB," which, wideband though it may be, and
operating at a modest frequency, has *eleven* connections to what is
diagrammed as a single shared bus.

<quote>

The individual SPEs can use this bus to communicate with each other,
and this includes the transfer of data in between SPEs acting as peers
on the network. The SPEs also communicate with the L2 cache, with main
memory (via the MIC), and with the rest of the system (via the BIC).
The onboard memory interface controller (MIC) supports the new Rambus
XDR memory standard, and the BIC (which I think stands for "bus
interface controller" but I'm not 100% sure) has a coherent interface
for SMP and a non-coherent interface for I/O.

</quote>

Seems like that's a great deal of traffic and a lot of drops for one
bus. Any thoughts?

RM

David Wang · Mar 8, 2005

Robert Myers said:
Greetings!

has a nice, if puzzling, layout of the Cell processor architecture.
What puzzles me is the "EIB," which, wideband though it may be, and
operating at a modest frequency, has *eleven* connections to what is
diagrammed as a single shared bus.

The individual SPEs can use this bus to communicate with each other,
and this includes the transfer of data in between SPEs acting as peers
on the network. The SPEs also communicate with the L2 cache, with main
memory (via the MIC), and with the rest of the system (via the BIC).
The onboard memory interface controller (MIC) supports the new Rambus
XDR memory standard, and the BIC (which I think stands for "bus
interface controller" but I'm not 100% sure) has a coherent interface
for SMP and a non-coherent interface for I/O.

Seems like that's a great deal of traffic and a lot of drops for one
bus. Any thoughts?

Not a multi-drop bus.

It's a repeater ring.

http://www.realworldtech.com/page.cfm?ArticleID=RWT021005084318&p=9

The more SPEs you add, the more cycles it takes for data to hop across
the ring, clockwise or counter-clockwise.
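
A minimal sketch of that hop count, assuming a bidirectional ring with evenly spaced stops. The function names and the stop count of 12 are illustrative assumptions, not anything taken from IBM's documentation:

```python
from typing import Tuple


def ring_hops(src: int, dst: int, stops: int) -> Tuple[int, str]:
    """Return (hops, direction) for the shorter way round a bidirectional ring."""
    if not (0 <= src < stops and 0 <= dst < stops):
        raise ValueError("stop index out of range")
    cw = (dst - src) % stops    # hops going clockwise
    ccw = (src - dst) % stops   # hops going counter-clockwise
    return (cw, "cw") if cw <= ccw else (ccw, "ccw")


def worst_case_hops(stops: int) -> int:
    """Longest shortest path on the ring: half way round."""
    return stops // 2


print(ring_hops(0, 3, 12))   # (3, 'cw')
print(ring_hops(0, 9, 12))   # (3, 'ccw')
print(worst_case_hops(12))   # 6
```

Under those assumptions the worst case grows with the number of stops, but only at half the rate, since traffic can go whichever way round is shorter.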

The data rings operate at 4 GHz. The control part operates at half
of that frequency.

I am guessing that the control is a multidrop bus, although I do not
know for certain.

Robert Myers · Mar 8, 2005

Not a multi-drop bus.

It's a repeater ring.

http://www.realworldtech.com/page.cfm?ArticleID=RWT021005084318&p=9

The more SPEs you add, the more cycles it takes for data to hop across
the ring, clockwise or counter-clockwise.

The data rings operate at 4 GHz. The control part operates at half
of that frequency.

I am guessing that the control is a multidrop bus, although I do not
know for certain.

Thanks. I had actually looked at your writeup on realworldtech. For
some reason, the (really very clear) diagram of the EIB interconnect
didn't register.

The one-hop interconnect is consistent with the streaming architecture
I had been expecting.

RM

daytripper · Mar 8, 2005

Not a multi-drop bus.

It's a repeater ring.

http://www.realworldtech.com/page.cfm?ArticleID=RWT021005084318&p=9

The more SPEs you add, the more cycles it takes for data to hop across
the ring, clockwise or counter-clockwise.

The data rings operate at 4 GHz. The control part operates at half
of that frequency.

I am guessing that the control is a multidrop bus, although I do not
know for certain.

That'd pretty much clobber any point in running the ring at 4 GHz, wouldn't it?
Hopefully, all control is in-band...

/daytripper

Larry R. Moore · Mar 11, 2005

Some additional thoughts on the EIB logic:

http://www.electronicsweekly.com/articles/article.asp?liArticleID=38754

"Connecting up the processing units is the element interconnect bus
(EIB), comprising four 128-bit rings and a 64-bit tag running at half
the processor clock."

I read this to say that both the rings and the tag are run at half the
processor clock frequency.

http://www.hpcaconf.org/hpca11/papers/25_hofstee-cellprocessor_final.pdf

"The processor is constructed around a high bandwidth on-chip SMP
fabric capable of supporting up to 96 bytes per processor cycle total
using up to 12 simultaneous transfers."

Dr. Hofstee then lists the eleven elements supported by this bus on
CELL: the PPE, eight SPE's, XDR memory and FlexIO.

"Each of the listed elements has 8 byte wide (relative to processor
frequency) inbound and outbound interfaces to the coherent on-chip bus,
except the I/O interface unit which has two 8 byte interfaces."

There are, therefore, a total of twelve (12) I/O interfaces to the bus.
Since there are 12 outbound interfaces and 12 inbound interfaces on the
bus, a maximum of twelve simultaneous 8-byte transfers (say, to adjacent
elements) can take place with each processor clock cycle: 12 x 8 = 96.
This is consistent with the 96 bytes per processor cycle in the quote
above.

Hofstee is careful to define the 8 byte interfaces as "relative to
processor frequency". If the bus is running at one half of the
processor clock as specified by the first article, the I/O interfaces
are effectively 16 bytes wide, equal to the word width of 128 bits. The
16 byte word is probably input to the selected ring as two 8 byte
syllables at the processor frequency.
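
The arithmetic in the last two paragraphs can be checked in a few lines. This is only a sketch: the constant and function names are made up for illustration, and the half-speed bus clock is taken from the Electronics Weekly quote rather than confirmed.

```python
# Figures taken from the quotes above; all names are illustrative only.
PORTS = 12                          # 11 elements, the I/O unit counted twice
BYTES_PER_PORT_PER_CPU_CYCLE = 8    # "8 byte wide (relative to processor frequency)"
CPU_CYCLES_PER_BUS_CYCLE = 2        # assumption: bus at half the processor clock


def peak_bytes_per_cpu_cycle() -> int:
    """Aggregate peak if every port moves 8 bytes in the same cycle."""
    return PORTS * BYTES_PER_PORT_PER_CPU_CYCLE


def port_width_bits_per_bus_cycle() -> int:
    """Effective port width when counted per (slower) bus cycle."""
    return BYTES_PER_PORT_PER_CPU_CYCLE * CPU_CYCLES_PER_BUS_CYCLE * 8


print(peak_bytes_per_cpu_cycle())        # 96
print(port_width_bits_per_bus_cycle())   # 128
```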

A "tag" as mentioned in the first article is something that usually
travels with a "package", in this case, the data. I don't understand
how a single tag can travel with data moving in two directions. Is it
possible that the tag is communicated somehow to the bus interface
controller (BIC)? The BIC must be told the destination of the data in
order to schedule its path. It would seem to me that there must be a
tag generated at each element outbound interface for each transmission
of data.

It is interesting that Hofstee refers to the EIB as a "fabric" because
switching fabrics require that data packets be tagged with destination
address at the very least. It is possible (pure speculation, of course)
that the 64 bit tags indicate source and destination addresses, and
number of words in the data packet. Hmmmm. This sounds like FlexIO
(similar to RapidIO). If you combine the EIB and the BIC, they do look
like a packet switching fabric with twelve ports.

On the other hand, IBM may have implemented something much simpler. As
a designer of image processing hardware/software (at the systems
level), I have seen systems (such as Datacube) that required the
datapath be predefined for the algorithm. It was configured on startup
and remained static while processing images. IBM wouldn't do that,
would they? I don't think so.

David Wang · Mar 11, 2005

Larry R. Moore said:
Some additional thoughts on the EIB logic:

"Connecting up the processing units is the element interconnect bus
(EIB), comprising four 128-bit rings and a 64-bit tag running at half
the processor clock."

I read this to say that both the rings and the tag are run at half the
processor clock frequency.

I had read this to mean that the data rings and the tag run at
different frequencies, which didn't concern me much because they
were going to different places.

http://www.hpcaconf.org/hpca11/papers/25_hofstee-cellprocessor_final.pdf

Hofstee is careful to define the 8 byte interfaces as "relative to
processor frequency". If the bus is running at one half of the
processor clock as specified by the first article, the I/O interfaces
are effectively 16 bytes wide, equal to the word width of 128 bits. The
16 byte word is probably input to the selected ring as two 8 byte
syllables at the processor frequency.

A "tag" as mentioned in the first article is something that usually
travels with a "package", in this case, the data. I don't understand
how a single tag can travel with data moving in two directions. Is it
possible that the tag is communicated somehow to the bus interface
controller (BIC)? The BIC must be told the destination of the data in
order to schedule its path. It would seem to me that there must be a
tag generated at each element outbound interface for each transmission
of data.

My understanding of the EIB is that the rings are controlled by the
switching network actually labelled as EIB in the center of the chip.
The rings themselves, as physical wires, run over parts of the SPEs, and
the EIB reaches into the SPEs to direct on/off/repeat buffer
operations. The scheduling for the EIB is coordinated with the little
block labelled as MBL (Master Bus Logic? I'm not sure). The BIC
controls the FlexIO, not the EIB. The FlexIO block is a special
circuit that Rambus developed for IBM's 90nm SOI process, and the BIC
is the logic that drives the FlexIO circuits.

Anyways, the reason why I think the "tag" runs to/from different places
is this: data doesn't have to travel to all SPEs, it just has to travel
from source to destination, but the tags have to be broadcast to all
SPEs because the SPEs do have to snoop the tag (bus?)
for coherency of addresses in the host processor's address space.

That's why I think that the tag part is a different sort of animal
that lets you broadcast things and put address requests on it; the EIB
controller then sets up the switching fabric that directs the
on/off/pass-through operations on the data rings.

It is interesting that Hofstee refers to the EIB as a "fabric" because
switching fabrics require that data packets be tagged with destination
address at the very least. It is possible (pure speculation, of course)
that the 64 bit tags indicate source and destination addresses, and
number of words in the data packet. Hmmmm. This sounds like FlexIO
(similar to RapidIO). If you combine the EIB and the BIC, they do look
like a packet switching fabric with twelve ports.

I don't think it's a packet network. I think the data rings just carry
data, and the request is encapsulated in the tag ring/bus/blah, that
request goes through the EIB/MBL, sets up the transfer, and tells the
destination guy that something is coming.

All of this is based on my understanding of the EIB, which may contain
inaccuracies. I can queue up the question and send the CELL guys
a list of questions once I'm done with the list, and see if I can get a
clarification on the mechanism.

Robert Myers · Mar 11, 2005

My understanding of the EIB is that the rings are controlled by the
switching network actually labelled as EIB in the center of the chip.
The rings themselves, as physical wires, run over parts of the SPEs, and
the EIB reaches into the SPEs to direct on/off/repeat buffer
operations. The scheduling for the EIB is coordinated with the little
block labelled as MBL (Master Bus Logic? I'm not sure). The BIC
controls the FlexIO, not the EIB. The FlexIO block is a special
circuit that Rambus developed for IBM's 90nm SOI process, and the BIC
is the logic that drives the FlexIO circuits.

Anyways, the reason why I think the "tag" runs to/from different places
is this: data doesn't have to travel to all SPEs, it just has to travel
from source to destination, but the tags have to be broadcast to all
SPEs because the SPEs do have to snoop the tag (bus?)
for coherency of addresses in the host processor's address space.

That's why I think that the tag part is a different sort of animal
that lets you broadcast things and put address requests on it; the EIB
controller then sets up the switching fabric that directs the
on/off/pass-through operations on the data rings.

How do you envision that working, in practice? The on/off/pass
through setting of the interface at each SPE is programmed on time?
Pass through so many clock ticks, consume data for so many clock
ticks? How does the consuming SPE know what data it is getting?

RM

David Wang · Mar 11, 2005

How do you envision that working, in practice? The on/off/pass
through setting of the interface at each SPE is programmed on time?

Yes.

The way I imagine the EIB working is based on the notion that
"the user controls all data movement explicitly" via a
software-managed thread. So all the "tags" (i.e. requests) are going
to be initiated by the PPE. Each request is sent to the MBL, which
programs the EIB control for the on/off/pass-through operations
as the data streams are moved from point A to point B. That
tag has to be sent from the PPE/MBL through the tag structure
to all SPEs in the processor, so they have a chance to
snoop it and intervene if necessary. The "intervention" mechanism in
turn means that the SPEs must be able to respond in some way
via the same tag structures back to the PPE/MBL.

The EIB controller knows how long to hold each repeater element,
so when it's done, it just releases the switch and the switching
element can then be used for the construction of another set of
pipes to direct dataflow.

Pass through so many clock ticks, consume data for so many clock
ticks? How does the consuming SPE know what data it is getting?

The EIB controller will just have to hold the switches for as many
ticks as required. This jibes nicely with the description of
"reserving channel capacity deterministically".

The SPE knows what data it is getting because the tag structure
interface reaches into the SPE and touches the DMA engine. The
DMA engine knows where the data is coming from and where in LS to
put that data. Or where in LS it should grab the data from and
how much of it to put onto the on ramp of the EIB.

<disclaimer>

Based on my understanding of the EIB control flow mechanism,
obtained from a 20 minute chat with the DE who designed the
EIB, with diagrams drawn literally on the back of a napkin.
It may contain inaccuracies due to faulty memory or incorrect
interpretation of statements.

</disclaimer>

Robert Myers · Mar 11, 2005

Yes.

The way I imagine the EIB working is based on the notion that
"the user controls all data movement explicitly" via a
software-managed thread. So all the "tags" (i.e. requests) are going
to be initiated by the PPE. Each request is sent to the MBL, which
programs the EIB control for the on/off/pass-through operations
as the data streams are moved from point A to point B. That
tag has to be sent from the PPE/MBL through the tag structure
to all SPEs in the processor, so they have a chance to
snoop it and intervene if necessary. The "intervention" mechanism in
turn means that the SPEs must be able to respond in some way
via the same tag structures back to the PPE/MBL.

The EIB controller knows how long to hold each repeater element,
so when it's done, it just releases the switch and the switching
element can then be used for the construction of another set of
pipes to direct dataflow.

That makes the network like a circuit-switched telephone network.
Producer and consumer have a reserved connection until the
communication is complete, with the MBL giving a fast busy when no
circuit is available.
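
The circuit-switched reading can be sketched as a toy arbiter. Everything here is hypothetical: the class name `RingArbiter`, the per-segment bookkeeping, the single unidirectional ring, and the idea that the whole path is held for the full transfer are assumptions for illustration, not a description of what the MBL actually does.

```python
class RingArbiter:
    """Toy circuit-switched arbiter for one unidirectional ring.

    Hypothetical model: segment i is the link from stop i to stop i + 1.
    """

    def __init__(self, stops: int):
        self.stops = stops
        # Tick at which each segment becomes free again.
        self.busy_until = [0] * stops

    def request(self, src: int, dst: int, ticks: int, now: int) -> bool:
        """Reserve every segment from src to dst for `ticks` ticks, or refuse."""
        hops = (dst - src) % self.stops
        path = [(src + i) % self.stops for i in range(hops)]
        if any(self.busy_until[seg] > now for seg in path):
            return False  # "fast busy": no circuit available right now
        for seg in path:
            self.busy_until[seg] = now + ticks
        return True


arb = RingArbiter(stops=12)
print(arb.request(0, 3, ticks=8, now=0))   # True: segments 0, 1, 2 reserved
print(arb.request(1, 2, ticks=4, now=0))   # False: segment 1 is held
print(arb.request(3, 5, ticks=4, now=0))   # True: disjoint segments
print(arb.request(1, 2, ticks=4, now=8))   # True: first circuit released
```

Note that non-overlapping transfers share the ring at the same time, which is what would let several circuits be up at once.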

The EIB controller will just have to hold the switches for as many
ticks as required. This jibes nicely with the description of
"reserving channel capacity deterministically".

The SPE knows what data it is getting because the tag structure
interface reaches into the SPE and touches the DMA engine. The
DMA engine knows where the data is coming from and where in LS to
put that data. Or where in LS it should grab the data from and
how much of it to put onto the on ramp of the EIB.

Presumably there is a protocol that we have yet to learn about,
although I suppose the only thing a software type needs to know about
is the interface to the protocol.

<disclaimer>

Based on my understanding of the EIB control flow mechanism,
obtained from a 20 minute chat with the DE who designed the
EIB, with diagrams drawn literally on the back of a napkin.
It may contain inaccuracies due to faulty memory or incorrect
interpretation of statements.

</disclaimer>

What? Like you're going to get sued over a usenet post? ;-).

RM

Larry R. Moore · Mar 11, 2005

David said:
http://www.electronicsweekly.com/articles/article.asp?liArticleID=38754

I had read this to mean that the data rings and the tag run at
different frequencies, which didn't concern me much because they
were going to different places.
http://www.hpcaconf.org/hpca11/papers/25_hofstee-cellprocessor_final.pdf

My understanding of the EIB is that the rings are controlled by the
switching network actually labelled as EIB in the center of the chip.
The rings themselves, as physical wires, run over parts of the SPEs, and
the EIB reaches into the SPEs to direct on/off/repeat buffer
operations. The scheduling for the EIB is coordinated with the little
block labelled as MBL (Master Bus Logic? I'm not sure). The BIC
controls the FlexIO, not the EIB. The FlexIO block is a special
circuit that Rambus developed for IBM's 90nm SOI process, and the BIC
is the logic that drives the FlexIO circuits.

Yes. I think you are correct in that the BIC is unrelated to the EIB. I
misread some comments and didn't study the diagrams.

But let me back up a little. When you refer to an "on/off/repeat
buffer", is this a buffer of 128 bits that can be written (on), read
(off), or transmitted to the next buffer on the ring (repeat)? Does
this mean that a word outbound from SPE#1 is moved to the SPE#2 buffer
in one bus cycle and then repeated to the SPE#3 inbound buffer in a
second bus cycle? Is it accurate to say that the data rings are
comprised of 128 twelve-stage shift registers, with additional logic at
each stage to support read/write functions? This is, of course, an
oversimplification. It must be a little more complicated than this
because the Hofstee paper suggests that there are both inbound and
outbound interfaces that can be used simultaneously to effect twelve
transfers in one bus cycle.
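
The shift-register picture in the question can be made concrete with a toy model. It is hypothetical throughout: the function `step`, the one-word-per-stage slots, and the single unidirectional ring are assumptions used only to illustrate on/off/repeat, not the real EIB logic.

```python
def step(ring, inject=None):
    """Advance a toy ring by one bus cycle.

    ring:   list with one slot per stage; a slot is None or (dest, word).
    inject: optional dict {stage: (dest, word)} of words waiting at on-ramps.
    Returns (new_ring, delivered), where delivered lists the words taken off.
    """
    n = len(ring)
    new_ring = [None] * n
    delivered = []
    for i, slot in enumerate(ring):
        if slot is None:
            continue
        dest, word = slot
        nxt = (i + 1) % n
        if nxt == dest:
            delivered.append((dest, word))   # "off": word leaves the ring here
        else:
            new_ring[nxt] = slot             # "repeat": pass to the next stage
    for stage, slot in (inject or {}).items():
        if new_ring[stage] is None:          # "on": only into an empty stage
            new_ring[stage] = slot           # (a blocked word is simply dropped in this toy)
    return new_ring, delivered


ring = [None] * 12
ring, out = step(ring, {1: (3, "word")})   # stage 1 puts a word on, bound for stage 3
ring, out = step(ring)                     # cycle 1: repeated into stage 2
ring, out = step(ring)                     # cycle 2: taken off at stage 3
print(out)                                 # [(3, 'word')]
```

Because each stage is handled independently, different stages can be in the on, off, and repeat states in the same cycle, which is the property the simultaneous-transfer figure would need.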

Anyways, the reason why I think the "tag" runs to/from different places
is this: data doesn't have to travel to all SPEs, it just has to travel
from source to destination, but the tags have to be broadcast to all
SPEs because the SPEs do have to snoop the tag (bus?)
for coherency of addresses in the host processor's address space.

Oh! You believe that "tag" is the name of a bus? Perhaps it was meant
in the sense of "playing tag" with the data. They could be simple
signals controlling ring selection, on, off and repeat logic at each
interface. I don't know if 64 bits would be enough, though.

That's why I think that the tag part is a different sort of animal
that lets you broadcast things and put address requests on it; the EIB
controller then sets up the switching fabric that directs the
on/off/pass-through operations on the data rings.

You may be right. In order to avoid bus contention, the MBL could be
bus master, polling each of the interfaces for data transfer requests
and writing data transfer schedules. It must be an interesting
algorithm.

I don't think it's a packet network. I think the data rings just carry
data, and the request is encapsulated in the tag ring/bus/blah, that
request goes through the EIB/MBL, sets up the transfer, and tells the
destination guy that something is coming.

You have to say when, too.

Yes, it doesn't make sense to shove the tag onto a data ring that may
already be in use and blocked. A fabric switch would have packet
buffering and that capability has to be shoved back into the computing
elements. Better to send the tag to the scheduler, the MBL, right away.
I think it is safe to assume, however, that the data will be sent in
measured packets to minimize the number of tag requests.

David Wang · Mar 11, 2005

But let me back up a little. When you refer to an "on/off/repeat
buffer", is this a buffer of 128 bits that can be written (on), read
(off), or transmitted to the next buffer on the ring (repeat)? Does
this mean that a word outbound from SPE#1 is moved to the SPE#2 buffer
in one bus cycle and then repeated to the SPE#3 inbound buffer in a
second bus cycle? Is it accurate to say that the data rings are
comprised of 128 twelve-stage shift registers, with additional logic at
each stage to support read/write functions? This is, of course, an
oversimplification. It must be a little more complicated than this
because the Hofstee paper suggests that there are both inbound and
outbound interfaces that can be used simultaneously to effect twelve
transfers in one bus cycle.

I haven't thought about what happens when you have one buffer dumping
data off at the "off ramp", and the "on ramp" driving data onto the
next stage. Perhaps that's where you'd lose half of your efficiency
@ 4 GHz, and get the 96 bytes per cycle concurrency.

I'll think about it and draw myself a structure to illustrate the
dataflow later. Right now I'm supposed to be doing something more
productive.

Oh! You believe that "tag" is the name of a bus? Perhaps it was meant
in the sense of "playing tag" with the data. They could be simple
signals controlling ring selection, on, off and repeat logic at each
interface. I don't know if 64 bits would be enough, though.

I think of it as the "tag" of a block of memory. Sort of like a
tag for a cacheline, but this is a tag for a block of memory
explicitly passed to and from the LS. That's what contains the
request in terms of PPE address pointer/size/source/destination.

You may be right. In order to avoid bus contention, the MBL could be
bus master, polling each of the interfaces for data transfer requests
and writing data transfer schedules. It must be an interesting
algorithm.

I don't think the MBL needs to "poll" in the sense of asking each guy
what its status is. I'd imagine that a lookup table would exist
somewhere near the MBL that tells the MBL which transfer(s)
are occurring on the EIB, how long each transfer is for, and when
"resource X" will be free. The PPE can then tell the MBL to
schedule the next transfer based on resource availability.
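
A rough sketch of such a lookup table, to show how availability could be read off rather than polled. The names `Transfer` and `TransferTable`, the element labels, and the simplification that only the two endpoints count as resources are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Transfer:
    source: str   # e.g. "SPE1" (labels are illustrative)
    dest: str     # e.g. "SPE4"
    start: int    # bus tick at which the transfer begins
    ticks: int    # how long it holds its resources

    @property
    def end(self) -> int:
        return self.start + self.ticks


class TransferTable:
    """Records scheduled transfers so 'when is X free?' is a lookup."""

    def __init__(self) -> None:
        self.transfers: List[Transfer] = []

    def free_at(self, resource: str) -> int:
        """Earliest tick at which `resource` is in no recorded transfer."""
        ends = [t.end for t in self.transfers if resource in (t.source, t.dest)]
        return max(ends, default=0)

    def schedule(self, source: str, dest: str, ticks: int, now: int) -> Transfer:
        """Start as soon as both endpoints are free, and record the transfer."""
        start = max(now, self.free_at(source), self.free_at(dest))
        t = Transfer(source, dest, start, ticks)
        self.transfers.append(t)
        return t


table = TransferTable()
print(table.schedule("SPE1", "SPE4", ticks=8, now=0).start)   # 0
print(table.schedule("SPE4", "MIC", ticks=4, now=2).start)    # 8, waits for SPE4
print(table.schedule("SPE2", "SPE3", ticks=4, now=2).start)   # 2, no conflict
```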

Yes, it doesn't make sense to shove the tag onto a data ring that may
already be in use and blocked. A fabric switch would have packet
buffering and that capability has to be shoved back into the computing
elements. Better to send the tag to the scheduler, the MBL, right away.
I think it is safe to assume, however, that the data will be sent in
measured packets to minimize the number of tag requests.

I think part of the request is "size". There doesn't seem to be a need
for fixed-size packets. Since the LS is limited in size and
the user (through the use of the PPE) has explicit control as to when
the data migration occurs and when the data processing occurs, he/she
should be able to trade off packet sizes for specific applications.
Some threads may want to deal with 1 KB data chunks, while others may
be better with 64 KB data chunks. (Just a WAG)
