http://www.linuxinsider.com/story/34548.html
Fast, Faster and IBM's PlayStation 3 Processor
By Paul Murphy
LinuxInsider
06/17/04 6:38 AM PT
Three years ago, IBM (NYSE: IBM), Sony (NYSE: SNE) and Toshiba
announced a partnership aimed at developing a new processor for use in
digital entertainment devices like the PlayStation. Since then, the
product has seen a billion dollars in development work. Two fabs, one
in Tokyo and one in East Fishkill, New York, have been custom-built to
make the new processor in large volumes. On May 12th, IBM announced
that the first commercial workstations based on this processor would
become available to game-industry developers late this year.
A lot is known about this processor as planned, but relatively little
real information about the product as built has yet leaked. To the
extent that performance information has become available, it is
characterized by numbers so high that most people simply dismissed the
reports. In November of last year, for example, a senior Sony
executive told an internal audience that implementations would scale
from uniprocessors to 64-way groupings that would deliver in excess of
two teraflops -- making it more than 10 times faster than Xeon.
Most of what we know about this machine comes from U.S. patent
#6,526,491 as issued to Sony in February 2003 for a "memory protection
system and method for computer architecture for broadband networks."
Here's the abstract:
A computer architecture and programming model for high speed
processing over broadband networks are provided. The architecture
employs a consistent modular structure, a common computing module and
uniform software cells. The common computing module includes a control
processor, a plurality of processing units, a plurality of local
memories from which the processing units process programs, a direct
memory access controller and a shared main memory.
A synchronized system and method for the coordinated reading and
writing of data to and from the shared main memory by the processing
units also are provided. A hardware sandbox structure is provided for
security against the corruption of data among the programs being
processed by the processing units. The uniform software cells contain
both data and applications and are structured for processing by any of
the processors of the network. Each software cell is uniquely
identified on the network. A system and method for creating a
dedicated pipeline for processing streaming data also are provided.
The machine is widely referred to as a cell processor, but the cells
involved are software, not hardware. Thus a cell is a kind of TCP
packet on steroids, containing both data and instructions and linked
back to the task of which it forms part via unique identifiers that
facilitate results assembly just as the TCP sequence number does.
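The packet analogy can be made concrete with a minimal sketch. Nothing below comes from the patent or from any IBM or Sony software: the names SoftwareCell, task_id, sequence, run_cell and assemble are hypothetical, chosen only to show how a unit that carries both instructions and data, plus identifiers, lets results be put back in order no matter which processor finishes first.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass(frozen=True)
class SoftwareCell:
    """Hypothetical cell: instructions and data travelling together."""
    task_id: int                             # assumed: identifies the parent task
    sequence: int                            # assumed: position within that task
    program: Callable[[List[float]], float]  # the "instructions"
    data: List[float]                        # the payload


def run_cell(cell: SoftwareCell) -> Tuple[int, int, float]:
    """Execute a cell on whichever processor happens to receive it."""
    return cell.task_id, cell.sequence, cell.program(cell.data)


def assemble(results: List[Tuple[int, int, float]], task_id: int) -> List[float]:
    """Reorder results for one task by sequence number, as TCP reorders segments."""
    mine = [(seq, value) for tid, seq, value in results if tid == task_id]
    return [value for _, value in sorted(mine)]


# Four cells of one task, deliberately completed in reverse order.
cells = [SoftwareCell(7, i, sum, [float(i), 1.0]) for i in range(4)]
out_of_order = [run_cell(c) for c in reversed(cells)]
ordered = assemble(out_of_order, task_id=7)
```

Because each result carries its own identifiers, the assembler needs no knowledge of where or when a cell ran, which is the property the TCP comparison points at.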
Outrageous Performance Claims
The basic processor itself appears to be a PowerPC derivative with
high-speed built-in local communications, high-speed access to local
memory, and up to eight attached processing units broadly akin to the
Altivec short array processor used by Apple (Nasdaq: AAPL). The
actual product consists of one to eight of these on a chip -- a true
grid-on-a-chip approach in which a four-way assembly can, when fully
populated, consist of four core CPUs, 32 attached processing units and
512 MB of local memory.
The per-cycle performance of the core CPU is undocumented but may be
expected to be comparable to other PowerPC machines running at high
cache hit rates. Specifications for the four or eight attached
processors comprising the array are known; each is expected to turn
in one floating-point operation per cycle, or around 32 Gigaflops for
a fully populated array of eight at a nominal 4 GHz.
That's where the apparently outrageous performance claims come from; a
four-way assembly running at a planned 4 GHz offers 32 x 4 = 128
Gigaflops in potential floating-point execution. A 64-way supergrid
made by stacking eight eight-way assemblies would have a total of 512
attached processors and could, therefore, break 2 teraflops if data
transportation kept up with the processors.
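The arithmetic behind those figures is easy to check. The sketch below uses only the numbers quoted above (eight attached units per core, one floating-point operation per cycle, a nominal 4 GHz); the function and constant names are made up for illustration, and the results are theoretical peaks, not measurements.

```python
UNITS_PER_CORE = 8  # attached processing units per core CPU, fully populated


def peak_gflops(attached_units: int, clock_ghz: float = 4.0,
                flops_per_cycle: int = 1) -> float:
    """Theoretical peak: units x operations per cycle x clock rate in GHz."""
    return attached_units * flops_per_cycle * clock_ghz


one_core = peak_gflops(UNITS_PER_CORE)        # one fully populated array
four_way = peak_gflops(4 * UNITS_PER_CORE)    # 32 attached units
supergrid = peak_gflops(64 * UNITS_PER_CORE)  # 512 attached units

print(one_core, four_way, supergrid)  # 32.0 128.0 2048.0
```

The 64-way figure of 2,048 Gigaflops is where "in excess of two teraflops" comes from, and it assumes every attached unit is fed data on every cycle.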
In practice, however, Apple has never succeeded in getting the bulk of
its developers to make effective use of the Altivec, and Sun has had
essentially no success getting people outside the military and
intelligence communities to use the four-way SIMD capabilities built
into its Sparc processors. Grid computing is slowly entering the
commercial mainstream, but combining local-array access with grid
computing requires a significant shift in programming paradigm that
will not appeal to the mainstream Wintel and IBM customer base.
Gains Outweigh the Pain
For games developers, however, the potential gains -- up to 50 times
what the best x86-based processor and graphics board combinations can
deliver -- should outweigh the pain. Even minor software changes, the
kind of thing Adobe does to take advantage of the Altivec in
Photoshop, should offer significant advantages to a wider programming
community and enable floating-point-intensive applications to run a
full order of magnitude more quickly on this machine than on Intel's
(Nasdaq: INTC) best.
An important point to bear in mind is that this processor will be
inexpensive, and systems built around it even less expensive because
no external graphics or network boards will be needed. Both Sony and
IBM have been building fabs specifically to make this device. Volumes
will be high because Sony will use up to 20 million assemblies in the
PlayStation, while 10 million or more that don't quite make the
quality cut will get used in its digital televisions and other
products.
Very little has been publicly revealed about the operating system for
this thing, but it is quite obvious what it has to be and how it has
to work. Each core will have its own local Unix kernel, with most just
executing cells as they arrive from the dispatch manager and one
managing the traffic-coordination hardware. In all likelihood, the
kernel used will prove to be both Linux-derived and Linux-compatible
-- meaning that most Linux software will run out of the box on the
uniprocessor configuration while software adapted for the grid
environment will run unchanged on everything from the uniprocessor to
configurations with hundreds or even thousands of processor
assemblies.
As users of Sun's open-source grid software have found, performance
losses on single processes increase as you add processors because data
flow and timing control issues increase in complexity nonlinearly with
system growth. Fundamentally, what happens is that the larger you make
the total machine, whether on one piece of silicon or in a rack, the
more cell transit time dominates execution time and the greater the
performance cost imposed by the need to coordinate operations.
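A toy model shows the shape of that argument. Everything in it is an assumption made for illustration: the n * log2(n) form of the coordination cost, the per_link_cost value and the function names are invented here, not taken from Sun's grid software or from the patent, and the numbers it produces are not measurements.

```python
import math


def run_time(n: int, compute: float = 1.0, per_link_cost: float = 0.001) -> float:
    """Toy run time for one job spread over n processors.

    The compute share shrinks as 1/n, while the coordination term is
    assumed (purely for illustration) to grow as n * log2(n).
    """
    coordination = per_link_cost * n * math.log2(n) if n > 1 else 0.0
    return compute / n + coordination


def speedup(n: int) -> float:
    """Speedup relative to a single processor under the same toy model."""
    return run_time(1) / run_time(n)


for n in (1, 8, 64, 512):
    print(n, round(speedup(n), 2))
```

With these made-up parameters the speedup rises at first, then falls as the coordination term overtakes the shrinking compute share, so that the largest configuration is slower on a single job than one processor. The crossover point depends entirely on the assumed costs; only the general shape is the point.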
New Generation of Linux PCs
The patent mentions the use of no-ops (processor nulls) inserted into
cells to get around timing problems associated with having components
run at different speeds -- with processor coordination initially
enforced by setting TTL-like time budgets for cell execution. My
guess, however, is that advances in cell isolation and programming for
asynchronous event handling have since made those solutions obsolete.
I expect, therefore, that when the real thing appears, it will fully
support both the traditional grid format for on-chip work and an
asynchronous hypergrid for multi-assembly processes on the model
Thinking Machines hoped to achieve with the transputer-based hypercube
in 1985 -- and that NSA is rumored to actually have built on 1989's
Sparc-SIMD-based CM-5.
Either way, however, the OS for this machine is likely to offer both
Linux compatibility at the low end and enormous scalability for those
willing to modify their software -- which is why, as I will discuss in
next week's column, I expect IBM and Toshiba soon to launch a new
generation of Linux PCs built around the combination of this CPU with
IBM software products like Lotus Workspace for Linux.