Nvidia describes 10 teraflops processor
Rick Merritt
11/17/2010 10:47 AM EST
SAN JOSE, Calif. – Nvidia's chief scientist gave attendees at
Supercomputing 2010 a sneak peek of a future graphics chip that will
power an exascale computer. Nvidia is competing with three other teams
to build such a system by 2018 in a program funded by the U.S.
Department of Defense.
Nvidia's so-called Echelon system is just a paper design backed up by
simulations, so it could change radically before it gets built.
Elements of its chip designs ultimately are expected to show up across
the company's portfolio of handheld to supercomputer graphics
products.
"If you can do a really good job computing at one scale you can do it
at another," said Bill Dally, Nvidia's chief scientist, who is heading
up the Echelon project. "Our focus at Nvidia is on performance per
watt [across all products], and we are starting to reuse designs
across the spectrum from Tegra to Tesla chips," he said.
In his talk, Dally described a graphics core that can process a
floating-point operation using just 10 picojoules of energy, down from
200 picojoules on Nvidia's current Fermi chips. Eight of the cores
would be packaged on a single streaming multiprocessor (SM) and 128 of
the SMs would be packed into one chip.
The result would be a thousand-core graphics chip with each core
capable of handling four double precision floating-point operations
per clock cycle—the equivalent of 10 teraflops on a chip. A chip with
just eight of the cores would someday power a handset, Dally said.
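The figures above can be checked with a little arithmetic. The sketch below is not from Dally's talk: it assumes the 10-teraflops figure is peak double-precision throughput with every core issuing four operations per clock, and it counts only the energy of the arithmetic itself, not memory access or data movement. The clock rate it derives is an implication of those assumptions, not a number Nvidia stated, and the function names are made up for illustration.

```python
# Back-of-the-envelope check of the figures quoted in the article.
# Constants come from the article; the derived clock and power values
# are assumptions that follow from them, not numbers Nvidia stated.

CORES_PER_SM = 8                    # cores per streaming multiprocessor
SMS_PER_CHIP = 128                  # SMs per chip
DP_FLOPS_PER_CORE_PER_CLOCK = 4     # double-precision ops per core per cycle
TARGET_FLOPS = 10e12                # "10 teraflops on a chip"
ECHELON_JOULES_PER_FLOP = 10e-12    # 10 picojoules
FERMI_JOULES_PER_FLOP = 200e-12     # 200 picojoules


def total_cores() -> int:
    """Cores on one chip: 8 per SM times 128 SMs."""
    return CORES_PER_SM * SMS_PER_CHIP


def implied_clock_hz() -> float:
    """Clock rate needed to reach the target if every core is fully busy."""
    flops_per_clock = total_cores() * DP_FLOPS_PER_CORE_PER_CLOCK
    return TARGET_FLOPS / flops_per_clock


def arithmetic_power_watts(joules_per_flop: float) -> float:
    """Power spent on arithmetic alone at the target throughput."""
    return joules_per_flop * TARGET_FLOPS


if __name__ == "__main__":
    print(f"cores per chip:     {total_cores()}")
    print(f"implied clock:      {implied_clock_hz() / 1e9:.2f} GHz")
    print(f"power at 10 pJ:     {arithmetic_power_watts(ECHELON_JOULES_PER_FLOP):.0f} W")
    print(f"power at 200 pJ:    {arithmetic_power_watts(FERMI_JOULES_PER_FLOP):.0f} W")
```

Under these assumptions the "thousand-core" chip has 1,024 cores, and sustaining 10 teraflops at Fermi's energy per operation would take roughly 20 times the arithmetic power, which is why the per-operation energy figure matters so much to the design.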
The Echelon chip packs just twice as many cores as today's high-end
Nvidia GPUs. However, today's cores handle just one double precision
floating-point operation per cycle, compared to four for the Echelon
chip.
Many of the advances in the chip come from its use of memory. The
Echelon chip will use 256 Mbytes of SRAM that can be
dynamically configured to meet the needs of an application.
For example, the SRAM could be broken up into as many as six levels of
cache, each of a variable size. At the lowest level each core would
have its own private cache.
The goal is to get data as close to processing elements as possible to
reduce the need to move data around the chip, which wastes energy. Thus SMs
would have a hierarchy of processor registers that could be matched to
locations in cache levels. In addition, the chip would have broadcast
mechanisms so that the results of one task could be shared with any
nodes that needed that data.
Programming a 1,000-core processor
To ease programming, the design is cache coherent across both graphics
and traditional processor cores. Indeed, finding ways to program many-
core processors is one of the chief challenges for today's computer
scientists.
"We are about to see a sea change in programming models," said Dally.
"In high performance computing we went from vectorized Fortran to MPI
and now we need a new programming model for the next decade or so," he
said.
"We think it should be an evolution of [Nvidia's] CUDA," said Dally.
"But there are CUDA-like approaches such as OpenCL, OpenMP and
[Microsoft's] DirectCompute or a whole new language," he said.
All the languages use similar ingredients. For example, they try to
build into their semantics support for advanced memory sharing
mechanisms.
Nvidia's Echelon system will compete with teams from Intel, MIT and
Sandia National Labs, each taking a different approach to building
power-efficient exascale systems.
The Ubiquitous High Performance Computing program is sponsored by the
Defense Advanced Research Projects Agency. DARPA tasked the teams to
build by 2014 a prototype petaflop-class system that fits in a single
57-kilowatt rack. Such systems could be used as building blocks to
create an exascale system to be built by 2018.
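To put the DARPA targets in perspective, the sketch below works out what a petaflop-class, 57-kilowatt rack implies for efficiency and for an exascale machine. It assumes "petaflop-class" means exactly one petaflops per rack and that racks scale linearly, with no extra power for interconnect, storage or cooling. Neither assumption comes from the article, and the function names are made up for illustration.

```python
# Rough scaling of the DARPA rack target to an exascale system.
# Assumes exactly 1 petaflops per rack and perfectly linear scaling;
# both are simplifying assumptions, not program requirements.

RACK_FLOPS = 10**15        # one petaflops per rack (assumed)
RACK_POWER_WATTS = 57_000  # 57 kilowatts per rack, from the article
EXASCALE_FLOPS = 10**18    # one exaflops


def racks_for_exascale() -> int:
    """Number of petaflop racks needed to reach one exaflops."""
    return EXASCALE_FLOPS // RACK_FLOPS


def exascale_power_megawatts() -> float:
    """Total power if every rack draws its full 57 kW."""
    return racks_for_exascale() * RACK_POWER_WATTS / 1e6


def gigaflops_per_watt() -> float:
    """Efficiency a single rack must reach to hit the target."""
    return RACK_FLOPS / RACK_POWER_WATTS / 1e9


def picojoules_per_flop() -> float:
    """System-level energy budget per operation, all overheads included."""
    return RACK_POWER_WATTS / RACK_FLOPS * 1e12


if __name__ == "__main__":
    print(f"racks needed:       {racks_for_exascale()}")
    print(f"total power:        {exascale_power_megawatts():.0f} MW")
    print(f"rack efficiency:    {gigaflops_per_watt():.1f} GFLOPS/W")
    print(f"energy budget:      {picojoules_per_flop():.0f} pJ per flop")
```

On these assumptions, a whole rack gets a budget of about 57 picojoules per operation, covering memory, interconnect and everything else. That leaves room for a 10-picojoule arithmetic core but not for one at Fermi's 200 picojoules.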
http://www.eetimes.com/ContentEETimes/Echelon chip.jpg
Nvidia's Echelon chip packs 1,000 cores in 128 blocks
http://www.eetimes.com/electronics-news/4210815/Nvidia-describes-10-teraflops-processor