NVIDIA GT200 GPU and Architecture Analysis
Published on 16th Jun 2008, written by Rys for Consumer Graphics -
Last updated: 15th Jun 2008
Introduction
Sorry G80, your time is up.
There's no arguing that NVIDIA's flagship D3D10 GPU has held a reign
over 3D graphics that never truly saw it usurped, even by G92 and a
dubiously named GeForce 9-series range. The high-end launch product
based on G80, GeForce 8800 GTX, is still within spitting distance of
anything that's come out since in terms of raw single-chip
performance. It flaunts its 8 clusters, 384-bit memory bus and 24 ROPs
in the face of G92, meaning that products like 9800 GTX have never
really felt like true upgrades to owners of G80-based products.
That I type this text on my own PC powered by a GeForce 8800 GTX, one
that I bought -- which is largely unheard of in the world of tech
journalism; as a herd, we never usually buy PC components -- with my
own hard-earned, and on launch day no less, speaks volumes for the
chip's longevity. I'll miss you, old girl; your 20-month spell at the
top of the pile is finally up. So which chip is the usurper, and how
far has it moved the game on?
Rumours about GT200 have swirled for some time, and recently the
rumour mill has mostly got it right. The basic architecture is pretty
much a known quantity at this point, and it's a basic architecture
that shares a lot of common ground with the one powering the chip
we've just eulogised. Why mess too much with what's worked so well?
"Correctamundo", says the Fonz, and the Fonz is always right.
It's all about the detail now, so we'll try to reveal as much as
possible and show where the differences lie. We'll delve into the
architecture first, before taking a look at the first two products it
powers, looking back to previous NVIDIA D3D10 hardware as necessary to
paint the picture.
NVIDIA GT200 Overview
The following diagram represents a high-level look at how GT200 is
architected and what some of the functional units are capable of. It's
a similar chip to G80, of that there's no doubt, but the silicon
surgery undertaken by NVIDIA's architects to create it means we have
quite a different beast when you take a look under the surface.
http://www.beyond3d.com/images/reviews/gt200-arch/GT200-full-1.2-26-05-08.png
If it's not clear from the above diagram, like G80, GT200 is a fully-
unified, heavily-threaded, self load-balancing (full time, agnostic of
API) shading architecture. It has decoupled and threaded data
processing, allowing the hardware to fully realise the goal of hiding
sampler latency by scheduling sampler threads independently of, and
asynchronously with, shading threads.
The design goals of the chip appear to be the improvement of D3D10
performance in general, especially at the Geometry Shader stage, with
the end result presumably as close to doubling the performance of a
similarly clocked G92 as possible. There's not 2x the raw performance
available everywhere on the chip of course, but the increase in
certain computation resources should see it achieve something like
that in practice, depending on what's being rendered or computed.
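As a rough sanity check on that near-2x figure, some back-of-the-
envelope arithmetic (using the widely reported layout of 10 clusters
of 3 SMs for GT200 against G92's 8 clusters of 2 SMs, with 8 SPs per
SM in both cases):

    GT200:  10 clusters x 3 SMs x 8 SPs = 240 SPs
    G92:     8 clusters x 2 SMs x 8 SPs = 128 SPs
    Ratio:   240 / 128 = 1.875

Fold in the improved dual-issue of the extra MUL we'll cover shortly
and something approaching 2x per clock looks achievable in ALU-bound
cases.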
Let's look closer at the chip architecture, then. The analysis was
written with our original look at G80 in mind. The architecture we
discussed there is the basis for what we'll talk about today, so have
a good read of that to refresh your memory, and/or ask in the forums
if anything doesn't make sense. The original piece is a little
outdated in places, since we've learned more about the chip over the
last year and a half, so just ask about, or let us know of, anything
that doesn't quite fit.
GT200: The Shading Core
http://www.beyond3d.com/images/reviews/gt200-arch/shader-core.png
GT200 demonstrates subtle yet distinct architectural differences when
compared to G80, the chip that pioneered the basic traits of this
generation of GPUs from Kirk and Co. As we've alluded to, G80 led a
family of chips that have underpinned the company's dominance over AMD
in the graphics space since its launch, so it's no surprise to see
NVIDIA stick to the same themes of execution, use of on-chip memories,
and approach to acceleration of graphics and non-graphics computation.
At its core, GT200 is a MIMD array of SIMD processors, partitioned
into what we call clusters, each cluster being a 3-way collection of
the processor blocks we call SMs. Each SM, or streaming
multiprocessor, comprises 8 scalar ALUs, with each capable of FP32 and
32-bit integer computation (the only exception being multiplication,
which is INT24 and therefore still takes 4 cycles for INT32), a single
64-bit ALU for brand new FP64 support, and a discrete pool of shared
memory 16KiB in size.
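To make those per-SM resources a little more concrete, here's a
minimal CUDA host-side sketch of ours (the variable names are our
own; the calls are the standard CUDA runtime device query), with the
figures we'd expect on GT200-class hardware noted in the comment:

    #include <cstdio>
    #include <cuda_runtime.h>

    int main()
    {
        cudaDeviceProp prop;
        cudaGetDeviceProperties(&prop, 0);  // query device 0

        // On GT200-class hardware we'd expect 16KiB of shared memory
        // per block, 16384 registers per block, a 32-thread warp size
        // and compute capability 1.3.
        printf("Name:                  %s\n", prop.name);
        printf("Compute capability:    %d.%d\n", prop.major, prop.minor);
        printf("Multiprocessors (SMs): %d\n", prop.multiProcessorCount);
        printf("Shared mem per block:  %lu bytes\n",
               (unsigned long)prop.sharedMemPerBlock);
        printf("Registers per block:   %d\n", prop.regsPerBlock);
        printf("Warp size:             %d\n", prop.warpSize);
        return 0;
    }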
The FP64 ALU is notable not just for its inclusion, with NVIDIA
supporting 64-bit floating-point computation for the first time in
one of its graphics processors, but for its capability. It's able to
execute a double-precision MAD (or MUL or ADD) per clock, supports
32-bit integer computation, and, somewhat surprisingly, handles
denormals at full speed with no cycle penalty, something you won't
see in any other readily available DP processor (such as any x86 or
Cell). The ALU uses the MAD to accelerate software support for
specials and divides, where possible.
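As a quick, hedged illustration of how that unit is reached from
CUDA (the kernel below is our own sketch rather than anything of
NVIDIA's), a simple double-precision multiply-add compiled for GT200
with nvcc -arch=sm_13 should end up on the DP ALU described above:

    // Illustrative kernel: one double-precision MAD per element.
    __global__ void dp_mad(const double* a, const double* b,
                           const double* c, double* out, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n)
            out[i] = fma(a[i], b[i], c[i]);  // a*b + c in FP64
    }

Without -arch=sm_13 the compiler demotes doubles to floats, so the
flag matters if you actually want the FP64 path.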
Those ALUs are paired with another per-SM block of computation units,
just like G80, which provide scalar interpolation of attributes for
shading and a single FP-only MUL ALU. That lets each SM potentially
dual-issue 8 MAD+MUL instruction pairs per clock for general shading,
with the MUL also assisting in attribute setup when required.
However, as you'll see, that dual-issue performance depends heavily on
input operand bandwidth.
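Counting the MAD as two floating-point operations and the co-issued
MUL as one, that dual-issue figure works out to a theoretical peak of

    8 x (2 + 1) = 24 FLOPs per SM per clock

though, as noted, how much of the extra MUL you actually see depends
on keeping it fed with operands.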
Each warp of threads still runs for four clocks per SM, with up to
1024 threads managed per SM by the scheduler (which has knock-on
effects for the programmer when thinking about thread blocks per
cluster). The hardware still scales back threads in flight if there's
register pressure of course, but that should happen less often now
that the RF has doubled in size per SM (and it might happen more
gracefully to boot).
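In CUDA terms, and reusing the hypothetical dp_mad kernel from
earlier (d_a, d_b, d_c, d_out and n being device pointers and an
element count you'd have set up beforehand), that 1024-thread ceiling
is the first bound on residency per SM:

    // With 256 threads per block, at most 1024 / 256 = 4 blocks
    // (32 warps) can be resident per SM before register or shared
    // memory pressure is considered; each 32-wide warp then issues
    // across the SM's 8 SPs over four clocks.
    int threads = 256;
    int blocks  = (n + threads - 1) / threads;
    dp_mad<<<blocks, threads>>>(d_a, d_b, d_c, d_out, n);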
So, alongside that pool of shared memory, each SM is connected to its
own register file comprising 16384 32-bit registers, double what was
available per SM in G80. Each SP in an SM runs the same instruction
per clock as the others, but each SM in a cluster can run its own
instruction. Therefore, in any given cycle, the SMs in a cluster are
potentially each executing a different instruction of a shader
program, each in SIMD fashion across its SPs. That goes for the FP64
ALU per SM too, which could
execute at the same time as the FP32 units, but it shares datapaths to
the RF, shared memory pools, and scheduling hardware with them so the
two can't go full-on at the same time (presumably it takes the place
of the MUL/SFU, but perhaps it's more flexible than that). Either way,
it's not currently exposed outside of CUDA or used to boost FP32
performance.
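To put the doubled register file in context, a quick bit of
arithmetic at the full thread complement:

    16384 registers / 1024 threads = 16 registers per thread

so a kernel needing more than 16 registers per thread can't sustain
the full 1024 threads per SM, which is where the scaling back of
threads in flight mentioned earlier comes into play.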
That covers basic execution across a cluster using its own memory
pools. Across the shader core, each SM in each cluster is able to run
a different instruction for a shader program, giving each SM its own
program counter, scheduling resources, and discrete register file
block. A processing thread started on one cluster can never execute on
any other, although another thread can take its place every cycle. The
SM schedulers implement execution scoreboarding and are fed from the
global scheduler and per thread-type setup engines, one for VS, one
for GS and one for PS threads.