NVIDIA GT200 GPU and Architecture Analysis
Published on 16th Jun 2008, written by Rys for Consumer Graphics -
Last updated: 15th Jun 2008
Introduction
Sorry G80, your time is up.
There's no arguing that NVIDIA's flagship D3D10 GPU has held a reign
over 3D graphics that never truly saw it usurped, even by G92 and a
dubiously named GeForce 9-series range. The high-end launch product
based on G80, GeForce 8800 GTX, is still within spitting distance of
anything that's come out since in terms of raw single-chip
performance. It flaunts its 8 clusters, 384-bit memory bus and 24 ROPs
in the face of G92, meaning that products like 9800 GTX have never
really felt like true upgrades to owners of G80-based products.
That I type this text on my own PC powered by a GeForce 8800 GTX, one
that I bought -- which is largely unheard of in the world of tech
journalism; as a herd, we never usually buy PC components -- with my
own hard-earned, and on launch day no less, speaks volumes for the
chip's longevity. I'll miss you, old girl; your 20-month spell at the
top of the pile is finally up. So which chip is the usurper, and how
far has it moved the game on?
Rumours about GT200 have swirled for some time, and recently the
rumour mill has mostly got it right. The basic architecture is pretty
much a known quantity at this point, and it's a basic architecture
that shares a lot of common ground with the one powering the chip
we've just eulogised. Why mess too much with what's worked so well?
"Correctamundo", says the Fonz, and the Fonz is always right.
It's all about the detail now, so we'll try to reveal as much as
possible and show where the differences lie. We'll delve into the
architecture first, before taking a look at the first two products it
powers, looking back to previous NVIDIA D3D10 hardware as necessary to
paint the picture.
NVIDIA GT200 Overview
The following diagram represents a high-level look at how GT200 is
architected and what some of the functional units are capable of. It's
a similar chip to G80, of that there's no doubt, but the silicon
surgery undertaken by NVIDIA's architects to create it means we have
quite a different beast when you take a look under the surface.
http://www.beyond3d.com/images/reviews/gt200-arch/GT200-full-1.2-26-05-08.png
If it's not clear from the above diagram, like G80, GT200 is a fully-
unified, heavily-threaded, self load-balancing (full time, agnostic of
API) shading architecture. It has decoupled and threaded data
processing, allowing the hardware to fully realise the goal of hiding
sampler latency by scheduling sampler threads independently of, and
asynchronously with, shading threads.
The design goals of the chip appear to be the improvement of D3D10
performance in general, especially at the Geometry Shader stage, with
the end result presumably as close to doubling the performance of a
similarly clocked G92 as possible. There's not 2x the raw performance
available everywhere on the chip of course, but the increase in
certain computation resources should see it achieve something like
that in practice, depending on what's being rendered or computed.
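As a rough sanity check on that near-2x figure, some back-of-the-
envelope arithmetic (using the widely reported layout of 10 clusters
of 3 SMs for GT200 against G92's 8 clusters of 2 SMs, with 8 SPs per
SM in both cases):

    GT200:  10 clusters x 3 SMs x 8 SPs = 240 SPs
    G92:     8 clusters x 2 SMs x 8 SPs = 128 SPs
    Ratio:   240 / 128 = 1.875

Fold in the improved dual-issue of the extra MUL we'll cover shortly
and something approaching 2x per clock looks achievable in ALU-bound
cases.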
Let's look closer at the chip architecture, then. The analysis was
written with our original look at G80 in mind. The architecture we
discussed there is the basis for what we'll talk about today, so have
a good read of that to refresh your memory, and/or ask in the forums
if anything doesn't make sense. The original piece is a little
outdated in places, since we've learned more about the chip over the
last year and a half, so just ask about, or let us know of, anything
that doesn't quite fit.
GT200: The Shading Core
http://www.beyond3d.com/images/reviews/gt200-arch/shader-core.png
GT200 demonstrates subtle yet distinct architectural differences when
compared to G80, the chip that pioneered the basic traits of this
generation of GPUs from Kirk and Co. As we've alluded to, G80 led a
family of chips that have underpinned the company's dominance over AMD
in the graphics space since its launch, so it's no surprise to see
NVIDIA stick to the same themes of execution, use of on-chip memories,
and approach to acceleration of graphics and non-graphics computation.
At its core, GT200 is a MIMD array of SIMD processors, partitioned
into what we call clusters, each cluster being a 3-way collection of
the processor blocks we call SMs. Each SM, or streaming
multiprocessor, comprises 8 scalar ALUs, with each capable of FP32 and
32-bit integer computation (the only exception being multiplication,
which is INT24 and therefore still takes 4 cycles for INT32), a single
64-bit ALU for brand new FP64 support, and a discrete pool of shared
memory 16KiB in size.
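To make those per-SM resources a little more concrete, here's a
minimal CUDA host-side sketch of ours (the variable names are our
own; the calls are the standard CUDA runtime device query), with the
figures we'd expect on GT200-class hardware noted in the comment:

    #include <cstdio>
    #include <cuda_runtime.h>

    int main()
    {
        cudaDeviceProp prop;
        cudaGetDeviceProperties(&prop, 0);  // query device 0

        // On GT200-class hardware we'd expect 16KiB of shared memory
        // per block, 16384 registers per block, a 32-thread warp size
        // and compute capability 1.3.
        printf("Name:                  %s\n", prop.name);
        printf("Compute capability:    %d.%d\n", prop.major, prop.minor);
        printf("Multiprocessors (SMs): %d\n", prop.multiProcessorCount);
        printf("Shared mem per block:  %lu bytes\n",
               (unsigned long)prop.sharedMemPerBlock);
        printf("Registers per block:   %d\n", prop.regsPerBlock);
        printf("Warp size:             %d\n", prop.warpSize);
        return 0;
    }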
The FP64 ALU is notable not just for its inclusion, with NVIDIA
supporting 64-bit floating-point computation for the first time in
one of its graphics processors, but for its capability. It's able to
execute a double-precision MAD (or MUL or ADD) per clock, supports
32-bit integer computation, and, somewhat surprisingly, handles
denormals at full speed with no cycle penalty, something you won't
see in any other readily available DP processor (such as any x86 or
Cell). The ALU uses the MAD to accelerate software support for
specials and divides, where possible.
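As a quick, hedged illustration of how that unit is reached from
CUDA (the kernel below is our own sketch rather than anything of
NVIDIA's), a simple double-precision multiply-add compiled for GT200
with nvcc -arch=sm_13 should end up on the DP ALU described above:

    // Illustrative kernel: one double-precision MAD per element.
    __global__ void dp_mad(const double* a, const double* b,
                           const double* c, double* out, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n)
            out[i] = fma(a[i], b[i], c[i]);  // a*b + c in FP64
    }

Without -arch=sm_13 the compiler demotes doubles to floats, so the
flag matters if you actually want the FP64 path.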
Those ALUs are paired with another per-SM block of computation units,
just like G80, which provide scalar interpolation of attributes for
shading and a single FP-only MUL ALU. That lets each SM potentially
dual-issue 8 MAD+MUL instruction pairs per clock for general shading,
with the MUL also assisting in attribute setup when required.
However, as you'll see, that dual-issue performance depends heavily on
input operand bandwidth.
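Counting the MAD as two floating-point operations and the co-issued
MUL as one, that dual-issue figure works out to a theoretical peak of

    8 x (2 + 1) = 24 FLOPs per SM per clock

though, as noted, how much of the extra MUL you actually see depends
on keeping it fed with operands.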
Each warp of threads still runs for four clocks per SM, with up to
1024 threads managed per SM by the scheduler (which has knock-on
effects for the programmer when thinking about thread blocks per
cluster). The hardware still scales back threads in flight if there's
register pressure of course, but that should happen less often now
that the RF has doubled in size per SM (and it might happen more
gracefully to boot).
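In CUDA terms, and reusing the hypothetical dp_mad kernel from
earlier (d_a, d_b, d_c, d_out and n being device pointers and an
element count you'd have set up beforehand), that 1024-thread ceiling
is the first bound on residency per SM:

    // With 256 threads per block, at most 1024 / 256 = 4 blocks
    // (32 warps) can be resident per SM before register or shared
    // memory pressure is considered; each 32-wide warp then issues
    // across the SM's 8 SPs over four clocks.
    int threads = 256;
    int blocks  = (n + threads - 1) / threads;
    dp_mad<<<blocks, threads>>>(d_a, d_b, d_c, d_out, n);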
So, alongside that pool of shared memory, each SM is connected to its
own register file comprising 16384 32-bit registers, double what was
available per SM in G80. Each SP in an SM runs the same instruction
per clock as the others, but each SM in a cluster can run its own
instruction. Therefore, in any given cycle, the SMs in a cluster are
potentially each executing a different instruction of a shader
program, each in SIMD fashion across its SPs. That goes for the FP64
ALU per SM too, which could
execute at the same time as the FP32 units, but it shares datapaths to
the RF, shared memory pools, and scheduling hardware with them so the
two can't go full-on at the same time (presumably it takes the place
of the MUL/SFU, but perhaps it's more flexible than that). Either way,
it's not currently exposed outside of CUDA or used to boost FP32
performance.
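To put the doubled register file in context, a quick bit of
arithmetic at the full thread complement:

    16384 registers / 1024 threads = 16 registers per thread

so a kernel needing more than 16 registers per thread can't sustain
the full 1024 threads per SM, which is where the scaling back of
threads in flight mentioned earlier comes into play.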
That covers basic execution across a cluster using its own memory
pools. Across the shader core, each SM in each cluster is able to run
a different instruction for a shader program, giving each SM its own
program counter, scheduling resources, and discrete register file
block. A processing thread started on one cluster can never execute on
any other, although another thread can take its place every cycle. The
SM schedulers implement execution scoreboarding and are fed from the
global scheduler and per thread-type setup engines, one for VS, one
for GS and one for PS threads.