NV40 getting over 7000M pixels/sec (over 7G pixels!)

  • Thread starter: NV55

http://www.beyond3d.com/forum/viewtopic.php?t=10946


quote:

==============================================================================
Well, it's time. Some of this info is not new and was speculated
before; now it is just confirmed.
The card that was running those games was an NV40 (an A2
revision) clocked at 475 MHz core and 1.2 GHz memory (GDDR3) with
256 MB. The card IS 16x1, as the fillrate shows: 3DMark2001
gives 7010 Mpixels/s for single texturing and 7234 Mpixels/s
for multi texturing (pretty impressive, isn't it?). PS performance is
boosted a lot, at least 2.5x faster and in some cases up to 5x faster. The AA
used in the pics is RGMSAA at 4x; the maximum AF (on these drivers,
56.55) was 8x, but I think we will see 16x. In most of the games
the performance was 2.5x to 3x faster than my FX5950 at high
resolution with AA as well as AF, and the IQ... what can I say?
It is far, far better. Now some caveats: there were a lot of problems,
mostly corrupted textures, but not in any of the tested games. I'm
100% sure that at launch the drivers will pack a lot more features
just for the NV40. On another note, I can't deny that there is indeed
some blurring in those screenshots, but first consider:

1) If you don't zoom the picture, it looks just damn good.
2) There are no final drivers, so no final AA or AF levels.
3) I'm not saying it is better than ATI or anything like that; I just
think it is WAY better than any other Nvidia card.

As for speed, remember that A2 is not the final revision,
so the final one "could" be a little faster. Well, that's all... at least
all I can say now. There is something else, but I would risk too much and I
made a promise. Maybe later this month. I have said everything
I know, hope you like it, and again, excuse my bad English.



Halo ( no AA, 8x AF ) 1600x1200 51.1fps
Far Cry ( no AA, no AF ) 1600x1200 53.4fps
UT2004 ( 4xAA, 4xAF ) 1600x1200 71.9fps - 1280x960 82.9fps ( both
botmatch )
Star Wars Kotor ( 4xAA, 8xAF ) 1600x1200 49.3fps
===============================================================================


I think this would confirm that NV40 does indeed have 16 pixel
pipelines.
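
As a sanity check on the 16x1 reading: theoretical peak fillrate is just pipelines times core clock, assuming one pixel per pipeline per clock. A minimal sketch using only the clock and 3DMark2001 figures quoted above (the function name is made up for illustration):

```python
def peak_fillrate_mpixels(pipelines: int, core_mhz: float) -> float:
    """Theoretical peak in Mpixels/s: one pixel per pipeline per clock."""
    return pipelines * core_mhz

CORE_MHZ = 475.0           # core clock quoted in the post
MEASURED_SINGLE = 7010.0   # 3DMark2001 single-texturing result, Mpixels/s

peak_16 = peak_fillrate_mpixels(16, CORE_MHZ)
peak_8 = peak_fillrate_mpixels(8, CORE_MHZ)

# The measured figure is far above what 8 pipelines could deliver at this
# clock, and sits a little below the 16-pipeline peak.
efficiency_16 = MEASURED_SINGLE / peak_16
print(f"16 pipes: {peak_16:.0f} Mpixels/s peak, {efficiency_16:.0%} achieved")
print(f" 8 pipes: {peak_8:.0f} Mpixels/s peak")
```

An 8x2 layout would top out at 3800 Mpixels/s in the single-texturing test at 475 MHz, so a measured 7010 only fits a 16-pipeline part, provided the leaked numbers are genuine.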
 
And not with AGP..?

Implying... ?

AGP is sure not the bottleneck regardless of what the PCI-Express
gang would have you believe... CPU is the limp rag here. And it
will get even limper when it has to run advanced AI algorithms
for next-gen games instead of scripts.

John Lewis
 
And how fast of a CPU would you need to properly feed the NV40, as well
as ATI's R420 and R423 parts?

Depends on the application that is feeding it. In the optimal case the CPU doesn't
have to be anywhere near cutting-edge. The proper way to feed the GPU is to do
animation on the GPU, skinning on the GPU, have the data reside in
GPU local memory and just call ->DrawIndexedPrimitive() (D3D), or
glDrawElements(), or execute a display list, or whatever the fastest render path is
on the particular card (GL; VBO and DLIST are quite fast most of the time on
PEECEE).

The CPU is relevant mostly because of D3D DIP overhead; for some reason DIPs eat CPU
cycles very hungrily. So if you want your application to be fast, you
organize the data before rendering so that you can minimize the number of
DIP calls. CPU speed enters the equation only when something is done WRONG.
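
To put rough numbers on the DIP argument, here is a toy cost model. The per-call overhead and the scene size are made-up assumptions, not measurements of any driver; the only point is that CPU time scales with the number of calls, not with the number of triangles:

```python
def cpu_ms_per_frame(triangles: int, tris_per_call: int, us_per_call: float) -> float:
    """CPU time spent issuing draw calls for one frame, in milliseconds."""
    calls = -(-triangles // tris_per_call)  # ceiling division
    return calls * us_per_call / 1000.0

US_PER_CALL = 40.0    # ASSUMPTION: hypothetical per-call CPU overhead, microseconds
TRIANGLES = 200_000   # ASSUMPTION: hypothetical scene size

small_batches = cpu_ms_per_frame(TRIANGLES, 50, US_PER_CALL)     # 4000 calls
large_batches = cpu_ms_per_frame(TRIANGLES, 2000, US_PER_CALL)   # 100 calls

frame_budget_ms = 1000.0 / 60.0  # time available per frame at 60 fps
print(f"50 tris/call:   {small_batches:.1f} ms of CPU per frame")
print(f"2000 tris/call: {large_batches:.1f} ms of CPU per frame")
print(f"60 fps budget:  {frame_budget_ms:.1f} ms")
```

With these invented figures the same scene costs 160 ms of CPU per frame in small batches and 4 ms in large ones, which is the whole case for organizing the data before rendering.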
 
AGP is sure not the bottleneck regardless of what the PCI-Express
gang would have you believe... CPU is the limp rag here. And it
will get even limper when it has to run advanced AI algorithms
for next-gen games instead of scripts.

Scripts...? How does a GPU run scripts? Tell me that. It can execute fragment and
vertex programs; feedback is high-overhead because reading from a render
target is the most feasible way and that's slow as f**k, not least
because the rendering must complete before you can query results from the
textures.

This is just not done by "realtime" applications which want > 60 fps. The
textures can be used repeatedly within the same frame, though, but that is done
inside the GPU, not by feeding back to the CPU address space.

Animation is done to a degree with GPU vertex programs, using lerp and
periodically updating the keyframe data in the vertex buffer; since it overlaps
for different "objects", the updating is distributed evenly over the frames.
That sort of stuff is NOT 'scripting', but please elaborate on what you mean by
'scripting' in the context of GPUs, all ears.
 
Looks interesting... let's still wait and see what ATI has to offer! Ya never
know what the ATI engineers have cooked up!

But I am happy Nvidia now uses about the same anti-aliasing method
as ATI did.
 
"joe smith" wrote:
Scripts...? How does a GPU run scripts? Tell me that. It can execute fragment and
vertex programs; feedback is high-overhead because reading from a render
target is the most feasible way and that's slow as f**k, not least
because the rendering must complete before you can query results from the
textures.

This is just not done by "realtime" applications which want > 60 fps. The
textures can be used repeatedly within the same frame, though, but that is done
inside the GPU, not by feeding back to the CPU address space.

Animation is done to a degree with GPU vertex programs, using lerp and
periodically updating the keyframe data in the vertex buffer; since it overlaps
for different "objects", the updating is distributed evenly over the frames.
That sort of stuff is NOT 'scripting', but please elaborate on what you mean by
'scripting' in the context of GPUs, all ears.

I believe John was simply implying that as AI in games becomes more
advanced, frame rate will suffer due to the CPU diverting more time to AI. Try
running something CPU-intensive like SETI or a video encoding program like
TMPGEnc with priority set to high, then start a game to get an idea.

I would imagine physics engines would have a similar effect too as they
become more advanced.
 
Scripts...? How does a GPU run scripts?

You might just need a visit to your eye specialist...... Read the
above quote from my original post again, please !!!


John Lewis
 
Depends on the application that is feeding it. In the optimal case the CPU doesn't
have to be anywhere near cutting-edge. The proper way to feed the GPU is to do
animation on the GPU, skinning on the GPU, have the data reside in
GPU local memory and just call ->DrawIndexedPrimitive() (D3D), or
glDrawElements(), or execute a display list, or whatever the fastest render path is
on the particular card (GL; VBO and DLIST are quite fast most of the time on
PEECEE).

The CPU is relevant mostly because of D3D DIP overhead; for some reason DIPs eat CPU
cycles very hungrily. So if you want your application to be fast, you
organize the data before rendering so that you can minimize the number of
DIP calls. CPU speed enters the equation only when something is done WRONG.

Now that you have read my original post correctly, let me elaborate
on the fact that the load on the CPU will play a significant role
in the potential throttling of graphics capability in future games.

If you have BF1942, turn on the single-player game and start playing
with the AI level and bot-number settings. Notice anything about what
happens to the frame rate even when only a few bots are visible on the
screen? As the AI level and bot numbers are raised, the poor old CPU
begins to sweat bricks. In fact, you can see it if you have a thermal
monitor on the CPU. By the way, BF1942, like Far Cry, has
autonomous-style AI. Q.E.D.

Also, Les pointed out that sophisticated physics engines, like the
Havok rag-doll physics are also a significant additional load on the
CPU.

I would much prefer to have an action game with excellent frame rate
and modest graphic detail, if I have to make a choice, so I always
pay close attention to future CPU requirements when contemplating
system updates, especially for systems intended for high-performance
gaming. My priority list: CPU, GPU, memory, motherboard, power supply, sound,
cooling efficiency, with items like AGP performance way down the
scale, PCI-Express a big zero, and BTX a big negative in a
high-performance desktop system.

John Lewis
 
And why exactly will PCI-E be a big zero?

John
 
You might just need a visit to your eye specialist...... Read the
above quote from my original post again, please !!!

First, the CPU is limpdick only on non-graphics-related stuff if the GPU
code is properly done, so why mention the CPU at all...? This was in the context of the
CPU *feeding* the GPU; feeding takes minimal CPU if it is done 'right', so a
286 will do for that. If CPU time is burned on AI and other tasks, that is not
related to the GPU and is irrelevant here.

I think you need to consult a know-how-the-GPU-is-programmed specialist;
for instance, I am available for consulting. Ask if anything is
unclear.
 
And why exactly will PCI-E be a big zero?

Apologies, John, for not explaining.

If you are:-

(a) building a system from scratch
(b) have access to motherboards, plug-in cards, and drivers
with all desired performance and ZERO price-premium
over equivalent-performance non-PCI-Express.

then it would be perfectly logical to go PCI-Express. The reliability
of the physical interconnect would be sufficient reason if nothing
else... unless PCI-Express becomes umbilically tied to BTX. Then
stay away from PCI-Express as long as possible. BTX is Intel-centric
and the thermal airflow assumes that the video GPU is always
MB-mounted in the cooling duct with the CPU. The provision for cooling
a separate GPU board is worse than in ATX; take a look at where the
cooling duct on BTX exhausts some of its heat. And the next-gen
high-end GPUs will have to get rid of at least 100 watts of heat
even if they are on a 0.09 micron (90 nm) process. And not all of us want to
invest in water-cooling.

However, PCI-Express is being touted as a must-have by Intel and
certain other interested parties for (er...) performance reasons
(read: new $$$ revenue), which may be true 5-10 years from now, but
currently it has no more performance advantage over PCI/AGP (other than
mechanical reliability) than 8x AGP has over 4x AGP.

John Lewis
 
However, PCI-Express is being touted as a must-have by Intel and
certain other interested parties for (er...) performance reasons
(read: new $$$ revenue), which may be true 5-10 years from now, but
currently it has no more performance advantage over PCI/AGP (other than
mechanical reliability) than 8x AGP has over 4x AGP.

The same applies to all new technologies, but for those to be used, they must exist
first. For the Average Consumer who just wants to play, quote, BF1942, it does
not bring anything to the table. But it does bring a whole new world to
application developers who can leverage the new programming models. If
new systems are never introduced because they don't benefit
existing applications, we are stuck with what we have for eternity. Of course
this is not the way things work, so it's mostly of academic interest why PCI-E,
for instance, should not be brought to market.

Low-latency point-to-point switch topology is the thing the PC has been
missing since the day it was introduced, and until now it has not been cost
effective to bring it to the desktop for the masses. I'm sure there
are quite a few workstation enthusiasts who sneer at this prospect, and maybe
even feel just a slight bit 'threatened' because the 'PEECEE Toys' are
coming into the territory they have p0wned traditionally. Whatever the
'truth', PCI-E is in practice more than just increased bandwidth. It's a new
way for devices to communicate with each other: where's my latency, dude? :)
 
If you have BF1942, turn on the single-player game and start playing
with the AI level and bot-number settings. Notice anything about what
happens to the frame rate even when only a few bots are visible on the
screen? As the AI level and bot numbers are raised, the poor old CPU
begins to sweat bricks. In fact, you can see it if you have a thermal
monitor on the CPU. By the way, BF1942, like Far Cry, has
autonomous-style AI. Q.E.D.

What does that have to do with feeding the GPU, though..? Burning CPU on
other tasks will of course bring the FPS down, but this is not because the
GPU couldn't render the dataset fast enough if the pipeline is properly
written. QED, indeed sir!

Also, Les pointed out that sophisticated physics engines, like the
Havok rag-doll physics are also a significant additional load on the
CPU.

Havok, sophisticated, all in the same sentence? Uh-huh... why do they use Euler
integration, then? If that is sophisticated, shoot me.
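
For anyone who hasn't met it, explicit (forward) Euler is about the simplest integrator there is. A minimal sketch on a free-falling body follows; nothing here is taken from Havok, and the names and the 60 Hz step are assumptions for illustration only:

```python
def euler_step(pos: float, vel: float, acc: float, dt: float):
    """One explicit Euler step: both updates use the state from the step start."""
    return pos + vel * dt, vel + acc * dt

G = -9.81        # constant acceleration in m/s^2
DT = 1.0 / 60.0  # ASSUMPTION: one physics step per frame at 60 fps

pos, vel = 0.0, 0.0
for _ in range(60):  # simulate one second
    pos, vel = euler_step(pos, vel, G, DT)

exact_pos = 0.5 * G * 1.0 ** 2  # closed-form position after one second
print(f"Euler: {pos:.5f} m, exact: {exact_pos:.5f} m")
```

It is one multiply-add per state variable per step, so it is cheap, and the price is a drift from the exact answer that grows with the step size.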
 
What does that have to do with feeding the GPU, though..? Burning CPU on
other tasks will of course bring the FPS down, but this is not because the
GPU couldn't render the dataset fast enough if the pipeline is properly
written. QED, indeed sir!

Havok, sophisticated, all in the same sentence? Uh-huh... why do they use Euler
integration, then? If that is sophisticated, shoot me.

Euler integration is an extra compute load on the CPU.

Anyway, you have missed the entire point. The capabilities of today's
GPUs are rapidly outstripping the ability of today's CPUs to feed them
unless an array of CPUs is available. So we have this wonderful GPU
with oh-so-pretty graphics and textures starved of the data required
to render these pretty elements, at least in a single-processor PC,
because the CPU has become a lot busier doing other things.

It is easier to add 8 more pixel pipes to a GPU (just a cut and
paste of an existing design, preferably with a process shrink) than
it is to update the CPU core to produce the requisite data to
correctly fill those pipes while handling all the additional
complexity that is being thrown at it by modern computer
applications, particularly games. Game development is not frozen at just
tweaking prettier and prettier graphics. It had better not be!!

Hold off on your comments for the next year, then come back and
tell me, say mid-2005, where the real bottlenecks to PC gaming
efficiency are.
------------------------------------------------------------------------------
BTW, the efficient disposal of CPU and GPU silicon heat is very likely
to be an overshadowing limiting factor on both performance and
design complexity, at least for the next 2-3 years, and will be
painfully obvious by the middle of 2005. The software tools for
generating large silicon have finally outstripped the ability to
practically implement the parts, the limit being the economic removal
of heat. The necessary process shrinks to combat heat are taking
longer and longer and the related masking charges are growing
exponentially; there is no such real-time/cost limit on the design toolsets.


John Lewis
 
Anyway, you have missed the entire point. The capabilities of today's
GPUs are rapidly outstripping the ability of today's CPUs to feed them
unless an array of CPUs is available. So we have this wonderful GPU

Yeah, they sure are, if you have an average of 100-1000 primitives per DIP. If
you want the power out, don't try to be too clever and cull to 'save work':
large batches, large batches, large batches. Now, static geometry is
trivial; this is where you cannot possibly saturate the CPU with 'feed the
GPU' workload. No. ****ing. Way.

Unless you are stupid and split the rendering by material. You need to
preprocess and consolidate the data so that, when a texture does not, fer' instance,
repeat, you map more than a single texture into a single surface and adjust the
texture coordinates according to the remapped texel residence. Et cetera:
minimize the state changes. State changes alone are not the factor that
limits how much 'stuff' the GPU can process: it is the fact that DIPs are
expensive. Either you deal with it, or you don't. If you don't deal with it,
then you suck and deserve the crappy performance.
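
The texture consolidation step can be sketched like this: pack several non-repeating textures into one atlas and rescale each mesh's UVs into its tile, so everything that shared no state before now goes out in one batch. The mesh names, the 2x2 layout and the helper are all hypothetical, and a real atlas also has to deal with filtering bleed at tile borders, which is ignored here:

```python
def remap_uv(u, v, tile_x, tile_y, tiles_per_side):
    """Move a [0, 1] UV pair into tile (tile_x, tile_y) of a square atlas."""
    scale = 1.0 / tiles_per_side
    return ((tile_x + u) * scale, (tile_y + v) * scale)

# Hypothetical input: three meshes, each with its own texture and UVs.
meshes = {
    "crate":  [(0.0, 0.0), (1.0, 1.0)],
    "barrel": [(0.5, 0.5)],
    "sign":   [(0.25, 0.75)],
}
# Hypothetical atlas layout: 2x2 tiles, one source texture per tile.
tiles = {"crate": (0, 0), "barrel": (1, 0), "sign": (0, 1)}

merged_uvs = []  # one vertex stream sampling one texture
for name, uvs in meshes.items():
    tx, ty = tiles[name]
    merged_uvs.extend(remap_uv(u, v, tx, ty, 2) for u, v in uvs)

draw_calls_before = len(meshes)  # one call per texture change
draw_calls_after = 1             # everything shares the atlas
print(merged_uvs)
```

This only works when the UVs stay inside [0, 1], which is why repeating textures are the exception mentioned above.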

Some people design the 'engine' and rendering pipeline around virtual,
completely abstract requirements; some people design their rendering
pipelines around what the hardware can do and how it does it. When you do
things the way the hardware works (i.e. large batches) then there are no
arbitrary limits set by poor engineering choices along the way. It is
commonly seen that old practices and experiences are transferred to new
platforms as-is; this doesn't work and initially the quality is not very
good. That is normal, but when you work one year, two years, then three
years on a new platform it should begin to sink in what is feasible and what
isn't. If the goal is just to get the basic, vanilla workload done, then
optimizations to the rendering pipeline architecture are irrelevant because the
stuff is Fast Enough *anyway*. This kind of code, when it tries to scale
up as the number of primitives increases, might hit a wall at some point.
Too bad, tough shit, get over it. :)
 
For what it's worth, dynamic meshes are more 'CPU' hostile, and a lot of
Good Looking Slick stuff is dynamic. I won't go into details on why in some
instances it can be preferable to fill a dynamic VB with the CPU and play it back
with a vertex program which only lerps between keyframes, say, every 6 frames
(with too long a period the animations will lag: say you hit a
guy with a gun and the reaction is at worst 6 frames behind; at 60 fps this
means 100 ms, 0.1 seconds, which is noticeable but still not too bad;
an average reaction time of 50 ms is a-okay).
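
The tweening and the latency arithmetic above can be written out in a few lines. This is a CPU-side illustration of what the vertex program would compute per vertex; the function names and the sample keyframe positions are made up:

```python
def lerp(a: float, b: float, t: float) -> float:
    """Linear interpolation between a and b for t in [0, 1]."""
    return a + (b - a) * t

def tween_vertex(v0, v1, t):
    """Blend one vertex position between two keyframes."""
    return tuple(lerp(a, b, t) for a, b in zip(v0, v1))

def worst_case_delay_ms(update_period_frames: int, fps: float) -> float:
    """Longest wait before newly written keyframe data can show up."""
    return update_period_frames * 1000.0 / fps

PERIOD = 6   # keyframes refreshed every 6 frames, as in the post
FPS = 60.0

worst = worst_case_delay_ms(PERIOD, FPS)
average = worst / 2.0

# Halfway between two keyframes (frame 3 of 6).
halfway = tween_vertex((0.0, 0.0, 0.0), (6.0, 0.0, 0.0), 3 / PERIOD)
print(f"worst {worst:.0f} ms, average {average:.0f} ms, vertex {halfway}")
```

Making the period a parameter makes the trade-off explicit: a longer period means fewer vertex buffer updates but a longer worst-case reaction delay.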

Anyways, palette skinning is feasible on only a few 3D
chips, and those are not "mainstream", so I wouldn't count on it too much. On
the other hand the number of constant registers is a bit limited so you
can't fit too much stuff there either, unless you break the workload (ouch,
just what we don't want to do) into smaller batches where we have enough
constant registers. So wassup with that? Then we resort to tweening (lerp)
and fill in the animation every N frames (like I said, 6 is okay for
instance; make it configurable, all the more power to you) and this way
the workload can be amortized over time. Or, if you have a lot of objects, then
fill-at-once is also okay if the animation system manages a uniform
distribution of update frames over time (this is what I choose to do).
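
One simple way to get that uniform distribution is to give each object a phase offset so that only a fraction of the objects refresh their keyframe data on any given frame. This is a sketch of the general idea, not the poster's actual scheme; the modulo rule and the object count are assumptions:

```python
def objects_due(frame: int, num_objects: int, period: int):
    """Indices of objects whose keyframe data is refreshed on this frame."""
    return [i for i in range(num_objects) if (frame + i) % period == 0]

PERIOD = 6        # each object refreshes once every 6 frames
NUM_OBJECTS = 12  # hypothetical scene

per_frame = [objects_due(f, NUM_OBJECTS, PERIOD) for f in range(PERIOD)]
for f, due in enumerate(per_frame):
    print(f"frame {f}: refresh objects {due}")
```

Every object is still updated once per period, but the per-frame cost stays flat instead of spiking every sixth frame.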

VS 3.0 introduces samplers to vertex programs: this means vertex
programs will have the memory latency of dependent texture lookups to deal with,
but on the good side it offers the ability to fit a 'massive' amount of
floating-point data, accessible to the vertex program, into GPU local
memory. You don't need to be Einstein to figure out what this means for the
quality of graphics output we can achieve, with less brainwork spent on designs
that minimize the CPU's effect.

Now, actually, my problem is rather that I am not CPU limited. Never
have been with DX9 and hardware which implements the necessary feature set. It's
funny, but you can easily run out of GPU steam with a ~1 GHz Athlon
processor. However, it is more likely that you run out of fillrate before
you starve the fragment program of vertex input. But let's hypothetically
assume we run out of vertex processing power in the GPU: this means our
level-of-detail handling is sub-optimal at best. Some do geomorphing, some
don't. For those who don't, the LOD levels go "pop, pop, pop, crackle"; that's fine
in a commercial-quality game. It's common knowledge that consumers whine only
about things they know to whine about: things like lack of 32-bit color
(anyone?), lack of precision in the fragment programs (anyone care to
remember the 3DMark03 fiasco with NVIDIA last year?), et cetera, et cetera, ad
nauseam.

An intelligent, fast, GPU-friendly, zero-CPU-time LOD system is possible and
in fact ****ing easy to write for GPUs in a way that does not waste GPU
power either. And no popping: a seamless, smooth transition from one virtual LOD
level to the next. I leave it as an exercise to the reader to figure out how.
I have never seen a paper published about the technique we use, nor seen it in the usual
sources which publish everything they can get their hands on. But
it's trivial, and it's fast, and it's smooth. So make no mistake, I don't
claim to be a super-cool programmer; on the contrary, it is so simple that I
am baffled why it is not done more often. Maybe it is done more often and
people just don't want to talk about it because it is way too ****ing
trivial to figure out as soon as you get the idea into your head that it is
possible in the first place (no, it's really nothing you could read about in
any textbook or tutorial, don't try to give me any crap about that :)

Anyways, the point is that dynamic stuff is possible, too, without a
significant CPU speed hit. Hell, it is possible with zero CPU hit once the
new NV and ATI cards hit the streets (on those cards anyway :)

But whatever, not like I care how shitty the code some people write is anyway. ;-)
 
So the graphics code that you write is auto-aware of the performance
of the GPU and CPU and automatically removes graphical elements
non-essential to gameplay... assuming that you write graphics
code for games? The Far Cry programmers have deliberately done this,
and fairly successfully. Such flexibility is necessary to preserve
frame rate while accommodating a sufficient range of graphics
hardware to assure adequate sales volume! Lack of frame rate
in an FPS is a killer (literally) regardless of the prettiness of the
graphics.

What you have been describing so far seems to assume a pretty complete
GPU hardware implementation of a still-evolving DX9+ standard. Neither
NVidia nor ATI has a complete hardware implementation. That situation
will continue with the next-gen chips, as it is much quicker to evolve
a standard than it is for the silicon to implement it.

John Lewis
 