Well I still have seen no real answer.
Yes, you have. You just can't recognize them as such. Because of that, I will
explain things in a "talking down" fashion, since treating you as someone
with a basic grasp of the concepts involved has not worked for anyone who
responded to you.
The basic 'wrong' in your question concerns what the CPU and the GPU are
designed to do. If you just look at the raw number of scalar computations a
CPU at 2-3 GHz and a GPU at ~500 MHz can do per second, the average GPU beats
the CPU every single time. GPUs are brutally efficient parallel signal
processors.
GPUs win for raw processing power, but they have limitations. First, GPUs
have a fixed feature set and are not general-purpose programmable processing
units like CPUs. This means there is no programmable flow control affecting
the Program Counter (PC), also known as the Instruction Pointer (IP). The
latest GPUs have dynamic branching and so on, but the input for each fragment
is still configured the same way, and the IP strictly just executes the same
series of instructions over and over (branching omitted). The fragment and
vertex processors actually deployed in the real world implement "branching"
by computing both paths and discarding the results of the path that was "not
taken". But these are just details; the important bit is that the GPU is
specialized in the kind of parallel computations general-purpose CPUs are
very poor at.
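To make the "compute both paths and discard one" idea concrete, here is a
rough sketch in plain C++ (the mask-and-select trick and all the names are
mine, purely for illustration; real shader hardware does not look like this
code):

// Sketch of "branching" by computing both paths and keeping one result.
// This is only an illustration of the idea, not real shader code.
#include <cstdio>

float shade(float x)
{
    // Both "branches" are evaluated for every fragment...
    float ifTaken    = x * 2.0f;        // path A
    float ifNotTaken = x * 0.5f + 1.0f; // path B

    // ...and a per-fragment mask selects which result survives.
    float mask = (x > 0.5f) ? 1.0f : 0.0f;
    return mask * ifTaken + (1.0f - mask) * ifNotTaken;
}

int main()
{
    printf("%f %f\n", shade(0.25f), shade(0.75f));
    return 0;
}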
CPUs do have SIMD instruction sets and the ability to handle multiple
elements of data simultaneously. Adding to the mix are details such as
pipelining, which means a number of instructions are "in flight"
simultaneously, and thus multiple elements of data are being processed at
once, but this is not the same thing as processing multiple elements of data
in parallel (the 'model' the software is written in is still serial).
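As a rough illustration of CPU SIMD (assuming an x86 compiler with
<xmmintrin.h>; the example is mine, not from any particular codebase): one
SSE instruction touches four floats at once, while the surrounding program is
still written and scheduled serially.

// One SSE add works on four floats in a single instruction,
// but the program around it is still a serial stream of instructions.
#include <xmmintrin.h>
#include <cstdio>

int main()
{
    float a[4] = { 1.0f, 2.0f, 3.0f, 4.0f };
    float b[4] = { 10.0f, 20.0f, 30.0f, 40.0f };
    float r[4];

    __m128 va = _mm_loadu_ps(a);     // load 4 floats
    __m128 vb = _mm_loadu_ps(b);     // load 4 floats
    __m128 vr = _mm_add_ps(va, vb);  // 4 additions in one instruction
    _mm_storeu_ps(r, vr);

    printf("%f %f %f %f\n", r[0], r[1], r[2], r[3]);
    return 0;
}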
Can I safely conclude it's non-deterministic... in other words people don't
know shit.
Oh, it is very deterministic and precise. It just so happens that there are
literally hundreds of different GPUs and CPUs out there, so giving you a
one-size-fits-all answer is practically impossible. But if we limit the
choices a little bit, given a budget of $500 for a given task, the GPU yields
superior bang for the buck for per-fragment computations and the CPU yields
superior bang for the buck for general-purpose programmability.
If you look only at raw computing power per cost unit, the GPU is an order of
magnitude 'superior' to the CPU. But since it can only do certain kinds of
jobs, CPUs are still going strong and remain a very important component of a
contemporary personal computer.
So the truth to be found out needs testing programs!
Nonsense.
Test it on P4
Test it on GPU
And then see who's faster.
A '****ing old' Voodoo2 graphics card will beat a P4 at rasterization for the
feature set the Voodoo2 supports. When we start doing work the Voodoo2 does
not support, the P4 will 'win' simply because the P4 can do things the
Voodoo2 was never designed to do in the first place.
If we simply talk about per-fragment computations, and with the latest
generation of GPUs geometry-related work as well, then the CPU has the chance
of a snowball in hell of beating the GPU. This is so obvious that graphics
programmers never even talk about the topic; there is NOTHING to talk about.
More and more work is being moved to the GPU as programmability steadily
increases. The nVidia GeForce 6800 generation hardware will be able to sample
from textures in the vertex shader, which will be a big feature for shader
programmers. But that is outside the scope of your question anyway.
Since I don't write games it's not interesting for me.
I do hope game writers will be smart enough to test it out
They don't really have to; it's Common Knowledge. And I don't mean that
virtually everyone takes it for granted because 'I've heard it from some
bloke who said so', but because it's such ****ing fundamental basics that it
is unavoidably established VERY early in anyone's incursion into GPU
programming TO BEGIN WITH.
Why?
I'll give a practical example. Take a very basic filter, let's say ONLY a
bilinear filter for sampling from textures. When you implement this on the
CPU, you basically have to write something like this:
color = color0 * weight0 + color1 * weight1 + color2 * weight2 + color3 * weight3;
Here color0 through color3 are the four color samples from a 2x2 block of the
texture, and weight0 through weight3 are computed from the fractional texture
coordinates in the horizontal and vertical directions. The same computation
can also be done with three linear interpolations, two in one dimension and
one in the other. The point is that the blending alone is four
multiplications and three additions PER COLOR COMPONENT, and we usually have
four components (red, green, blue and alpha). So for the color mixing alone,
a bilinear filter costs the CPU 28 arithmetic operations. That's not counting
how the texture coordinates are interpolated, how the weight factors are
computed from the fractional coordinates, and so on.
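For the curious, a rough CPU-side sketch of that bilinear fetch could look
like the following (the texture layout, struct and function names are made up
for illustration, and edge clamping is left out):

// Sketch of a CPU-side bilinear texture fetch. Not any particular API.
struct Color { float r, g, b, a; };

// Weighted blend of four samples: 4 muls + 3 adds per component,
// 28 arithmetic operations total for RGBA.
Color bilinear(const Color tex[], int width, float u, float v)
{
    int   x0 = (int)u,  y0 = (int)v;
    float fx = u - x0,  fy = v - y0;   // fractional coordinates

    const Color& c00 = tex[y0 * width + x0];            // 2x2 block
    const Color& c10 = tex[y0 * width + x0 + 1];        // (edge clamping
    const Color& c01 = tex[(y0 + 1) * width + x0];      //  omitted for
    const Color& c11 = tex[(y0 + 1) * width + x0 + 1];  //  brevity)

    float w00 = (1 - fx) * (1 - fy);   // weights from the fractions
    float w10 = fx       * (1 - fy);
    float w01 = (1 - fx) * fy;
    float w11 = fx       * fy;

    Color out;
    out.r = c00.r * w00 + c10.r * w10 + c01.r * w01 + c11.r * w11;
    out.g = c00.g * w00 + c10.g * w10 + c01.g * w01 + c11.g * w11;
    out.b = c00.b * w00 + c10.b * w10 + c01.b * w01 + c11.b * w11;
    out.a = c00.a * w00 + c10.a * w10 + c01.a * w01 + c11.a * w11;
    return out;
}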
The CPU is serial: it has to do all these computations one after another.
Superscalar execution and other factors bend this fact physically, but they
don't do magic. Now let's look at how the GPU hacks this problem.
The GPU is a cunning little ****er: it has its own dedicated transistors for
this work. When you want to sample from a texture, the hardware uses the
transistors allocated for the job and does the FILTERED lookup virtually for
free. Now, there is LATENCY in memory.. the results don't come out
immediately, but the GPU is processing pixels in parallel and knows which
pixel will be processed next, and so on. If the GPU pixel processor is
pipelined (you can bet your ****ing ass it is), it can do the FILTERED lookup
while other parts of the hardware are still processing the PREVIOUS pixel. By
the time the part of the hardware that wants the filtered color value needs
it, the value has arrived where it is needed. The key idea here is that the
FILTER "unit" in the chip does nothing but look up filtered color values for
the other parts of the chip.
This means the computations are DELAYED.. but it does not matter, because it
is not critical if the results arrive a little bit later. No one is any worse
off because of this arrangement, thanks to the nature of the work: the job is
to fill pixels with a certain color and to do a LOT of that computation in a
given time. The CPU, on the other hand, must do every single step of the job
as quickly as possible, because the next instructions rely on the previous
ones being completed (if the results are needed; if not, instructions can of
course be executed in a different order, which is why it is called
out-of-order execution). The key point here is that the computations have no
DEPENDENCIES on earlier computations. Each fragment is a unique entity and
only intra-fragment computations matter, hence it is possible to push the
design to speeds the CPU can only dream of for this kind of computational
work. Example follows.
Now, if you want to fill a 100-pixel triangle and each pixel takes
approximately 40 clock cycles to complete, we need 4000 clock cycles to fill
the triangle. This is very optimistic because memory latency will make the
situation MUCH worse, but let's give the CPU as much advantage as we can.
Now, let's look at how the GPU does shit. Let's assume it takes 40 clock
cycles per pixel for the GPU as well. Hell, let's give the GPU 200 clock
cycles per pixel (5x slower!!!). The GPU will still beat the CPU hands down,
even when it is 5 times slower. You know why?
This is MAGIC! Look closely:
The first 200 clock cycles are spent on the first pixel, then its color is
ready. But every clock cycle we can start work on the next pixel.. so we end
up with up to 200 pixels "in flight" (being processed) at once. Now, at cycle
200, we still have 99 pixels left to finish.. so the total time for our work
is 299 clock cycles. That is more than 13 times shorter than what the CPU
needed for the job, even though the cost per pixel was 5 times higher!
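If you want to check that arithmetic yourself, here is a trivial sketch using
the same numbers as above (this is only the back-of-the-envelope model from
this post, not a real GPU simulation):

// A serial processor pays the full per-pixel cost for every pixel;
// a pipeline pays the latency once and then retires one pixel per cycle.
#include <cstdio>

int main()
{
    const int pixels          = 100;
    const int cpuCyclesPerPix = 40;
    const int gpuLatency      = 200; // cycles until the first pixel is done

    int cpuTotal = pixels * cpuCyclesPerPix;  // 100 * 40 = 4000 cycles
    int gpuTotal = gpuLatency + (pixels - 1); // 200 + 99 =  299 cycles

    printf("CPU: %d cycles, GPU: %d cycles (%.1fx faster)\n",
           cpuTotal, gpuTotal, (float)cpuTotal / gpuTotal);
    return 0;
}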
OK, having 200 pixels execute simultaneously would be a GPU design that does
not really exist.. it would require 200 stages in the fragment pipeline,
which is nowhere near the truth.. but I used it as an example to demonstrate
how the GPU has an "unfair" advantage over the CPU for pixel work. The real
situation is closer to tens of stages than hundreds.. but each stage is very
fast and "free" because each stage has its own transistors doing the
computation. The CPU has only so many adders, multipliers, shifters and so on
that it can use simultaneously. The key principle of efficient CPU design is
to keep as many of those units busy simultaneously as possible. This is why
the Pentium PRO and later Intel processors break IA32 instructions into
internal micro-ops (PPRO - P3), and the Pentium4 goes even further: IA32 code
is translated dynamically on the fly into its own internal code. That
translation is very expensive, so the Pentium4 design team added an area of
the chip where the translated code is stored; it is called the "trace cache",
which you might have heard of.
This oversimplifies the situation A LOT, and I could write all day long
filling in the gaps to be more precise, for the sake of vanity and to avoid
persecution by my peers and colleagues, but since those who know who I am
know what I know, I don't quite see the point.
Now is your question satisfied?