AirRaid
11 pages!
http://dpad.gotfrag.com/portal/story/35372/?cpage=1
from page 4
"#
How is one CPU better than another?
GFLOPS is something that gets thrown around a lot, but it should be
clear that the peak theoretical GFLOP numbers for both these CPUs are:
# 115GFLOPS Theoretical Peak Performance for 360 CPU
# 218GFLOPS Theoretical Peak Performance for PS3 CPU.
Neither theoretical peak will be achieved in real-world performance. The testing IBM did to establish these peaks can't really be considered representative of how the processors will actually perform in real-world situations, because the testing was too controlled. It's far too perfect an environment, and game development involves an unforgiving environment that bears little resemblance to the conditions the CPUs were tested under.
The GFLOP number for the PS3 was calculated based on 8 running SPE, so the fact that the PS3 uses only 6 SPE for game applications lowers the theoretical peak even further, as the majority of the floating point work on the PS3's CPU is done by the SPE. Each SPE has a theoretical peak of 25.6GFLOPS, so the total theoretical peak for all 6 SPE would be 153.6GFLOPS. But why is even that number not achievable?
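Before answering that, here's a quick sanity check on where the per-SPE number comes from (a rough sketch in plain Python; the 3.2GHz clock and 8 single-precision flops per SPE per cycle are the widely published Cell specs, not figures from this article):

# Rough check of the per-SPE figures quoted above.
# Assumed specs: 3.2GHz clock, 4-wide single-precision SIMD with
# fused multiply-add = 8 flops per cycle per SPE.
clock_ghz = 3.2
flops_per_cycle = 8
per_spe = clock_ghz * flops_per_cycle       # 25.6 GFLOPS per SPE
print(per_spe)                              # 25.6
print(6 * per_spe)                          # 153.6 GFLOPS for the 6 game-usable SPE
print(8 * per_spe)                          # 204.8 GFLOPS for all 8; the 218 figure for the whole Cell is higher still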
In IBM's controlled testing environment, their optimized code running on 8 SPE only yielded 155.5GFLOPS. If it took 8 SPE to achieve that, 6 certainly won't, and even that testing was done in a fashion that didn't model all the complexities of DMA and the memory system. Using a 1Kx1K matrix and 8 SPE they achieved 73.4GFLOPS, but the PS3 uses 6 SPE for games and those tests were done in controlled environments. So going by this information, even 73.4GFLOPS is seemingly out of reach. This shows that Sony didn't necessarily lie about the Cell's performance, since they made clear the 218GFLOPS was "theoretical," but just like Microsoft they definitely wanted you to misinterpret these numbers into believing they were achievable.
Even taking into consideration that neither CPU can reach those crazy peak numbers, the PS3's Cell still comfortably comes out on top in terms of overall floating point capability. It should be known, though, that the available power on the PS3's Cell will be significantly more difficult to harness than the available power on the 360's CPU.
It's also worth mentioning that the PS2's CPU had more than twice the GFLOPS of the original Xbox's CPU, yet that didn't make it the performance winner. This time around, while the Cell has the GFLOPS advantage, its lead isn't as big as the one the PS2's CPU had over the Xbox. This teaches us that there is more than one measure of real-world performance.
The PS3's Cell processor has 1 Power PC core, similar to the 3 Power PC cores that make up the 360's CPU (but without the VMX-128 enhancements available on each of the 360's cores), and 7 SPE (synergistic processing elements); the 8th SPE is disabled to improve yields. One of the SPE is used to run the PS3's operating system, while the other 6 are available for games. The reason the PS3's CPU will be significantly more difficult to program for is that the CPU is asymmetric, unlike the 360's CPU. Because the PS3's CPU has only 1 PPE compared to the 360's 3, all game control, scripting, AI and other branch intensive code will need to be crammed into two threads which share a very narrow execution core with no instruction window. The Cell's SPE will be unable to help out here as they are not as robust; they're not fit for accelerating things such as AI, which is fairly branch intensive, because the SPE lack branch prediction capability entirely.
I'm sure people remember from the earlier section how the 360's and PS3's processors are less robust than the processors we use in our desktop computers, and the consequences of in-order execution. Well, the PS3's SPE are stripped down even further than the Power PC cores and, as a result, aren't capable of handling as many different types of code as the 1 Power PC core in the PS3's Cell or the 3 Power PC cores in the 360's CPU. The problem with being asymmetric is that the method of programming you used to get the most out of the Power PC core on the PS3's CPU is no longer effective when breaking off tasks for the SPE to work on. Going from the PPE to the SPE on the PS3 requires a different compiler and a different set of tools.
When you realize that the key to making up for the CPUs' in-order execution is rather complicated parallel programming, you realize that the CPU being asymmetric and having just a single PPE makes something that was already extremely difficult even more difficult. A developer's job gets harder still when you factor in that the PS3 has a 512KB L2 cache, half the size of the 360 CPU's 1MB L2 cache: that single PPE isn't receiving much help with branches in the cache department.
Microsoft made a better decision from the perspective of the developer; it's still difficult, but much easier than working with the Cell architecture. The 360's CPU isn't asymmetric like the PS3's Cell and has 3 PPE as opposed to 1, and all 3 are robust enough to handle the type of code only the PS3's single PPE is capable of handling. When Microsoft says they have three times the general purpose processing power, this is what they mean. Based on the simple fact that the 360 has 3 Power PC cores to the PS3's 1, more processing power can be dedicated to things such as game control, AI, scripting and other branch intensive code.
From the perspective of a developer, the 360 CPU's biggest advantage is that all 3 of its cores are identical: they all run from the same memory pool, they're synchronized, and they're cache coherent. You can just create an extra thread right in your program and have it do some work. This allows the developer to create very clean structures: if you know how to get the best possible performance out of one core, you know how to get the best possible performance out of all 3, because they operate in perfect sync.
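To make that concrete, here's a minimal sketch of the "create an extra thread and hand it some work" pattern described above (generic Python, not 360 code; on the 360 this would be a native thread, but the structure is the same on any symmetric CPU):

import threading

results = {}

def worker(name, data):
    # placeholder workload; on a symmetric CPU any core can run this code
    results[name] = sum(x * x for x in data)

# spin up an extra thread and let it work alongside the main thread
t = threading.Thread(target=worker, args=("physics", range(1_000_000)))
t.start()
# ... the main thread keeps doing its own game-loop style work here ...
t.join()
print(results["physics"])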
Each core on the 360's processor can run 2 hardware threads (think of it as similar to hyper-threading), so the 360's CPU can handle 6 threads running simultaneously. This brings me to a very important advantage for the PS3's Cell CPU: its concurrency. While the 360's CPU may be able to handle 6 threads simultaneously, it still only has 3 physical cores, so every 2 threads must share the processing power of a single core. The PS3, on the other hand, has 1 PPE plus 6 SPE for games, and the SPE are like extra physical processors. If each of the PS3's 6 game-usable SPE is working on a specific task such as collision, cloth physics, animation, water surface simulation or particles, it doesn't need to worry about processing power being taken away by another part of the game, because the SPE don't share processing power.
The only cause for concern would be the 512KB L2 cache being shared by 7 simultaneously running SPE and a PPE, but that's what developers are for; they work around things like this. In practice, this should allow PS3 games to potentially have more going on at once than 360 games. Setting aside the difficulties of programming for the PS3's CPU, it should be known that it is very good when it comes to vertex-related operations, because the PS3's CPU handles graphics code better than the 360's CPU. It is also possible that, with good parallelization of physics code across the SPE, physics could also run better on the PS3's CPU thanks to the concurrency advantage.
The 360's CPU, however, thanks to its 3 symmetric general purpose cores, is not only much easier to program for than the Cell, but having 3 PPE capable of handling things such as AI also means it will be the better of the 2 CPUs when it comes to AI code. Either way, we can look forward to great things from both CPUs in the future.
Before I end off, I'd like to point out a game that, in my opinion, is from a technical standpoint one of the most brilliant uses of the PS3's CPU. All things considered, such as in-order execution and the other complications of the architecture, Heavenly Sword is quite the standout in nearly every regard: incredible combat animations, awesome group enemy AI, and great physics. At the very least this is what I gathered from seeing videos of the E3 demo; it's a reminder that regardless of the challenges, there are developers up to the challenge, and it's only going to get better with time."
from pages 7, 8, 9
RSX (PS3GPU) & Xenos (360GPU)
Alright, let's get underway. The GPU inside the PS3 is NV47 based, which is another name for the 7800GTX. It has 24 pixel shader pipelines and 8 vertex shader pipelines, and it's capable of 136 shader operations per clock. According to Sony it has 256MB of GDDR3 memory at 700MHz and performs 74.8 billion shader operations per second. Sony also said it's capable of 1.8 teraflops, which I can tell everyone right now with 100% confidence isn't true; it's a numbers game. I'm not entirely sure of all the little tricks they used to arrive at such an extreme flops number, but rest assured it isn't a type of performance this GPU will ever really achieve. PC videocards such as the X1900XTX have far more raw horsepower than either of the videocards in either console, with a GPU clock speed of up to 650MHz (some have shipped at 675MHz) and 24 more pixel shader pipelines, yet the X1900XTX is just over 500GFLOPS. To even begin entertaining the thought that a less advanced GPU with significantly less raw power could brute force 1.3 teraflops more performance is wishful thinking. But there is no cause to be angry at Sony in this case; they are entitled to market their product however they choose, and as long as they avoid disturbingly untrue statements about the competition, it's all fair game as far as I'm concerned.
I'm sure some people are wondering how Sony came to the conclusion that the RSX does 136 shader operations per clock, or 74.8 billion shader ops per second. Easy:
# The RSX has 24 pixel pipes, each of which performs 5.7 ops: 5.7 ops * 24 pixel pipelines = 136.8 shader ops per clock.
# The RSX is clocked at 550MHz: 550MHz * 136 shader ops per clock = 74,800 million (74.8 billion) shader ops per second.
There is talk, and even an event in Japan which Sony attended, claiming that the RSX will no longer run at 550MHz and will instead be clocked at 500MHz, with the 256MB of GDDR3 at 650MHz instead of 700. There is a lot pointing to this being true, but Sony still hasn't officially admitted it, so I'm not sure what to think. It is, however, a perfect opportunity to see if we've learned how to calculate this stuff. If the RSX is clocked at 500MHz, then 500MHz * 136 shader ops per clock would make the new figure 68 billion shader operations per second instead of the original 74.8 billion, weakening the GPU's performance. I guess we won't truly find out until the PS3 releases, because, if anyone has noticed, Sony has never posted the RSX clockspeed on the official PS3 site, nor did they reiterate it at E3 06.
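Here is that arithmetic at both rumored clocks, as a quick check (plain Python, just reproducing the article's own numbers):

# RSX shader ops per second at the two rumored clock speeds,
# using the article's rounded 136 ops per clock (24 pipes * 5.7 ~= 136.8).
ops_per_clock = 136
for clock_mhz in (550, 500):
    ops_per_sec = ops_per_clock * clock_mhz * 1_000_000
    print(clock_mhz, "MHz ->", ops_per_sec / 1e9, "billion shader ops per second")
# 550 MHz -> 74.8 billion, 500 MHz -> 68.0 billion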
The RSX has 20.8GB/s of video memory bandwidth from the GDDR3 RAM, plus an extra 32GB/sec for writing to the system's main memory. If the RSX can fully utilize the memory system, it can push out 58.2GB/s worth of pixel rendering to memory. The RSX is pretty much a 7800GTX-class GPU; in some cases it's worse, in some cases better, but nothing is really new. The same can't be said about the 360's GPU at all.
Now, the 360's GPU is one impressive piece of work, and I'll say from the get-go that it's much more advanced than the PS3's GPU, so I'm not sure where to begin. I'll start with what Microsoft said about it: Xenos is clocked at 500MHz, it has 48-way parallel floating-point dynamically-scheduled shader pipelines (48 unified shader units, or pipelines), and it has a polygon performance of 500 million triangles a second.
Before going any further, I'll clarify this 500 million triangles a second claim. Can the 360's GPU actually achieve this? Yes it can, BUT there would be no pixels or color at all. It's the triangle setup rate for the GPU, and it isn't surprising it has such a high setup rate given it has 48 shader units capable of performing vertex operations, whereas all other released GPUs can dedicate only 8 shader units to vertex operations. The PS3 GPU's triangle setup rate at 550MHz is 275 million a second, and at 500MHz it would be 250 million a second. This is just the setup rate; do NOT expect to see games with such an excessive number of polygons, because it won't happen.
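For anyone checking those setup-rate numbers, they work out to one triangle per clock on Xenos and one every two clocks on the RSX; that per-clock rate is my reading of the figures above, not something the article states outright:

# Triangle setup rates implied by the numbers above.
def setup_rate_millions(clock_hz, tris_per_clock):
    return clock_hz * tris_per_clock / 1e6

print(setup_rate_millions(500e6, 1.0))   # Xenos at 500MHz -> 500.0 million/s
print(setup_rate_millions(550e6, 0.5))   # RSX at 550MHz   -> 275.0 million/s
print(setup_rate_millions(500e6, 0.5))   # RSX at 500MHz   -> 250.0 million/s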
Microsoft also says it can achieve a pixel fillrate of 16 gigasamples per second. The GPU inside the Xbox 360 is essentially an early ATI R600, which, when released by ATI for the PC, will be a DirectX 10 GPU. Xenos manages to meet many of the requirements that would qualify it as a DirectX 10 GPU, but falls short in others. What I found interesting is that back in 2005 Microsoft said the 360's GPU could perform 48 billion shader operations per second. However, Bob Feldstein, VP of engineering for ATI, made it very clear that the 360's GPU can perform 2 of those shader operations per cycle, so it is actually capable of 96 billion shader operations per second.
To quote ATI on the 360's GPU:
"On chip, the shaders are organized in three SIMD engines with 16
processors per unit, for a total of 48 shaders. Each of these shaders
is comprised of four ALUs that can execute a single operation per
cycle, so that each shader unit can execute four floating-point ops per
cycle."
# 48 shader units * 4 ops per cycle = 192 shader ops per clock
# Xenos is clocked at 500MHz: 500MHz * 192 shader ops per clock = 96 billion shader ops per second.
(Did anyone notice that each shader unit on the 360's GPU doesn't perform as many ops per pipe as the RSX? The 360's GPU makes up for it with a superior architecture: many more pipes, which operate more efficiently, along with more bandwidth.)
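Putting the two calculations side by side (this just reproduces the figures already given; the 74.8 uses the rounded 136 ops per clock):

# Per-clock and per-second shader op comparison, RSX vs Xenos.
rsx_per_clock = 24 * 5.7                  # ~136.8 ops/clock: fewer pipes, more ops per pipe
xenos_per_clock = 48 * 4                  # 192 ops/clock: more pipes, 4 ops each
print(rsx_per_clock, xenos_per_clock)
print(136 * 550e6 / 1e9)                  # 74.8 billion/s for RSX at 550MHz
print(xenos_per_clock * 500e6 / 1e9)      # 96.0 billion/s for Xenos at 500MHz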
Did Microsoft just make a mistake, or did they purposely misrepresent their GPU to lead Sony on? The 360's GPU is revolutionary in the sense that it's the first GPU to use a unified shader architecture. According to developers this is as big a change as when the vertex shader was first introduced, and even then the vertex shader was merely an add-on, not a major change like this. The 360's GPU also has a daughter die right there on the package containing 10MB of EDRAM. This EDRAM has a framebuffer bandwidth of 256GB/s, which is more than 5 times what the RSX or any PC GPU has for its framebuffer (even higher than the G80's).
Thanks to the efficiency of the 360 GPU's unified shader architecture and this 10MB of EDRAM, the GPU is able to achieve 4xFSAA at no performance cost. ATI and Microsoft's goal was to eliminate memory bandwidth as a bottleneck, and they seem to have succeeded. Any PC gamers out there will have noticed that when they turn on things such as AA or HDR, performance goes down; that's because those features eat bandwidth, so the efficiency of the GPU's operation decreases as they are turned on. On the 360, HDR plus 4xAA simultaneously is like nothing to the GPU, with proper use of the EDRAM. The EDRAM contains a 3D logic unit with 192 floating point unit processors inside. The logic unit can exchange data with the 10MB of RAM at 2 terabits a second. Things such as antialiasing, computing z depths or occlusion culling can happen on the EDRAM without impacting the GPU's workload.
Xenos writes to this EDRAM for its framebuffer and is connected to it via a 32GB/sec link (a number extremely close to the theoretical, because the EDRAM sits right there on the 360 GPU's daughter die). Don't forget the EDRAM has a bandwidth of 256GB/s; divide that 256GB/s by the initial 32GB/s of the Xenos-to-EDRAM connection and you find that Xenos can multiply its effective bandwidth to the framebuffer by a factor of 8 when processing pixels that make use of the EDRAM, which includes HDR, AA and other things. This gives a maximum of 32*8=256GB/s which, to say the least, is a very effective way of dealing with bandwidth intensive tasks.
For this to be possible, developers need to set up their rendering engine to take advantage of both the EDRAM and the onboard 3D logic. If anyone is confused about why the 32GB/s is being multiplied by 8, it's because once data travels over the 32GB/s bus it can be processed 8 times by the EDRAM logic against the EDRAM memory at a rate of 256GB/s; for every 32GB/s you send over, 256GB/s gets processed. This leaves the RSX at a bandwidth disadvantage compared to Xenos. Needless to say, the 360 not only has an overabundance of video memory bandwidth, it also has impressive memory saving features. For example, 720p with 4xFSAA on a traditional architecture would require 28MB worth of memory; on the 360, only 16MB is required. There are also features in the 360's Direct3D API that let developers fit 2 128x128 textures into the space normally required for one. So even with all the memory and all the memory bandwidth, they are still very mindful of how it's used.
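Here's a quick check of the 8x multiplier and the 28MB figure; the 28MB works out if you assume 4 bytes of color plus 4 bytes of depth/stencil per sample, which is my assumption rather than the article's, and the 16MB figure for the 360 comes from the article, not from this calculation:

# Effective EDRAM bandwidth and a traditional 720p 4xFSAA framebuffer size.
print(32 * 8)                             # 256 GB/s effective bandwidth

pixels = 1280 * 720                       # 720p
bytes_per_sample = 4 + 4                  # 32-bit color + 32-bit depth/stencil (assumed)
samples = 4                               # 4xFSAA
size_mb = pixels * bytes_per_sample * samples / 2**20
print(round(size_mb, 1))                  # ~28.1 MB on a traditional architecture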
I wasn't too clear earlier on the difference between the RSX's dedicated pixel and vertex shader pipelines and the 360's unified shader architecture. The 360's GPU has 48 unified pipelines capable of accepting either pixel or vertex shader operations, whereas with the older dedicated pixel and vertex pipeline architecture that RSX uses, in a vertex-heavy situation most of the 24 pixel pipes go idle instead of helping out with vertex work.
On the flip side, in a pixel-heavy situation those 8 vertex shader pipelines just sit idle and can't help out the pixel pipes. With the 360's unified architecture, in a vertex-heavy situation, for example, none of the pipes go idle: all 48 unified pipelines can help with either pixel or vertex shader operations as needed, so efficiency is greatly improved and so is overall performance. When pipelines are forced to go idle because they lack the ability to help another set of pipelines accomplish their task, it's detrimental to performance, and this inefficient manner of operation is how all current GPUs work, including the PS3's RSX: the pipelines go idle because the pixel pipes can't help the vertex pipes accomplish a task, or vice versa. What's even more impressive about this GPU is that it determines by itself how many pipelines to dedicate to vertex or pixel shader operations at any given time; a programmer is NOT needed to handle any of this, as the GPU takes care of it all in the quickest, most efficient way possible.

1080p is not a smart resolution to target in any form this generation, but if 360 developers wanted to get serious about 1080p, then thanks to Xenos they could actually outperform the PS3 at 1080p. (The less efficient GPU always shows its weaknesses against the competition at higher resolutions, so the best way for the RSX to be competitive is to stick to 720p.) In vertex-shader-limited situations the 360's GPU will literally be 6 times faster than the RSX.

With a unified shader architecture things are much more efficient than previous architectures allowed, which is extremely important. The 360's GPU, for example, is 95-99% efficient with 4xAA enabled. With traditional architectures there are design-related roadblocks that prevent such efficiency. To avoid the roadblocks which held back previous hardware, the 360 GPU design team created a complex system of hardware threading inside the chip itself. In this case, each thread is a program associated with the shader arrays. The Xbox 360 GPU can manage and maintain state information for 64 separate threads in hardware. There's a thread buffer inside the chip, and the GPU can switch between threads instantaneously in order to keep the shader arrays busy at all times.
Want to know why Xenos doesn't need as much raw horsepower to outperform something like the X1900XTX or the 7900GTX? It makes up for having less raw horsepower by being efficient enough to actually achieve its advertised performance numbers, which is an impressive feat. The X1900XTX has a peak pixel fillrate of 10.4 gigasamples a second, while the 7900GTX has a peak pixel fillrate of 15.6 gigasamples a second. Neither of them can actually achieve and sustain those peak fillrates, though, due to not being efficient enough; they get away with it because they can also bank on all that raw power. The performance winner between the 7900GTX and the X1900XTX is actually the X1900XTX, despite its lower pixel fillrate (especially at higher resolutions), because it has twice as many pixel pipes and is the more efficient of the 2. It's a testament to how important efficiency is.

So how exactly does the 360's GPU stand up to both of those with only a 128-bit memory interface and 500MHz? With 4xFSAA enabled, the 360's GPU achieves AND sustains its peak fillrate of 16 gigasamples per second. That comes from the combination of the unified shader architecture and the excessive amount of bandwidth, which together give it the kind of efficiency that lets it outperform GPUs with far more raw horsepower. I guess it also helps that it's the single most advanced GPU currently available for purchase.
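For reference, here's how those fillrate figures fall out of commonly published unit counts; the counts are my assumptions, not the article's (16 ROPs at 650MHz for the X1900XTX, 24 pixel pipes at 650MHz for the 7900GTX, and 8 ROPs at 500MHz writing 4 samples per clock for Xenos with 4xAA resolved in the EDRAM):

# Peak fillrate figures quoted above, reproduced from assumed unit counts.
print(16 * 650e6 / 1e9)       # X1900XTX: 10.4 Gsamples/s
print(24 * 650e6 / 1e9)       # 7900GTX:  15.6 Gsamples/s
print(8 * 500e6 * 4 / 1e9)    # Xenos:    16.0 Gsamples/s with 4xAA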
Things get even better when you factor in Xenos' MEMEXPORT ability, which enables "streamout" and opens the door for Xenos to achieve DX10-class functionality. It's a shame Microsoft chose to disable Xenos' other 16 pipelines to improve yields and keep costs down. Not many people are even aware that the 360's GPU has the exact same number of pipelines as ATI's unreleased R600; to keep costs down and make the GPU easier to manufacture, Microsoft chose to disable one of the shader arrays containing 16 pipelines. What MEMEXPORT does is expand the graphics pipeline in a more general purpose and programmable manner. I'll borrow a quote from Dave Baumann, since he explains it rather well.
"With the capability to fetch from anywhere in memory, perform
arbitrary ALU operations and write the results back to memory, in
conjunction with the raw floating point performance of the large shader
ALU array, the MEMEXPORT facility does have the capability to achieve a
wide range of fairly complex and general purpose operations; basically
any operation that can be mapped to a wide SIMD array can be fairly
efficiently achieved and in comparison to previous graphics pipelines
it is achieved in fewer cycles and with lower latencies. For instance,
this is probably the first time that general purpose physics
calculation would be achievable, with a reasonable degree of success,
on a graphics processor and is a big step towards the graphics
processor becoming much more like a vector co-processor to the CPU."
Even with all of this information, there is still a lot more about this GPU that ATI simply isn't revealing, and considering they'll be reusing the technology behind this GPU's design in their future PC products, can you really blame them?
from page 11
Conclusion
Hopefully this article has helped to dispel some rumors surrounding the processing power of these two great consoles and demonstrate some of the differences that give them their unique feel. There have been many attempts on both sides to distort the numbers or misconstrue their importance, but looking at the features as a whole gives us the opportunity to determine how these consoles will perform overall. While both consoles shine in some areas, they do have their softer spots. Ultimately, the good features of each of these consoles outweigh the bad, and the number of high quality games being released this winter will give Sony and Microsoft fans alike a lot to be happy about.