ATI XENOS (X360 GPU) summary

Summary by GameMaster at Teamxbox forums
http://forum.teamxbox.com/showthread.php?p=5465848&highlight=anisotropic#post5465848


*XENOS is a "Split Processor" GPU, meaning that it is actually 2 GPU cores
that are packaged together. The "Parent" GPU handles the majority of shader
tasks and acts as the "North Bridge" for the system, among other things. The
"Daughter" GPU is directly linked to the "Parent" GPU and is the module that
has the 10MB of eDRAM. There is a considerable amount of additional logic on
the "Daughter" GPU that will process a number of things such as HDR,
4xMSAA/FSAA, Z Buffer (Depth), Alpha Buffer (Transparency), Stencil Buffer
(Shadows), Occlusion Culling (Removing unseen polygons), Radiosity Lighting
(such as Global Illumination), Real Time LOD (Level of Detail/Tessellation),
and something that ATI refers to as "Fluid Reality", which is basically
material physics such as hair, clothing, and water. All of that is done
without burdening the "Parent" GPU, and it saves memory bandwidth at the
same time since these tasks can be performed on the eDRAM.

*XENOS's Parent GPU has 232 million transistors and the Daughter GPU has 150
million transistors (80 million of those are for the eDRAM), for a grand
total of around 382 million transistors. XENOS's "Parent" GPU is
manufactured by TSMC using their 90nm manufacturing process and the
"Daughter" GPU is manufactured by NEC using their 90nm manufacturing
process. The "Split Processor" design allows XENOS to improve yield during
manufacturing and also helps with heat output/power consumption issues.

*XENOS uses deferred tile based rendering (some of you would be familiar
with this as the Dreamcast used this rendering technique). The frame is
split into tiles that are rendered in the eDRAM one at a time, which is how
it will be able to process high resolution displays with 4xFSAA active.
There are also some additional performance enhancing technologies that will
take advantage of the tile based rendering.
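
To put a rough number on why tiling is needed, here is a quick
back-of-envelope sketch. The 10MB of eDRAM comes from the summary above; the
4 bytes of colour and 4 bytes of depth/stencil per sample are my own
assumptions (a common 32-bit setup), and tiles_needed is just a made-up
helper name, not anything from ATI.

```python
import math

EDRAM_BYTES = 10 * 1024 * 1024  # 10MB of eDRAM, as quoted in the post


def tiles_needed(width, height, msaa_samples, color_bytes=4, depth_bytes=4):
    """Number of eDRAM-sized tiles needed to hold one frame.

    color_bytes and depth_bytes are assumptions (32-bit colour and
    32-bit depth/stencil per sample), not figures from the post.
    """
    bytes_per_sample = color_bytes + depth_bytes
    frame_bytes = width * height * msaa_samples * bytes_per_sample
    return math.ceil(frame_bytes / EDRAM_BYTES)


if __name__ == "__main__":
    for samples in (1, 2, 4):
        print(f"720p with {samples}x AA:", tiles_needed(1280, 720, samples), "tile(s)")
```

Under those assumptions a 720p frame with 4x AA is bigger than the eDRAM, so
it has to be rendered in more than one tile.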

*XENOS contains 16 texture fetch and 16 vertex fetch units. Each of the
texture units can do one bilinear sample per clock, and when trilinear or
anisotropic filtering is used, each unit will loop through multiple samples
until the target sampling and filtering level is complete (Basically this
means there is less performance loss when you are using trilinear or
anisotropic texture filtering). This is done OUTSIDE of the shader units,
which improves performance because the ALUs are not tied up with filtering.
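
A toy model of the "looping" idea, in case it helps. The only number taken
from the summary is the 16 texture units doing one bilinear sample per
clock. The loop counts (2 bilinear passes for trilinear, up to 2 x N for N:1
anisotropic) are the usual textbook worst cases and are my assumption; real
hardware picks the number of samples adaptively.

```python
def bilinear_loops(mode, max_aniso=1):
    """Worst-case bilinear passes one texture unit makes for one fetch.

    The per-mode costs are textbook assumptions, not ATI figures.
    """
    if mode == "bilinear":
        return 1
    if mode == "trilinear":
        return 2  # one bilinear sample from each of two mip levels
    if mode == "anisotropic":
        return 2 * max_aniso  # up to max_aniso trilinear probes
    raise ValueError(f"unknown filter mode: {mode}")


def peak_fetches_per_clock(mode, max_aniso=1, units=16):
    """Filtered texture results per clock across all units, worst case."""
    return units / bilinear_loops(mode, max_aniso)


if __name__ == "__main__":
    print("bilinear :", peak_fetches_per_clock("bilinear"))
    print("trilinear:", peak_fetches_per_clock("trilinear"))
    print("4:1 aniso:", peak_fetches_per_clock("anisotropic", max_aniso=4))
```

The point is that the extra cost shows up as more clocks inside the texture
unit, not as extra instructions in the shader.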

*XENOS is capable of processing 64 threads simultaneously. This is to make
sure that all elements are being utilized so there is minimal or no stalling
of the graphics architecture. So even if a thread is waiting for a texture
sample to come back, that thread would not stall the ALU, as the ALU would
be working on something else from another thread. This effectively hides
tasks that would normally have a large latency penalty attached to them. ATI
suggests that their testing achieves an average of 95% efficiency of the
shader array in general purpose graphics usage conditions. The throughput is
said to be two loops, two texture instructions, 6 ALU instructions, per
pixel, per cycle at XENOS's peak fill rate.
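
Here is a very small simulation of the latency hiding idea. It is only a
sketch: one ALU, every thread does a few cycles of math and then waits on a
texture fetch. The 4 cycles of math and the 20 cycle fetch latency are
numbers I made up for illustration, and alu_utilization is a hypothetical
helper, not how ATI's scheduler actually works.

```python
def alu_utilization(num_threads, alu_cycles=4, fetch_latency=20,
                    total_cycles=10_000):
    """Fraction of cycles the single ALU is busy.

    Each thread runs alu_cycles of math, then sleeps for fetch_latency
    cycles waiting on a texture fetch. All timings are made-up values.
    """
    ready_at = [0] * num_threads            # cycle when each thread may run
    work_left = [alu_cycles] * num_threads  # math cycles before next fetch
    busy = 0
    for cycle in range(total_cycles):
        for t in range(num_threads):
            if ready_at[t] <= cycle:
                # Run the first ready thread for this cycle.
                busy += 1
                work_left[t] -= 1
                if work_left[t] == 0:
                    # Thread issues its fetch and sleeps until it returns.
                    ready_at[t] = cycle + 1 + fetch_latency
                    work_left[t] = alu_cycles
                break
    return busy / total_cycles


if __name__ == "__main__":
    for n in (1, 3, 6, 64):
        print(f"{n:2d} thread(s): {alu_utilization(n):.2%} busy")
```

With a single thread the ALU sits idle most of the time. Once there are
enough threads to cover the fetch latency, it never has to wait.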

*XENOS has 48 ALUs, grouped into 3 SIMD arrays of 16 ALUs each. Each ALU can
co-issue a Vector4 and a scalar instruction simultaneously, essentially a
"5D" operation per cycle (basically 2 Vec4 and 2 scalar instructions per
cycle per ALU). The ALUs process everything in FP32 precision with no
internal partial precision requirements for FP16. Additionally, each of the
48 ALUs contains additional logic that performs all the pixel shader
interpolation calculations. ATI suggests that this basically equates to an
extra 33% pixel shader computational capacity.
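
The peak numbers are easy to work out from the figures above. In this sketch
the 48 ALUs and the Vec4 + scalar co-issue come from the summary; the 500 MHz
clock and counting each component operation as a multiply-add (2 flops) are
my assumptions, so treat the GFLOPS figure as a paper maximum only.

```python
ALUS = 48          # from the post
VEC_LANES = 4      # Vector4 part of the co-issue
SCALAR_LANES = 1   # scalar part of the co-issue


def component_ops_per_cycle():
    """Component operations issued per clock across the whole array."""
    return ALUS * (VEC_LANES + SCALAR_LANES)


def peak_gflops(clock_mhz=500, flops_per_op=2):
    """Paper peak in GFLOPS.

    clock_mhz and flops_per_op (multiply-add = 2 flops) are assumptions,
    not figures from the post.
    """
    return component_ops_per_cycle() * flops_per_op * clock_mhz / 1000


if __name__ == "__main__":
    print("component ops per cycle:", component_ops_per_cycle())
    print("peak GFLOPS (assumed clock):", peak_gflops())
```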

*Developers can choose to allow XENOS to automatically handle load balancing
of the ALUs for their applications or take direct control of the ALUs. The
load balancing is based on an algorithm that sets the priority of the vertex
and pixel shader programs. ATI believes that the algorithm gives
near-optimal throughput and expects only a few developers to actually look
into changing the weightings of the algorithm. They also state that there
will never be an unused shader array or texture sampler if there are threads
available to use it.

*XENOS capabilities... 4K instruction slots (shared between VS and PS),
greater than 500K maximum number of instructions executed, has instruction
prediction, 64 temporary registers, 512 constant registers (shared between
VS and PS), has static flow control, has dynamic flow control, has a dynamic
flow control depth of 4 or 2^23 if nesting, has vertex texture fetch
(dependent fetches and all formats), 32 surface shared pool where a texture
consumes 1 entry and a vertex stream consumes 1/3 of an entry so a maximum
of 32 texture or 96 vertex, has geometry instancing, has no dependent
texture limits or texture instruction limits, has position registers, has 16
interpolated registers, has arbitrary swizzling, has gradient instructions,
has loop count registers, and has face registers (2 sided lighting). What
does all that mean? Don't ask... it would take too long to describe
everything, but all this does mean it EXCEEDS VS3.0 and PS3.0
specifications.
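
The shared surface pool is the one item in that list that is easy to show
with a bit of arithmetic. The 32 entries, 1 entry per texture and 1/3 entry
per vertex stream are from the summary; the function names are just mine for
illustration.

```python
from fractions import Fraction

POOL_ENTRIES = 32               # shared surface pool size, from the post
TEXTURE_COST = Fraction(1)      # each texture uses one entry
VERTEX_COST = Fraction(1, 3)    # each vertex stream uses a third of an entry


def fits_in_pool(textures, vertex_streams):
    """True if this mix of textures and vertex streams fits in the pool."""
    used = textures * TEXTURE_COST + vertex_streams * VERTEX_COST
    return used <= POOL_ENTRIES


def max_vertex_streams(textures):
    """Vertex streams that still fit once `textures` entries are taken."""
    free = POOL_ENTRIES - textures * TEXTURE_COST
    return max(0, int(free / VERTEX_COST))


if __name__ == "__main__":
    print("no textures ->", max_vertex_streams(0), "vertex streams")
    print("16 textures ->", max_vertex_streams(16), "vertex streams")
    print("32 textures + 1 vertex stream fits?", fits_in_pool(32, 1))
```

Fractions are used instead of floats so the 1/3 entry cost adds up exactly.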

*XENOS has something called "MEMEXPORT" which will be important for shader
programs that exceed 4000 instructions, but that is only the start of this
particular beauty. It would take me too long to describe this feature in
this post, but developers will absolutely love this feature...

*XENOS is capable of processing a displacement map in a single pass (this
basically gives free additional geometry for the object).
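
For anyone who has not seen displacement mapping before, this is the generic
idea in a few lines: every vertex is pushed along its normal by a height
read from the map. This is plain CPU-side Python to show the math, not XENOS
shader code, and the names (displace, scale) are my own.

```python
def displace(vertices, normals, heights, scale=1.0):
    """Move each vertex along its normal by its sampled height.

    vertices and normals are lists of (x, y, z) tuples; heights holds
    one displacement-map sample per vertex. Generic technique only.
    """
    displaced = []
    for (x, y, z), (nx, ny, nz), h in zip(vertices, normals, heights):
        offset = h * scale
        displaced.append((x + nx * offset, y + ny * offset, z + nz * offset))
    return displaced


if __name__ == "__main__":
    flat_quad = [(0, 0, 0), (1, 0, 0), (1, 0, 1), (0, 0, 1)]
    up = [(0, 1, 0)] * 4
    bumps = [0.0, 0.5, 1.0, 0.25]
    print(displace(flat_quad, up, bumps, scale=2.0))
```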

....and a lot more than the 10 item limit that was requested by the earlier
poster. Bottom line, XENOS is both POWERFUL and EFFICIENT... now what
happens when you combine something that is powerful AND efficient? More
later...
 