Tony Hill
Hi all
I came across a rather interesting set of numbers a little while back
and was wondering if anyone had any thoughts on the matter. To start
with, here are the benchmarks (part of a Prescott review):
http://techreport.com/reviews/2004q1/p4-prescott/index.x?pg=14
In particular have a look at the BLAS DGEMM numbers for
double-precision floating point. The P4 scores are mostly what you
might expect: compiled C x87 code is fairly slow, assembly code for
x87 is a fair bit faster and SSE2 vectorized code is a lot
faster again.
Starting to look at the Athlon64's performance things are mostly
fairly normal as well. SSE2 vector performance is lower than the
P4's, but given that the Athlon64 and the P4 have the exact same
maximum theoretical performance per clock and the P4 runs at much
higher clock speeds this is to be expected. The Athlon64 shows rather
impressive compiled C code (significantly faster than the P4 here),
but again this isn't too surprising, especially given that the tests
are compiled with Microsoft's Visual.Net compiler rather than Intel's
C compiler.
Where things really get a bit odd though is with the SSE2 scalar code.
The simple expectation would be that SSE2 scalar code should perform
at roughly half the speed of SSE2 vector code, give or take a bit for
memory subsystem issues. However, the numbers are REAL different
here. On the P4 system SSE2 scalar code performs at only about 1/3 of
the SSE2 vector code and it's even slower than x87 C code.
On the Athlon64 it's a TOTALLY different story. Here SSE2 scalar code
performs exactly on-par with the SSE2 vector code. x87 assembly code
also offers essentially identical performance. This doesn't seem to
be a case of the thing being bandwidth limited as the Athlon64 3400+
(64-bit memory interface) is within 3% of the performance of the
equal-clock speed Athlon64 FX 51 (128-bit memory interface). It might
be some sort of cache bandwidth limit, though my understanding of
the test is that the working set is 18MB in size, so that should blow
all caches out of the picture (someone feel free to correct me if I'm
wrong on this one).
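
For anyone not familiar with the scalar/vector distinction, here is a
rough sketch of what I mean, written with the standard SSE2 intrinsics
from <emmintrin.h>. It is only a dot product (the sort of inner loop a
matrix multiply is built from), NOT the actual benchmark code, and the
function names are made up for illustration. The scalar version works
on one double per instruction (the *_sd operations), the vector version
on two doubles per instruction (the *_pd operations), which is where
the "roughly half the speed" expectation comes from.

```c
#include <emmintrin.h>  /* SSE2 intrinsics */
#include <stddef.h>

/* Scalar SSE2: each multiply/add touches only the low double of the
 * 128-bit register (hypothetical helper, for illustration only). */
double dot_sse2_scalar(const double *a, const double *b, size_t n)
{
    __m128d acc = _mm_setzero_pd();
    for (size_t i = 0; i < n; i++) {
        __m128d x = _mm_load_sd(&a[i]);   /* load one double */
        __m128d y = _mm_load_sd(&b[i]);
        acc = _mm_add_sd(acc, _mm_mul_sd(x, y));
    }
    return _mm_cvtsd_f64(acc);            /* extract the low double */
}

/* Packed (vector) SSE2: each multiply/add works on two doubles at
 * once (hypothetical helper, for illustration only). */
double dot_sse2_vector(const double *a, const double *b, size_t n)
{
    __m128d acc = _mm_setzero_pd();
    size_t i = 0;
    for (; i + 2 <= n; i += 2) {
        __m128d x = _mm_loadu_pd(&a[i]);  /* unaligned load of two doubles */
        __m128d y = _mm_loadu_pd(&b[i]);
        acc = _mm_add_pd(acc, _mm_mul_pd(x, y));
    }
    /* Add the two lanes of the accumulator together. */
    __m128d hi = _mm_unpackhi_pd(acc, acc);
    double sum = _mm_cvtsd_f64(_mm_add_sd(acc, hi));
    /* Handle a leftover element when n is odd. */
    for (; i < n; i++)
        sum += a[i] * b[i];
    return sum;
}
```

Both functions return the same result for the same inputs; the only
difference is how many doubles each instruction processes. What the
compiler actually emits may differ from the intrinsics as written.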
So, basically the question for you all to ponder is simply what is the
difference between AMD and Intel's SSE2 scalar implementation? This
could have some VERY interesting implications for one important
reason: In AMD64 Long Mode (64-bit code), AMD has specifically stated
that x87, 3DNow! and MMX are deprecated in favor of SSE2 code.
According to AMD's vision of things, ALL FPU code should always be
handled by the SSE2 unit. Given that the bulk of x86 floating point code
in existence today is scalar code (for better or for worse) that could
mean that SSE2 scalar performance could have a MAJOR impact on a lot
of applications.