AMD SSE2 scalar performance

  • Thread starter: Tony Hill

Tony Hill

Hi all

I came across a rather interesting set of numbers a little while back
and was wondering if anyone had any thoughts on the matter. To start
with, here are the benchmarks (part of a Prescott review):

http://techreport.com/reviews/2004q1/p4-prescott/index.x?pg=14

In particular, have a look at the BLAS DGEMM numbers for
double-precision floating point. The P4 scores are mostly what you
might expect: compiled C x87 code is fairly slow, hand-written x87
assembly is a fair bit faster, and SSE2 vectorized code is a lot
faster again.

Looking at the Athlon64's performance, things are mostly normal as
well. SSE2 vector performance is lower than the P4's, but given that
the Athlon64 and the P4 have the exact same maximum theoretical
performance per clock and the P4 runs at much higher clock speeds,
this is to be expected. The Athlon64 shows rather impressive compiled
C performance (significantly faster than the P4 here), but again this
isn't too surprising, especially given that the tests are compiled
with Microsoft's Visual.Net compiler rather than Intel's C compiler.

Where things really get a bit odd though is with the SSE2 scalar code.
The simple expectation would be that SSE2 scalar code should perform
at roughly half the speed of SSE2 vector code, give or take a bit for
memory subsystem issues. However, the numbers are REALLY different
here. On the P4 system, SSE2 scalar code performs at only about 1/3
the speed of the SSE2 vector code, and it's even slower than compiled
x87 C code.

On the Athlon64 it's a TOTALLY different story. Here SSE2 scalar code
performs exactly on par with the SSE2 vector code. x87 assembly code
also offers essentially identical performance. This doesn't seem to
be a case of being memory-bandwidth limited, as the Athlon64 3400+
(64-bit memory interface) is within 3% of the performance of the
equal-clocked Athlon64 FX-51 (128-bit memory interface). It might be
cache-bandwidth limited in some way, though my understanding of the
test is that the working set is 18MB in size, so that should blow all
caches out of the picture (someone feel free to correct me if I'm
wrong on this one).
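
For reference, the scalar/vector distinction amounts to this (a
minimal sketch in C with SSE2 intrinsics; the example and names are
mine, not the benchmark's):

    #include <emmintrin.h>

    /* Packed (vector): one ADDPD performs two double-precision adds
       across the full 128-bit register. */
    __m128d add_packed(__m128d a, __m128d b)
    {
        return _mm_add_pd(a, b);
    }

    /* Scalar: one ADDSD performs a single add on the low 64 bits,
       which is where the naive "half the speed of vector" expectation
       comes from. */
    __m128d add_scalar(__m128d a, __m128d b)
    {
        return _mm_add_sd(a, b);
    }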


So, basically the question for you all to ponder is simply: what is
the difference between AMD's and Intel's SSE2 scalar implementations?
This could have some VERY interesting implications for one important
reason: in AMD64 Long Mode (64-bit code), AMD has specifically stated
that x87, 3DNow! and MMX are deprecated in favor of SSE2 code.
According to AMD's vision of things, ALL FPU code should always be
handled by the SSE2 unit. Given that the bulk of x86 floating point
code in existence today is scalar code (for better or for worse),
that could mean that SSE2 scalar performance could have a MAJOR
impact on a lot of applications.
 
Tony Hill said:
So, basically the question for you all to ponder is simply: what is
the difference between AMD's and Intel's SSE2 scalar implementations?
This could have some VERY interesting implications for one important
reason: in AMD64 Long Mode (64-bit code), AMD has specifically stated
that x87, 3DNow! and MMX are deprecated in favor of SSE2 code.
According to AMD's vision of things, ALL FPU code should always be
handled by the SSE2 unit. Given that the bulk of x86 floating point
code in existence today is scalar code (for better or for worse),
that could mean that SSE2 scalar performance could have a MAJOR
impact on a lot of applications.

It's likely that in AMD's floating point implementation, all floating
point calcs (regardless of whether it is x87, 3DNow! or SSE) go
through the same pipeline. Whereas in the P4, with its funky micro-op
conversion mechanism, each type of instruction goes through a
separate pipeline for at least part of its journey.

Yousuf Khan
 
It's likely that in AMD's floating point implementation, all floating
point calcs (regardless of whether it is x87, 3DNow! or SSE) go
through the same pipeline. Whereas in the P4, with its funky micro-op
conversion mechanism, each type of instruction goes through a
separate pipeline for at least part of its journey.

I did a bit more research and found the following on Ace's Hardware
message forum, posted by Gipsel:

<quoting>
The Athlon uses the FPU units for SSE2, which is the reason the
performance is the same with vector and scalar instructions (and
theoretically also x87): you can always do 2 FLOPs per cycle. You
have to realize the Athlon64 can't execute vector SSE2 directly. A
vector SSE2 instruction is broken down into 2 scalar MacroOps.
The P4 is different: the core has only one issue port for FP
calculations (the Athlon has two). This port has to be shared between
all x87 and SSE/2 instructions. That means you have different limits
for scalar SSE2/x87 and vector SSE2 instructions: 1 and 2 FLOPs per
cycle, respectively.
<end quote>
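
To make that concrete (a hedged sketch; my own example, not anything
from the benchmark): on the K8, these two DAXPY inner loops should
run at essentially the same speed, because each packed instruction
gets cracked into the same two 64-bit MacroOps that the scalar loop
issues explicitly.

    #include <emmintrin.h>

    /* y += alpha * x, packed form: one MULPD + one ADDPD per pair of
       elements, each cracked into two 64-bit MacroOps on the
       Athlon 64.  Assumes n is even. */
    void daxpy_packed(int n, double alpha, const double *x, double *y)
    {
        __m128d va = _mm_set1_pd(alpha);
        for (int i = 0; i < n; i += 2) {
            __m128d vy = _mm_loadu_pd(&y[i]);
            __m128d vx = _mm_loadu_pd(&x[i]);
            _mm_storeu_pd(&y[i], _mm_add_pd(vy, _mm_mul_pd(va, vx)));
        }
    }

    /* The same work in scalar form: two MULSDs + two ADDSDs per pair
       when compiled to SSE2 scalar code. */
    void daxpy_scalar(int n, double alpha, const double *x, double *y)
    {
        for (int i = 0; i < n; i++)
            y[i] += alpha * x[i];
    }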

So basically yes, what you were saying is more or less right on. This
does explain some things, in particular how AMD's SSE2 scalar and
SSE2 vector code should offer pretty similar performance in a
BLAS-type application. It still doesn't quite explain the P4's
relatively poor scalar SSE2 performance. From the above, its scalar
SSE2 BLAS algorithm should perform at roughly half the speed of its
SSE2 vector algorithm. However, in this test it ended up that the
vector code was at least 2.7 times faster on the Northwood and up to
3.3 times faster on the Prescott.

Anyway, I guess the big question will end up being whether or not
x86-64 long mode ends up ONLY using SSE2 for floating point or not.
 
Tony Hill said:
So basically yes, what you were saying is more or less right on. This
does explain some things, in particular how AMD's SSE2 scalar and
SSE2 vector code should offer pretty similar performance in a
BLAS-type application. It still doesn't quite explain the P4's
relatively poor scalar SSE2 performance. From the above, its scalar
SSE2 BLAS algorithm should perform at roughly half the speed of its
SSE2 vector algorithm. However, in this test it ended up that the
vector code was at least 2.7 times faster on the Northwood and up to
3.3 times faster on the Prescott.

Sounds like it has something to do with the pipeline stages in the P4.
Namely, there's likely a stage where all SSE vector operations are
pipelined together into a single continuous sequence: not just one
vector, but as many vectors as come in a row. A Northwood could only
have a small number of these vector operations in flight at once,
whereas Prescott, with its roughly 50% longer pipeline, may be able
to string together a larger number of the operations in a row.
Anyway, I guess the big question will end up being whether or not
x86-64 long mode ends up ONLY using SSE2 for floating point or not.

Sounds like Microsoft's OS is turning off the x87 unit, allowing only
SSE through. The Linux boys seem not to have enforced this, and so
they allow everything through. Not sure why Microsoft is doing this,
as the instructions to save and restore everything from x87 through
to SSE3 state are the same FXSAVE and FXRSTOR commands. It doesn't
take any more instructions to save x87 state than it does to save SSE
state.
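
(For reference, and going by the manuals rather than anything in this
thread: FXSAVE/FXRSTOR move the whole x87 + MMX + XMM state through
one 512-byte, 16-byte-aligned block, so a context switch pays the
same either way. A minimal sketch using GCC's intrinsics:)

    #include <immintrin.h>   /* _fxsave / _fxrstor; GCC needs -mfxsr */

    /* One 512-byte, 16-byte-aligned save area covers the x87 stack,
       the MMX aliases and all the XMM registers at once. */
    static unsigned char fpstate[512] __attribute__((aligned(16)));

    void save_fp_state(void)    { _fxsave(fpstate); }
    void restore_fp_state(void) { _fxrstor(fpstate); }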

Anyways, I'll be out of town for the next month. I'll try to keep an
eye on the discussions here off and on, but if I don't, then I'll see
you guys after next month.

Yousuf Khan
 
Sounds like it has something to do with the pipeline stages in the P4.
Namely, there's likely a stage where all SSE vector operations are
pipelined together into a single continuous sequence: not just one
vector, but as many vectors as come in a row. A Northwood could only
have a small number of these vector operations in flight at once,
whereas Prescott, with its roughly 50% longer pipeline, may be able
to string together a larger number of the operations in a row.

That could help explain why the Prescott did better than the Northwood
in SSE2 vector operations, but it doesn't explain at all why it does
WORSE on SSE2 scalar operations.

I did a quick search through the Intel optimization guide but didn't
come up with much. The only thing that struck me as possible is that
in SSE2 vector code they might be doing everything as add -> multiply
-> add -> multiply, etc., while in SSE2 scalar they are doing add ->
add -> multiply -> multiply. This seems like a fairly trivial
optimization, though, and the author of the benchmark (Tim Wilkens)
seems like a fairly smart cookie, so I would guess that he would have
thought of this.
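
Purely to illustrate the ordering I mean (this is a guess at the code
shape, not Wilkens' actual kernel; the names are mine):

    #include <emmintrin.h>

    /* (a) Interleaved: multiply -> add -> multiply -> add.  The P4
       sees alternating op types for its FP units.  Assumes n is
       even. */
    double dot_interleaved(const double *a, const double *b, int n)
    {
        __m128d s0 = _mm_setzero_pd(), s1 = _mm_setzero_pd();
        for (int i = 0; i < n; i += 2) {
            s0 = _mm_add_sd(s0, _mm_mul_sd(_mm_load_sd(&a[i]),
                                           _mm_load_sd(&b[i])));
            s1 = _mm_add_sd(s1, _mm_mul_sd(_mm_load_sd(&a[i+1]),
                                           _mm_load_sd(&b[i+1])));
        }
        return _mm_cvtsd_f64(_mm_add_sd(s0, s1));
    }

    /* (b) Grouped: both multiplies first, then both adds, so the
       same op types arrive back to back at the same issue port. */
    double dot_grouped(const double *a, const double *b, int n)
    {
        __m128d s0 = _mm_setzero_pd(), s1 = _mm_setzero_pd();
        for (int i = 0; i < n; i += 2) {
            __m128d p0 = _mm_mul_sd(_mm_load_sd(&a[i]),
                                    _mm_load_sd(&b[i]));
            __m128d p1 = _mm_mul_sd(_mm_load_sd(&a[i+1]),
                                    _mm_load_sd(&b[i+1]));
            s0 = _mm_add_sd(s0, p0);
            s1 = _mm_add_sd(s1, p1);
        }
        return _mm_cvtsd_f64(_mm_add_sd(s0, s1));
    }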

Everything else in Intel's optimization guides seems to point to SSE2
scalar being able to run at half the speed of SSE2 vector for this
sort of application, and at the very least it should be faster than
compiled x87 code. Several times in their guides they suggest using
SSE2 scalar code instead of x87 code, as the former should be faster
unless you need specific x87 functionality (which a basic BLAS
implementation wouldn't, AFAIK).
Sounds like Microsoft's OS is turning off the x87 unit, allowing only
SSE through. The Linux boys seem not to have enforced this, and so
they allow everything through. Not sure why Microsoft is doing this,
as the instructions to save and restore everything from x87 through
to SSE3 state are the same FXSAVE and FXRSTOR commands. It doesn't
take any more instructions to save x87 state than it does to save SSE
state.

I don't know quite what's going on here. I also wonder if Intel's new
x86-64 implementation will throw a monkey wrench into any of this. As
best I can tell, Intel is NOT specifying that SSE2 is the one true
floating point unit in 64-bit mode. In fact, they don't seem to make
ANY mention of this possibility at all.
Anyways, I'll be out of town for the next month. I'll try to keep an
eye on the discussions here off and on, but if I don't, then I'll see
you guys after next month.

Have a good trip!
 
Tony Hill said:
That could help explain why the Prescott did better than the Northwood
in SSE2 vector operations, but it doesn't explain at all why it does
WORSE on SSE2 scalar operations.

I did a quick search through the Intel optimization guide but didn't
come up with much. The only thing that struck me as possible is that
in SSE2 vector code they might be doing everything as add -> multiply
-> add -> multiply, etc., while in SSE2 scalar they are doing add ->
add -> multiply -> multiply. This seems like a fairly trivial
optimization, though, and the author of the benchmark (Tim Wilkens)
seems like a fairly smart cookie, so I would guess that he would have
thought of this.

If vector operations are just multiple back-to-back scalar operations
(or rather, if you prefer, if scalar operations are just
one-dimensional vector operations), then a lot of micro-ops can
remain in flight when doing vector rather than scalar. So it does you
good to have a lot of instructions in flight on any P4, but even more
so on Prescott with its longer pipeline.
I don't know quite what's going on here. I also wonder if Intel's new
x86-64 implementation will throw a monkey wrench into any of this. As
best I can tell, Intel is NOT specifying that SSE2 is the one true
floating point unit in 64-bit mode. In fact, they don't seem to make
ANY mention of this possibility at all.

Intel isn't saying it, but Microsoft is. AMD's x87 was great, but
Intel's wasn't. AMD's SSE was great, and so was Intel's. So the lowest
common denominator rules here. What language runs great on both
machines?
Have a good trip!

Thanks, already am. It's warmer than Canada here. Greetings from
Bangladesh.

Yousuf Khan
 
If vector operations are just multiple back-to-back scalar operations
(or rather, if you prefer, if scalar operations are just
one-dimensional vector operations), then a lot of micro-ops can
remain in flight when doing vector rather than scalar. So it does you
good to have a lot of instructions in flight on any P4, but even more
so on Prescott with its longer pipeline.

Regardless of whether you're using vectors or scalars, BLAS should
always have a fairly constant stream of FP adds and FP mults, at
least according to my understanding of the benchmark. The test is
just a large 2D matrix multiplication, so things like branches should
be more or less non-existent.
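
A minimal sketch of the kind of kernel I mean (plain C, not Wilkens'
actual code):

    /* Naive DGEMM-style kernel: C += A*B for n x n row-major
       matrices.  No data-dependent branches anywhere, just a steady
       stream of FP multiplies and adds. */
    void dgemm_naive(int n, const double *A, const double *B,
                     double *C)
    {
        for (int i = 0; i < n; i++)
            for (int j = 0; j < n; j++) {
                double sum = C[i*n + j];
                for (int k = 0; k < n; k++)
                    sum += A[i*n + k] * B[k*n + j];
                C[i*n + j] = sum;
            }
    }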

I suppose it could be an issue with getting the data from memory and
this is causing more stalls in scalar mode than vector mode. I dunno.
Intel isn't saying it, but Microsoft is.

Microsoft's line is directly in line with what AMD is saying, though;
Intel is the odd man out here. Of course, as is usually the case, I
would imagine that MS will end up with the last word here.
AMD's x87 was great, but
Intel's wasn't. AMD's SSE was great, and so was Intel's. So the lowest
common denominator rules here. What language runs great on both
machines?

Ahh, but that's the question I was getting at here. It seems that
Intel's SSE2 implementation is NOT great, at least not when dealing
with scalar operations.
Thanks, already am. It's warmer than Canada here. Greetings from
Bangladesh.

Sounds nice! It has been reasonably warm here in Ottawa though...
mind you, it's also been cloudy and rainy. Reminds me of last winter
when I was in Ireland :>
 
Tony Hill said:
Regardless of whether you're using vectors or scalars, BLAS should
always have a fairly constant stream of FP adds and FP mults, at
least according to my understanding of the benchmark. The test is
just a large 2D matrix multiplication, so things like branches should
be more or less non-existent.

I suppose it could be an issue with getting the data from memory and
this is causing more stalls in scalar mode than vector mode. I dunno.

It's likely that the Intel implementation is always operating on the
full 128-bit register width whether it is using scalars or vectors, so
if you use scalars half of the register is wasted.
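
That fits how the scalar instructions are actually defined, as far as
I can tell from the manuals: they occupy a full 128-bit XMM register
but only compute on the low half, with the upper half just passing
through. A small sketch (my own illustration):

    #include <emmintrin.h>

    /* ADDSD only computes on the low 64 bits; the upper 64 bits of
       the destination pass through unchanged.  Half of every 128-bit
       register is dead weight in scalar code. */
    __m128d scalar_add(__m128d a, __m128d b)
    {
        /* result = { high: a[1] (untouched), low: a[0] + b[0] } */
        return _mm_add_sd(a, b);
    }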
Microsoft's line is directly along with what AMD is saying as well
though, Intel is the odd-man out here. Of course, as is usually the
case, I would imagine that MS will end up with the last word here.

One would think that Microsoft is trying to aid Intel here by
preferring SSE over x87. I certainly can't see AMD objecting one way
or the other if Microsoft had decided to prefer x87 over SSE -- it's
got an answer for either front.
Ahh, but that's the question I was getting at here. It seems that
Intel's SSE2 implementation is NOT great, at least not when dealing
with scalar operations.

Well, okay, Intel's SSE implementation isn't consistently good, but
it's still better than its x87 implementation.

Yousuf Khan
 
It's likely that the Intel implementation is always operating on the
full 128-bit register width whether it is using scalars or vectors, so
if you use scalars half of the register is wasted.

Yup, that is the case for AMD's implementation as well. SSE/SSE2 will
always tend to operate better with vector operations than with scalar
stuff as long as your code is decently written for both.
One would think that Microsoft is trying to aid Intel here by
preferring SSE over x87. I certainly can't see AMD objecting one way
or the other if Microsoft had decided to prefer x87 over SSE -- it's
got an answer for either front.

Well if this one BLAS test is of any indication, AMD's SSE2 scalar
performance is head and shoulders ahead of Intel's.
Well, okay, Intel's SSE implementation isn't consistently good, but
it's still better than its x87 implementation.

In this particular test it wasn't as good as x87, and that's the
problem. The SSE2 vector code was great and super-fast, but its SSE2
scalar code was quite slow, slower than even compiled x87 code and
definitely slower than x87-optimized assembly code. I can see no good
reason for this to be the case, and in fact Intel does say in all of
their optimization guides that SSE2 scalar code SHOULD be faster than
x87 code. In any case, it may simply be that this test is a bit of an
anomaly.
 