How AMD will take on Intel Woodcrest: twice the FPUs

Yousuf Khan · Feb 24, 2006

AMD has a response to Intel Woodcrest server chips
"The first of these that we have heard about is the server variant, and
it will be a killer. It has 2x the floating point units, and sources
tell us that it will push about 1.5x the floating point performance of
the current chips in the real world."
http://www.theinquirer.net/?article=29890

nobody · Feb 24, 2006

AMD has a response to Intel Woodcrest server chips
"The first of these that we have heard about is the server variant, and
it will be a killer. It has 2x the floating point units, and sources
tell us that it will push about 1.5x the floating point performance of
the current chips in the real world."
http://www.theinquirer.net/?article=29890

It all looks fine and well except for one thing. What do servers
need a killer FPU for? AFAIK, traditional server tasks such as
database, web, mail, etc. have very little dependence on floating point.
Say "workstation", and the FPU is what matters most, but server...
Am I missing something?

NNN

George Macdonald · Feb 24, 2006

AMD has a response to Intel Woodcrest server chips
"The first of these that we have heard about is the server variant, and
it will be a killer. It has 2x the floating point units, and sources
tell us that it will push about 1.5x the floating point performance of
the current chips in the real world."
http://www.theinquirer.net/?article=29890

I assume that means 2xSSE3 units. I'm not sure how that really improves
the traditional server market - e.g. file serving, database/TP etc. don't
do floating point per se and I've not heard of SSEx being a big play
there... though that could just be my ignorance showing. :-)

There *are* high expectations in some quarters just now for application
servers to (finally) take off, which could in some apps benefit from
improved FP and SSEx, but I still see that as somewhat speculative.

I really think AMD needs more than 2xFP for a umm, "killer".

David Kanter · Feb 25, 2006

AMD has a response to Intel Woodcrest server chips

I assume that means 2xSSE3 units. I'm not sure how that really improves
the traditional server market - e.g. file serving, database/TP etc. don't
do floating point per se and I've not heard of SSEx being a big play
there... though that could just be my ignorance showing.

AMD AFAIK doesn't have SSEn units for any n. They just decode SSE ops
into scalar instructions and execute them on the traditional FPUs.

There *are* high expectations in some quarters just now for application
servers to (finally) take off, which could in some apps benefit from
improved FP and SSEx, but I still see that as somewhat speculative.

I really think AMD needs more than 2xFP for a umm, "killer".

Yup. It will help for HPC stuff though, where K8 is already quite
popular.

DK

The little lost angel · Feb 25, 2006

It all looks fine and well except for one thing. What do servers
need a killer FPU for? AFAIK, traditional server tasks such as
database, web, mail, etc. have very little dependence on floating point.
Say "workstation", and the FPU is what matters most, but server...
Am I missing something?

Assuming the report is reliable, I think you're missing the Marketing.

There are bound to be some hardcore gamers who will somehow put these
into gaming machines and throw up huge benchmarks putting AMD way
ahead of any Intel x86/64 equivalent at that time. Even though it does
not directly relate to server application, I'm quite sure it will have
some influence.

Also I'm not sure but wouldn't the FPU/SSE units be useful if
everything's being encrypted/SSLed?

Grumble · Feb 25, 2006

The little lost angel said:
Also I'm not sure but wouldn't the FPU/SSE units be useful if
everything's being encrypted/SSLed?

AFAIU, encryption involves mostly integer arithmetic.

Grumble · Feb 25, 2006

George said:
I assume that means 2xSSE3 units.

The K8 optimization manual states:

"Future processors with more or wider multipliers and adders will
achieve better throughput using SSE and SSE2 instructions. (Today's
processors implement a 128-bit-wide SSE or SSE2 operation as two
64-bit operations that are internally pipelined.)

The SIMD instructions provide a theoretical single-precision peak
throughput of two additions and two multiplications per clock cycle,
whereas x87 instructions can only sustain one addition and one
multiplication per clock cycle. The SSE2 and x87 double-precision
peak throughput is the same, but SSE2 instructions provide better
code density."

Maybe AMD plans to boost the throughput of 128-bit operations?
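
For what it's worth, the manual's numbers fall out of a very simple model. This is only a sketch: it assumes one adder pipe and one multiplier pipe, each 64 bits wide today, and the function name and the 128-bit case are my own illustration, not anything AMD has stated.

```python
def peak_flops_per_cycle(datapath_bits, element_bits, pipes=2):
    """Peak FLOPs per clock for a toy FP core.

    datapath_bits: width of each FP pipe (assumed 64 today, per the manual quote)
    element_bits:  32 for single precision, 64 for double
    pipes:         assumed one adder plus one multiplier
    """
    lanes = datapath_bits // element_bits  # elements handled per pipe per clock
    return lanes * pipes

# Today's core as the manual describes it: 128-bit ops split into two 64-bit halves.
print("64-bit pipes, single:", peak_flops_per_cycle(64, 32))   # 2 adds + 2 muls
print("64-bit pipes, double:", peak_flops_per_cycle(64, 64))   # 1 add + 1 mul, same as x87

# Hypothetical "wider multipliers and adders" case: full 128-bit pipes.
print("128-bit pipes, single:", peak_flops_per_cycle(128, 32))
print("128-bit pipes, double:", peak_flops_per_cycle(128, 64))
```

Under those assumptions, widening the pipes to 128 bits doubles the peak in both precisions, which would fit the "2x the floating point units" wording.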

George Macdonald · Feb 25, 2006

AMD AFAIK doesn't have SSEn units for any n. They just decode SSE ops
into scalar instructions and execute them on the traditional FPUs.

Hmm, "traditional" is a little misplaced don't you think? How things work
internally is not the important thing here but whether the SSEn, in
particular 2xFP, is a benefit in servers. OTOH, since it's the umm
Inquirer, it could also be that 2xFPs means that AMD will expand its
internal SSE FP ops out to a full effective 128bits.

Yousuf Khan · Feb 26, 2006

It all looks fine and well except for one thing. What do servers
need a killer FPU for? AFAIK, traditional server tasks such as
database, web, mail, etc. have very little dependence on floating point.
Say "workstation", and the FPU is what matters most, but server...
Am I missing something?

Are you forgetting render farms? Those are all servers.

Yousuf Khan

Yousuf Khan · Feb 26, 2006

David said:
AMD AFAIK doesn't have SSEn units for any n. They just decode SSE ops
into scalar instructions and execute them on the traditional FPUs.

I'm surprised everybody is taking the word "FPU" so literally here. It's
obviously not referring to the traditional high-level x87 FPU, it's
referring to the low-level FPU here. You know FPU as in the counterpart
to the ALU? Low-level sections of the whole CPU. The low-level FPU would
be common to all of them: SSE, x87, 3DNow!, etc.

These days everything in the x86 world is just an interface to more
intricate structures below.

Yousuf Khan

David Kanter · Feb 26, 2006

Yousuf said:
I'm surprised everybody is taking the word "FPU" so literally here. It's
obviously not referring to the traditional high-level x87 FPU, it's
referring to the low-level FPU here. You know FPU as in the counterpart
to the ALU? Low-level sections of the whole CPU. The low-level FPU would
be common to all of them: SSE, x87, 3DNow!, etc.

Actually a vector unit would be rather different from an FPU.
See...one natively executes vector instructions, with vector data, the
other FP operations on scalar data.

These days everything in the x86 world is just an interface to more
intricate structures below.

So what? SSE units are different from FPUs are different from ALUs.

There would be a big difference. Try and think about how many FLOPs a
chip would have with 4 SSEn units...
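
To put rough numbers on that, here is a back-of-the-envelope sketch. It assumes each hypothetical SSEn unit is a full 128 bits wide and finishes one operation per lane per clock, while a scalar FPU finishes one operation per clock; none of these figures describe a real or announced chip.

```python
def peak_flops_per_cycle(units, lanes_per_unit, ops_per_lane=1):
    """Peak FLOPs per clock: units x lanes per unit x ops per lane."""
    return units * lanes_per_unit * ops_per_lane

SP_BITS = 32       # single-precision element width
VECTOR_BITS = 128  # assumed native width of a hypothetical SSEn unit

# Four scalar FPUs: one element per unit per clock.
scalar = peak_flops_per_cycle(units=4, lanes_per_unit=1)

# Four native vector units: 128 / 32 = 4 single-precision lanes each.
vector = peak_flops_per_cycle(units=4, lanes_per_unit=VECTOR_BITS // SP_BITS)

print("4 scalar FPUs:", scalar, "FLOPs/clock")
print("4 SSEn units: ", vector, "FLOPs/clock")
```

Same unit count, four times the single-precision peak, provided the code is actually vectorized.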

DK

David Kanter · Feb 26, 2006

I can see better FP resources being appreciated by a fair number of
folks. Anyone doing content related stuff (sound, video, photo, etc.)
would certainly like that.

Will it appeal to webservers, database servers? Probably not. I sort
of get the idea the point is to appeal to gamers and to try to erode
IPF's advantages.

DK

George Macdonald · Feb 26, 2006

I can see better FP resources being appreciated by a fair number of
folks. Anyone doing content related stuff (sound, video, photo, etc.)
would certainly like that.

Will it appeal to webservers, database servers? Probably not. I sort
of get the idea the point is to appeal to gamers and to try to erode
IPF's advantages.

Yeah sure but the article referenced talked about the "server variant"
having 2xFP, in the context that future AMD CPUs would differ more from
each other according to market segment.

David Kanter · Feb 26, 2006

George said:
Yeah sure but the article referenced talked about the "server variant"
having 2xFP, in the context that future AMD CPUs would differ more from
each other according to market segment.

So if you follow these things, AMD is hypothetically going to have a
mobile chip and a server chip. They will then scale up the mobile or
scale down the server. My thoughts are that it's easier to do the
latter...

DK

Yousuf Khan · Feb 27, 2006

David said:
Actually a vector unit would be rather different from an FPU.
See...one natively executes vector instructions, with vector data, the
other FP operations on scalar data.

SSE is not vectored.

So what? SSE units are different from FPUs are different from ALUs.

There would be a big difference. Try and think about how many FLOPs a
chip would have with 4 SSEn units...

Or 4 FP units in general for that matter, right?

Yousuf Khan

David Kanter · Feb 27, 2006

SSE is not vectored.

Yes it is. It may not be vector in the sense of doing
scatter-gather and having thousands of elements, but that's not really
part of the definition of vector.

3DNow is just as much a vector extension and so is Altivec.

Or 4 FP units in general for that matter, right?

No, that's not really all that relevant. If you have 4 FPUs and you
want to use them, you need to:

1. Issue more instructions
2. Support more LD/ST pipes
3. Have more register and cache ports

That spells a brand new uarch and lots of effort.
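
A toy bottleneck model shows why the three items above matter. It is a sketch under loud assumptions: every parameter value is hypothetical (not K8's real figures), each FP op is assumed to need a fixed number of loads/stores, and those memory ops compete for the same issue slots.

```python
def sustained_fp_ops(fp_units, issue_width, ldst_pipes, mem_ops_per_fp_op):
    """Sustained FP ops per clock = the tightest of three limits (toy model)."""
    # Limit 1: number of FP execution units.
    unit_limit = fp_units
    # Limit 2: issue slots, shared between FP ops and their loads/stores.
    issue_limit = issue_width / (1 + mem_ops_per_fp_op)
    # Limit 3: load/store pipes feeding the FP ops.
    if mem_ops_per_fp_op > 0:
        ldst_limit = ldst_pipes / mem_ops_per_fp_op
    else:
        ldst_limit = float("inf")
    return min(unit_limit, issue_limit, ldst_limit)

# Hypothetical 3-wide core with 2 LD/ST pipes, one memory op per FP op.
print(sustained_fp_ops(fp_units=2, issue_width=3, ldst_pipes=2, mem_ops_per_fp_op=1.0))
# Doubling only the FPUs changes nothing in this model: issue is the bottleneck.
print(sustained_fp_ops(fp_units=4, issue_width=3, ldst_pipes=2, mem_ops_per_fp_op=1.0))
# Widening issue and LD/ST as well is what lets the extra FPUs pay off.
print(sustained_fp_ops(fp_units=4, issue_width=8, ldst_pipes=4, mem_ops_per_fp_op=1.0))
```

In this model the extra two FPUs sit idle until the front end and memory pipes grow with them.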

If you want to do this the cheap and easy way, you simply increase the
throughput and decrease the latency of SSEn operations. Basically, you
would pretend that 2 FPUs are a single SSEn unit.

Alternatively, you could add more instructions like the TFP stuff that
was planned ages ago, before it got dumped by AMD.

I think Tejas was going the route of adding stuff like FMACs.

DK

Keith · Feb 28, 2006

Yes it is. It may not be vector in the sense of doing
scatter-gather and having thousands of elements, but that's not really
part of the definition of vector.

3DNow is just as much a vector extension and so is Altivec.

No, that's not really all that relevant. If you have 4 FPUs and you
want to use them, you need to:

1. Issue more instructions

Sure, and more SSE units wouldn't?

2. Support more LD/ST pipes

Not necessarily.

3. Have more register and cache ports

Sure, but this isn't an architectural issue. ...a simple matter of
implementation.

That spells a brand new uarch and lots of effort.

So? If there is a justification...

If you want to do this the cheap and easy way, you simply increase the
throughput and decrease the latency of SSEn operations. Basically, you
would pretend that 2 FPUs are a single SSEn unit.

Why? Sounds like a power-hog for no good reason. Aren't FPUs (at least
x87 style) going away?

Alternatively, you could add more instructions like the TFP stuff that
was planned ages ago, before it got dumped by AMD.

I wonder why? Perhaps because it's not as useful as its proponents
pretend? Is Intel making hay with them?

I think Tejas was going the route of adding stuff like FMACs.

Was? Oh, my!

David Kanter · Feb 28, 2006

No, that's not really all that relevant. If you have 4 FPUs and you
want to use them, you need to:

1. Issue more instructions

Sure, and more SSE units wouldn't?

An SSE unit doesn't issue anything, nor does an FPU. That's part of
the front end.

Not necessarily.

If you try and issue 50 instructions with only 1 LD/ST each cycle, you
are building what would technically be called a 'mistake'. If you
want to have a lot more IPC, eventually you'll need more LD/STs.
Alternatively, you can add instructions that LD or ST more data at once
(LD quad word or something).

Sure, but this isn't an architectural issue. ...a simple matter of
implementation.

It is a microarchitectural issue, which would basically result in a
brand new core. You would probably want to add branch units and
integer units as well. Also multithreading to make efficient use of
said extras.

So? If there is a justification...

I don't think AMD has the manpower for that. They are trying to do two
parallel designs for the future (mobile and server). I'm rather
confident that they don't have 3 good design teams.

Why? Sounds like a power-hog for no good reason. Aren't FPUs (at least
x87 style) going away?

Well...maybe for better SSE performance. You can turn things off when
you aren't using them, it's pretty easy, and die space is plentiful.
They could actually design SSE units, but I don't think that's likely
to happen...AMD's strategy of just doing scalar execution works
reasonably well. It certainly will up to the point where the vectors
get longer.

I wonder why? Perhaps because it's not as useful as its proponents
pretend? Is Intel making hay with them?

Well, considering how badly outclassed x86 is by every halfway reasonable
MPU with FMACs, the answer is yes. The POWER5, a 130nm design, beats
all x86 designs (90nm, 65nm whatever) quite handily in SPECfp. Ditto
for Madison 9M, another 130nm design. FMAC is an undeniable advantage
no matter how you slice it.

DK

Keith · Feb 28, 2006

An SSE unit doesn't issue anything, nor does an FPU. That's part of
the front end.

If you try and issue 50 instructions with only 1 LD/ST each cycle, you
are building what would technically be called a 'mistake'. If you
want to have a lot more IPC, eventually you'll need more LD/STs.
Alternatively, you can add instructions that LD or ST more data at once
(LD quad word or something).

...or perhaps add more architected registers so you're not
thrashing the LD/ST unit with unneeded activity.

It is a microarchitectural issue, which would basically result in a
brand new core. You would probably want to add branch units and
integer units as well. Also multithreading to make efficient use of
said extras.

More ports doesn't mean a microarchitectural change. That's an
implementation detail.

I don't think AMD has the manpower for that. They are trying to do two
parallel designs for the future (mobile and server). I'm rather
confident that they don't have 3 good design teams.

That wasn't at issue. If there is a justification it'll be done.
You're good at throwing in red herrings, eh?

Well...maybe for better SSE performance. You can turn things off when
you aren't using them, it's pretty easy, and die space is plentiful.
They could actually design SSE units, but I don't think that's likely
to happen...AMD's strategy of just doing scalar execution works
reasonably well. It certainly will up to the point where the vectors
get longer.

Well, considering how badly outclassed x86 is by every halfway reasonable
MPU with FMACs, the answer is yes. The POWER5, a 130nm design, beats
all x86 designs (90nm, 65nm whatever) quite handily in SPECfp. Ditto
for Madison 9M, another 130nm design. FMAC is an undeniable advantage
no matter how you slice it.

Now *you* are changing the architecture.

David Kanter · Feb 28, 2006

An SSE unit doesn't issue anything, nor does an FPU. That's part of
the front end.

Duh! <sheesh!> There is no point in more execution units (of a
type) than the issue width.

That's not entirely true. It might be worth having more execution
resources than peak steady state issue/retire can support so you can
efficiently clear backlogs.

I don't think it is, but there are a lot of designs where fetch !=
issue != execute != retire width. The POWER5 issues up to 8
instructions, but can only execute and retire 5 per cycle. I think the
K7/8 also have some asymmetries in the pipelines.

Generally, I agree with you though.

...or perhaps add more architected registers so you're not
thrashing the LD/ST unit with unneeded activity.

You still need to issue loads and stores. Either way, x86 has about 14
GPRs. 4 FPUs would consume 8 operands and produce 4, so you're
basically flushing your reg file each cycle. That's assuming you don't
have any instructions that use 3 regs as input.

More ports doesn't mean a microarchitectural change. That's an
implementation detail.

Changing the number of reg ports by one or two is minor. Changing your
L1D cache porting is a pretty major undertaking, especially if you only
had a single port before. Ask Mitch Alsup or someone who does this for
a living.

That wasn't at issue. If there is a justification it'll be done.
You're good at throwing in red herrings, eh?

That's not a red herring if it relates to reality, which it does. AMD
cannot design 3 new architectures. They have said that they have a new
mobile and a new server uarch in the pipeline...combining those
statements results in a particular conclusion.

Now *you* are changing the architecture.

That's right. I'm pointing out that having an FMA is a huge
performance boost. Since x86 doesn't have one, I have to use other
things to show this. Ask anyone who designs chips if FMAs are a good
idea...

It doubles your FLOPs, and if you have the memory to support it, is a
huge boost.
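
The doubling is easy to see on a dot product, where every element needs one multiply and one add. The sketch below just counts instructions; the function names are made up for illustration, and it ignores loads, stores and latency entirely.

```python
def dot_instruction_count(n, has_fma):
    """FP instructions needed for an n-element dot product (toy count)."""
    # Without FMA: one multiply plus one add per element.
    # With FMA: one fused multiply-add per element.
    return n if has_fma else 2 * n

def flops_per_instruction(n, has_fma):
    """Useful FLOPs (2 per element) delivered per FP instruction issued."""
    return (2 * n) / dot_instruction_count(n, has_fma)

n = 1000
print("separate mul+add:", flops_per_instruction(n, has_fma=False), "FLOPs per instruction")
print("fused FMA:       ", flops_per_instruction(n, has_fma=True), "FLOPs per instruction")
```

Same number of issue slots and execution pipes, twice the arithmetic per instruction. Whether that shows up in practice still depends on memory keeping up, as noted above.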

DK
