64 bit 604-pin Xeon Nocona

  • Thread starter: RusH

RusH

Intel showed this thingie at the 2004 Barcelona IDF.
The funny thing - 64-bitness is implemented with two 32-bit units, and 64-bit
instructions need TWICE the time compared to 32-bit instructions. I may be
wrong here, but that's what I understood when reading/hearing about this
chip. My impression was that Intel concentrated on the amount of memory
addressed by the chip, silently omitting any performance-boost pros.

So is this brand new shiny Intel EM64T really THAT POS ?

Regards.
 
RusH said:
Intel showed this thingie at the 2004 Barcelona IDF.
The funny thing - 64-bitness is implemented with two 32-bit units, and 64-bit
instructions need TWICE the time compared to 32-bit instructions. I may be
wrong here, but that's what I understood when reading/hearing about this
chip. My impression was that Intel concentrated on the amount of memory
addressed by the chip, silently omitting any performance-boost pros.
So is this brand new shiny Intel EM64T really THAT POS ?

It's not the integers, it's the addressing that matters for the actual
applications that need 64-bit support on x86. There are also scientific
computing applications and things like crypto which are faster with 64-bit
ALUs, but by and large Intel is going to still push those towards Itanium
(along with the really big transaction processing stuff.)
 
Intel showed this thingie at the 2004 Barcelona IDF.
The funny thing - 64-bitness is implemented with two 32-bit units, and 64-bit
instructions need TWICE the time compared to 32-bit instructions. I may be
wrong here, but that's what I understood when reading/hearing about this
chip. My impression was that Intel concentrated on the amount of memory
addressed by the chip, silently omitting any performance-boost pros.

The memory addressed is the reason for 64-bits in the first place!
There is no "performance boost", unless you count the extra registers.
If all else is equal, 64-bit is almost always going to be slower than
32-bit code until you start running into memory addressing problems
(at which point 32-bit code might simply break down altogether). The
only reason why AMD64 code isn't slower (and, in fact, is occasionally
faster) is that AMD doubled the number of integer registers in 64-bit
long mode vs. 32-bit mode.

Generally speaking you don't want to use 64-bit integers anyway, all
they do is waste memory bandwidth and cache space vs. 32-bit ones.
The only time you want 64-bit integers is on the rare occasion when
you actually need integers with a range greater than 4 billion. With
32-bit code you would need two separate variables and two separate
registers to store those variables, plus at least two instructions to
deal with them. With 64-bit code you can handle this all at once.
Situations where such variables are needed are rare though, and it's
even more rare that they are used in a performance-critical part of
the code.
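
To make the "two registers, at least two instructions" point concrete, here is a minimal Python sketch of how 32-bit code adds two 64-bit values: add the low halves, then add the high halves plus the carry (roughly what an ADD followed by an ADC does on x86). The function name `add64_via_32` and the (low, high) pair representation are made up for illustration.

```python
MASK32 = 0xFFFFFFFF

def add64_via_32(a_lo, a_hi, b_lo, b_hi):
    """Add two 64-bit values held as (low, high) 32-bit halves."""
    lo = a_lo + b_lo
    carry = lo >> 32                      # 1 if the low halves overflowed 32 bits
    lo &= MASK32
    hi = (a_hi + b_hi + carry) & MASK32   # wraps like a 64-bit register would
    return lo, hi

# 0x00000001_FFFFFFFF + 1 carries out of the low half
lo, hi = add64_via_32(0xFFFFFFFF, 0x1, 0x1, 0x0)
print(hex(hi), hex(lo))  # prints: 0x2 0x0
```

In 64-bit code the same operation is a single add on a single register.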
So is this brand new shiny Intel EM64T really THAT POS ?

Probably not, though until someone gets a chance to actually test it
there isn't much to go on. Remember that the Opteron also takes a
fair bit more time to do 64-bit integer instructions vs. 32-bit ones
(e.g. the Opteron can do one int multiply every cycle with a 3-cycle
latency in 32-bit mode vs. one every other cycle with a 4-cycle latency
in 64-bit mode).

The real question is how well Intel's x86-64 chip can handle 64-bit
pointers vs. 32-bit ones. Where you only very rarely want to use
64-bit integers, you ALWAYS use 64-bit addresses (pointers) while in
long mode. AMD did a very good job of keeping this part of their chip
up to snuff, and Intel is going to want to do the same if they want to
remain performance-competitive.
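
One way to see why pointer handling matters: every pointer doubles in size in long mode, so pointer-heavy data structures grow. The sketch below estimates the size of a linked-list node (one pointer plus one 32-bit payload) under an assumed "align each field to its own size" rule, which is typical of x86 ABIs but is an assumption here; `padded_size` is a hypothetical helper.

```python
def padded_size(field_sizes):
    """Size of a C-like struct where each field is aligned to its own size."""
    offset = 0
    for size in field_sizes:
        offset = (offset + size - 1) // size * size    # align this field
        offset += size
    biggest = max(field_sizes)
    return (offset + biggest - 1) // biggest * biggest  # pad the tail

# A list node: one "next" pointer plus one 32-bit payload
node32 = padded_size([4, 4])   # 4-byte pointer, 4-byte int
node64 = padded_size([8, 4])   # 8-byte pointer, 4-byte int
print(node32, node64)  # prints: 8 16
```

Under these assumptions the same list takes twice the cache space in 64-bit mode, even though the payload has not changed.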
 
The memory addressed is the reason for 64-bits in the first place!
There is no "performance boost", unless you count the extra registers.
If all else is equal, 64-bit is almost always going to be slower than
32-bit code until you start running into memory addressing problems
(at which point 32-bit code might simply break down altogether). The
only reason why AMD64 code isn't slower (and, in fact, is occasionally
faster) is that AMD doubled the number of integer registers in 64-bit
long mode vs. 32-bit mode.


Really? The integrated memory controller helps none? If you're
going to count the extra registers (not available in 32bit mode),
perhaps the memory subsystem helps too (is available in 32 bit
mode)?
Generally speaking you don't want to use 64-bit integers anyway, all
they do is waste memory bandwidth and cache space vs. 32-bit ones.

That's a rather broad statement.
The only time you want 64-bit integers is on the rare occasion when
you actually need integers with a range greater than 4 billion. With
32-bit code you would need two separate variables and two separate
registers to store those variables, plus at least two instructions to
deal with them.

Or want to avoid floats. Or have arrays of logicals, or...
Logic simulation is one place that the-more-the-merrier. Crypto
is another, just off the top of my head. There are many reasons
for 64-bit processors, though as you've said the address space is
the most obvious one. Note that now that real memory space is
getting close to the limits of virtual, more virtual is goodness
too.
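
As a rough illustration of the "arrays of logicals" and logic-simulation case, the sketch below packs one signal's value for 64 test vectors into a single word, so one bitwise operation evaluates a gate for all 64 vectors at once. The circuit and the name `simulate` are invented for the example.

```python
WORD = 64
MASK = (1 << WORD) - 1

def simulate(a, b, c):
    """Evaluate out = (a AND b) XOR (NOT c) for 64 test vectors at once.

    Bit i of each argument is the value of that signal in test vector i.
    """
    return ((a & b) ^ (~c & MASK)) & MASK

a = 0b1100
b = 0b1010
c = MASK  # c is 1 in every vector, so NOT c is 0 everywhere
print(bin(simulate(a, b, c)))  # prints: 0b1000
```

With 32-bit registers the same loop would cover only 32 vectors per operation, which is why wider words help this kind of code.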

With 64-bit code you can handle this all at once.
Situations where such variables are needed are rare though, and it's
even more rare that they are used in a performance-critical part of
the code.

Depends on your definition of "performance-critical". I'd say
that were it performance critical, it *would* be done in 64-bit
code, if possible. It's not like 64-bit processors were invented
yesterday and no one has a use for them until the next millennium.
Probably not, though until someone gets a chance to actually test it
there isn't much to go on. Remember that the Opteron also takes a
fair bit more time to do 64-bit integer instructions vs. 32-bit ones
(e.g. the Opteron can do one int multiply every cycle with a 3-cycle
latency in 32-bit mode vs. one every other cycle with a 4-cycle latency
in 64-bit mode).

Multiplies are a bad example. How about an add? shift?
Logical? The only thing worse than a multiply is a divide, and
that has nothing to do with AMD or anyone else. Indeed the Intel
P4 is *very* bad at integer multiplies, even 32 bit ones.
The real question is how well Intel's x86-64 chip can handle 64-bit
pointers vs. 32-bit ones. Where you only very rarely want to use
64-bit integers, you ALWAYS use 64-bit addresses (pointers) while in
long mode. AMD did a very good job of keeping this part of their chip
up to snuff, and Intel is going to want to do the same if they want to
remain performance-competitive.

Obviously. When was the last time you multiplied addresses
though. ;-)
 
Tony said:
The memory addressed is the reason for 64-bits in the first place!
There is no "performance boost", unless you count the extra registers.
If all else is equal, 64-bit is almost always going to be slower than
32-bit code until you start running into memory addressing problems
(at which point 32-bit code might simply break down altogether). The
only reason why AMD64 code isn't slower (and, in fact, is occasionally
faster) is that AMD doubled the number of integer registers in 64-bit
long mode vs. 32-bit mode.

Yeah, but so did Intel. It too has doubled the number of integer registers
in 64-bit mode vs. 32-bit mode. Or did you forget that EM64T is a direct
copy of AMD64? So with all of this being equal, why would Intel be so far
behind?

Here are some articles about it:

English: http://www.theinquirer.net/?article=15149

German: http://www.heise.de/ct/04/08/020/

Yousuf Khan
 
Intel showed this thingie at the 2004 Barcelona IDF.
The funny thing - 64-bitness is implemented with two 32-bit units, and 64-bit
instructions need TWICE the time compared to 32-bit instructions. I may be
wrong here, but that's what I understood when reading/hearing about this
chip. My impression was that Intel concentrated on the amount of memory
addressed by the chip, silently omitting any performance-boost pros.

I don't think it needs twice the time to execute anything. Ignoring all
the speculation he has about Pentium 5s and 6s, this article
http://www.chip-architect.com/news/2003_04_20_Looking_at_Intels_Prescott_part2.html
and predecessors in the series seems to indicate that there are indeed two
32-bit cores and two sets of 32-bit registers but that they'd work like
"the good old bit slices", i.e. like many of the 80s minicomputers. E.g.,
for their MV Series 32-bit minicomputers, Data General used AMD 2900
8-bit(4-bit? - too hazy) bit slice processors.
So is this brand new shiny Intel EM64T really THAT POS ?

It *is* gonna get interesting when we get to compare 64-bit performance of
the two CPUs.:-)

Rgds, George Macdonald

"Just because they're paranoid doesn't mean you're not psychotic" - Who, me??
 
Really? The integrated memory controller helps none? If you're

Uhh, last I checked the integrated memory controller worked just fine
in 32-bit mode as well as 64-bit mode! :>

Perhaps I should have made the above a bit more clear, I was talking
about 32-bit vs. 64-bit code on the Athlon64/Opteron, and NOT about
32-bit code on the AthlonXP vs. 64-bit code on the Athlon64/Opteron.
That's a rather broad statement.

Fairly broad, hence the reason why I said "Generally speaking". There
are situations where 64-bit integers are beneficial, 99 times out of
100 that is not the case.
Or want to avoid floats. Or have arrays of logicals, or...
Logic simulation is one place that the-more-the-merrier. Crypto
is another, just off the top of my head. There are many reasons

Sure, there are plenty of applications that will benefit from 64-bit
integers, but as mentioned above, 99 times out of 100 you're probably
going to be better off using 32-bit ints.
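
The bandwidth and cache argument can be put in numbers with a small sketch. It assumes a 64-byte cache line (check your CPU; this is not taken from the thread) and uses Python's `struct` module only to get the standard 4-byte and 8-byte integer sizes.

```python
import struct

CACHE_LINE = 64  # bytes; an assumed line size, not a figure from the thread

def per_line(fmt):
    """How many values of the given struct format fit in one cache line."""
    return CACHE_LINE // struct.calcsize(fmt)

print(per_line("<i"), per_line("<q"))  # prints: 16 8
```

Half as many 64-bit values fit per line, so an array that does not need the extra range costs twice the memory traffic.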
for 64-bit processors, though as you've said the address space is
the most obvious one. Note that now that real memory space is
getting close to the limits of virtual, more virtual is goodness
too.

For sure! I still see some of the trade-rags saying that 64-bit chips
are only helpful for people who want more than 4GB of physical memory!
Not so at all! Even with only 2GB of physical memory your system
starts to become rather constrained by the limits of a 32-bit chip.
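
A back-of-the-envelope sketch of why 2GB of RAM already strains a 32-bit address space. The 2 GiB user/kernel split and the process layout below are assumptions for illustration; the real split varies by operating system.

```python
GIB = 1 << 30

virtual_32 = 1 << 32     # bytes addressable with a 32-bit pointer
user_space = 2 * GIB     # assumed user/kernel split; varies by OS

def fits(mappings, space=user_space):
    """True if the listed mappings (in bytes) fit in the user address space."""
    return sum(mappings) <= space

print(virtual_32 // GIB)  # prints: 4

# Hypothetical process: code + libraries, a 1 GiB heap, one 1 GiB mapped file
print(fits([GIB // 4, GIB, GIB]))  # prints: False
```

The process runs out of addresses before the machine runs out of physical memory, which is the constraint described above.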
Depends on your definition of "performance-critical". I'd say
that were it performance critical, it *would* be done in 64-bit
code, if possible. It's not like 64-bit processors were invented
yesterday and no one has a use for them until the next millennium.

What I meant by the above is that the most common uses for 64-bit
variables in the "standard" applications that most users run, ie
office suites, word processors, games, etc., are for "extra"
variables. Things like keeping track of time and date, or some data
pointers in the file system or similar uses. A lot of the real
number-crunching is done using 32-bit ints. Of course, a certain part
of this is a chicken and egg kind of thing. With a 32-bit chip (by
far the most common type of x86 chip) it might be worthwhile doing the
number crunching with 32-bit code, hence 64-bit chips wouldn't see
much benefit. On the other hand, 64-bit chips COULD do things faster if
the code were rewritten to handle the number crunching using 64-bit
ints.
Multiplies are a bad example. How about an add? shift?
Logical? The only thing worse than a multiply is a divide, and
that has nothing to do with AMD or anyone else. Indeed the Intel

I just mentioned the multiply because that was the only instruction
that I had all the numbers for off-hand. I seem to recall that most
of the simple instructions like those you listed have the same
throughput in 64-bit mode vs. 32-bit mode on the Opteron, but some
(most? all?) have higher latency.
P4 is *very* bad at integer multiplies, even 32 bit ones.

My understanding is that Intel has "fixed" this problem with the
Prescott core for the P4.
Obviously. When was the last time you multiplied addresses
though. ;-)

Hehe, true enough! Might be able to get some really nifty hack if you
did multiply some addresses... if you could get it to do anything
remotely useful! :>
 
Nate said:
There are also scientific computing applications and things like
crypto which are faster with 64-bit ALUs, but by and large Intel is
going to still push those towards Itanium (along with the really
big transaction processing stuff.)

As far as I can tell, cryptography applications make heavy use of bit
rotates, and Itanium 2 can only perform two shifts per cycle, i.e. one
bit rotate per cycle.
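
The "two shifts make one rotate" reasoning follows from the usual identity for building a rotate on a machine without a dedicated rotate instruction: shift left, shift right by the complement, and OR the results. A minimal sketch (the name `rotl` is just for illustration, and `x` is assumed to fit in `width` bits):

```python
def rotl(x, n, width=64):
    """Rotate x left by n bits using two shifts and an OR."""
    mask = (1 << width) - 1
    n %= width
    return ((x << n) | (x >> (width - n))) & mask

print(hex(rotl(0x8000000000000001, 1)))  # prints: 0x3
```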
 
Tony said:
I just mentioned the multiply because that was the only instruction
that I had all the numbers for off-hand. I seem to recall that most
of the simple instructions like those you listed have the same
throughput in 64-bit mode vs. 32-bit mode on the Opteron, but some
(most? all?) have higher latency.

Software Optimization Guide for K8 Processors
http://www.amd.com/us-en/assets/content_type/white_papers_and_tech_docs/25112.PDF
Appendix C - Instruction Latencies

AFAICT, all integer instructions have the same latency except:
BSF
DIV, IDIV
MUL, IMUL
LOOPcc, INVLPG, SGDT, SIDT, WBINVD (faster in 64-bit mode)
 
Uhh, last I checked the integrated memory controller worked just fine
in 32-bit mode as well as 64-bit mode! :>

Perhaps I should have made the above a bit more clear, I was talking
about 32-bit vs. 64-bit code on the Athlon64/Opteron, and NOT about
32-bit code on the AthlonXP vs. 64-bit code on the Athlon64/Opteron.

Ah, I wuz talking about the AMD vs. Intel approach to memory.
Talking past one another is a Usenet hazard. ;-)
Fairly broad, hence the reason why I said "Generally speaking". There
are situations where 64-bit integers are beneficial, 99 times out of
100 that is not the case.

Maybe *your* 99. ...not mine! I assure you that if more bits
are available they *will* be used. Perhaps not this month, but
they will be used. There are tons of serious applications that
can use the bits.
Sure, there are plenty of applications that will benefit from 64-bit
integers, but as mentioned above, 99 times out of 100 you're probably
going to be better off using 32-bit ints.

Me? 99 times? That's why I have a 64b machine on (well under)
my desk? 1/99 times? BTW, it's been there for over two years.
Of course a wise[*] man told me today that the reason one had
64bit processors was to develop 64bit processors (ouch! ;-).
For sure! I still see some of the trade-rags saying that 64-bit chips
are only helpful for people who want more than 4GB of physical memory!
Not so at all! Even with only 2GB of physical memory your system
starts to become rather constrained by the limits of a 32-bit chip.

The same wise[*] man said that 64bit processors won't be needed
until *INTEL* is ready with their 64bit products. I did say that
he was a wise man.
What I meant by the above is that the most common uses for 64-bit
variables in the "standard" applications that most users run, ie
office suites, word processors, games, etc., are for "extra"
variables. Things like keeping track of time and date, or some data
pointers in the file system or similar uses.


....like file pointers? ;-) Again, virtual memory is an
important concept. Once real=virtual why bother with MMUs?
Well...
A lot of the real
number-crunching is done using 32-bit ints. Of course, a certain part
of this is a chicken and egg kind of thing. With a 32-bit chip (by
far the most common type of x86 chip)

By definition. ...Since Intel doesn't have one yet, it's a
"chicken" without the "egg". ;-)

it might be worthwhile doing the
number crunching with 32-bit code, hence 64-bit chips wouldn't see
much benefit. On the other hand, 64-bit chips COULD do things faster if
the code were rewritten to handle the number crunching using 64-bit
ints.

No more difficult than doing FPU => SSE, indeed it's a lot
simpler to deal with FX than FP. ...None of that nasty rounding
and denorm stuff. Far faster too.
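
For readers unfamiliar with fixed point (FX), here is a small sketch of the idea: store a value as an integer scaled by a power of two, and multiply with an integer multiply plus a shift. The Q32.32 layout and the helper names are assumptions chosen for the example. Python integers never overflow, so this hides the fact that real 64-bit hardware needs the full 128-bit product before the shift.

```python
FRAC_BITS = 32            # Q32.32: 32 integer bits, 32 fraction bits
ONE = 1 << FRAC_BITS

def to_fixed(x):
    return int(round(x * ONE))

def fx_mul(a, b):
    """Multiply two Q32.32 values; the raw product has 64 fraction bits."""
    return (a * b) >> FRAC_BITS

def to_float(a):
    return a / ONE

print(to_float(fx_mul(to_fixed(1.5), to_fixed(2.25))))  # prints: 3.375
```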
I just mentioned the multiply because that was the only instruction
that I had all the numbers for off-hand. I seem to recall that most
of the simple instructions like those you listed have the same
throughput in 64-bit mode vs. 32-bit mode on the Opteron, but some
(most? all?) have higher latency.

Nope. If so, it's a *bad* design. If you know how to do long
multiplication and division, it's obvious why multiply and divide
are horrendous. Add/subtract/logicals are truly simple (linear),
by comparison.
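
The long-multiplication point can be sketched directly: a schoolbook 64x64 multiply built from 32-bit pieces needs four partial products plus the additions to combine them, while a 64-bit add built from 32-bit pieces needs only two adds. This is an illustration of the arithmetic, not a description of how any particular CPU's multiplier is wired; `mul64_via_32` is a made-up name.

```python
MASK32 = 0xFFFFFFFF

def mul64_via_32(a, b):
    """Schoolbook 64x64 -> 128-bit multiply from four 32x32 partial products."""
    a_lo, a_hi = a & MASK32, a >> 32
    b_lo, b_hi = b & MASK32, b >> 32
    # Four partial products, versus a single one for a 32x32 multiply
    ll = a_lo * b_lo
    lh = a_lo * b_hi
    hl = a_hi * b_lo
    hh = a_hi * b_hi
    return ll + ((lh + hl) << 32) + (hh << 64)

a, b = 0xFFFFFFFFFFFFFFFF, 0x123456789ABCDEF0
print(mul64_via_32(a, b) == a * b)  # prints: True
```

Doubling the operand width doubles the work for an add but quadruples the partial products for a schoolbook multiply.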
My understanding is that Intel has "fixed" this problem with the
Prescott core for the P4.

Perhaps, though I haven't seen anything official. Indeed I've
heard all sorts of conflicting opinion on all four quadrants. Me
thinks Intel is too embarrassed about the P4 to admit what
they've improved.
Hehe, true enough! Might be able to get some really nifty hack if you
did multiply some addresses... if you could get it to do anything
remotely useful! :>

Ok, enquiring minds want to know what the alcohol content is of
the modern Canukistan beer?! ...and have you told Red, about it?
;-)
 
KR said:
Perhaps, though I haven't seen anything official. Indeed I've
heard all sorts of conflicting opinion on all four quadrants.
Me thinks Intel is too embarrassed about the P4 to admit what
they've improved.

You haven't looked very hard, have you? :-)

IA-32 Optimization Reference Manual
http://intel.com/design/pentium4/manuals/248966.htm
Appendix C - IA-32 Instruction Latency and Throughput

As far as I can tell,

0xF3 = Prescott
0xF2 = Northwood and Willamette (?)

Latency, cycles    0xF3     0xF2

BSF/BSR            16       8
BSWAP              1        7

MUL                10       14-18
DIV                66-80    56-70

ROL/ROR            1        4
SHL/SHR            1        4

The throughput of MUL was also vastly improved:
A new MUL every 5 cycles on 0xF2 vs every cycle on 0xF3.

MUL is still slower on Prescott than on Hammer :-)
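
To show why throughput can matter more than latency, here is a deliberately simple pipeline model (an assumption, not anything from Intel's manual): independent multiplies overlap, so total time is one latency plus one issue interval per extra multiply, while a dependent chain pays the full latency every time.

```python
def cycles(n, latency, interval, dependent=False):
    """Cycles to finish n multiplies on a simple pipelined unit.

    latency:  cycles from issue to result
    interval: cycles between successive issues (1 / throughput)
    """
    if dependent:                 # each MUL waits for the previous result
        return n * latency
    return latency + (n - 1) * interval

# Figures quoted above; 14 is the low end of the 14-18 range for 0xF2
print(cycles(100, latency=14, interval=5))   # 0xF2, prints: 509
print(cycles(100, latency=10, interval=1))   # 0xF3, prints: 109
```

Under this model most of the gain on independent multiplies comes from the issue interval dropping from 5 cycles to 1, not from the lower latency.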
 
You haven't looked very hard, have you? :-)

You would be right. ;-)

I'm looking for an admission that they sucked. All I see is a
"look how great we are now compared to how 'great' we were last
week"! ;-)
IA-32 Optimization Reference Manual
http://intel.com/design/pentium4/manuals/248966.htm
Appendix C - IA-32 Instruction Latency and Throughput

As far as I can tell,

They aren't being exactly up front, eh? ;-)
0xF3 = Prescott
0xF2 = Northwood and Willamette (?)

Latency, cycles    0xF3     0xF2

BSF/BSR            16       8
BSWAP              1        7

MUL                10       14-18
DIV                66-80    56-70

ROL/ROR            1        4
SHL/SHR            1        4

The throughput of MUL was also vastly improved:
A new MUL every 5 cycles on 0xF2 vs every cycle on 0xF3.

MUL is still slower on Prescott than on Hammer :-)

Certainly the word from the micro-architects was that they
couldn't fit the P4, with multiplier and shifter, so had to kick
something off. The question is "why", if it didn't gain any
advantage (other than raw MHz) over what they already had?

Intel has lost a ton of credibility over the last five years
(started with the PII and then the neutered PII, IMHO).
 
Ah, I wuz talking about the AMD vs. Intel approach to memory.
Talking past one another is a Usenet hazard. ;-)

Hehe, yup!
Maybe *your* 99. ...not mine! I assure you that if more bits
are available they *will* be used. Perhaps not this month, but
they will be used. There are tons of serious applications that
can use the bits.

Maybe 99 times out of 100 for 99 out of 100 people? :>

I suspect that neither you nor I are "typical" computer users. Even
so, a lot of the work I do (normally centered more around
compiling code for embedded chips, though right now it's centered
around writing resumes to find a new job! :> ) doesn't tend to make
much use of 64-bit ints. On the other hand, if you're doing some big
Place 'n Route run, 64-bit ints could very well come in handy from
time to time.
Sure, there are plenty of applications that will benefit from 64-bit
integers, but as mentioned above, 99 times out of 100 you're probably
going to be better off using 32-bit ints.

Me? 99 times? That's why I have a 64b machine on (well under)
my desk? 1/99 times? BTW, it's been there for over two years.
Of course a wise[*] man told me today that the reason one had
64bit processors was to develop 64bit processors (ouch! ;-).

Hehe, unfortunately a rather accurate statement for many at this point
in time! Of course, developing 64-bit processors is not exactly what
most would call "typical" computer use, even among the power users of
the world who are likely to buy new high-end systems!
For sure! I still see some of the trade-rags saying that 64-bit chips
are only helpful for people who want more than 4GB of physical memory!
Not so at all! Even with only 2GB of physical memory your system
starts to become rather constrained by the limits of a 32-bit chip.

The same wise[*] man said that 64bit processors won't be needed
until *INTEL* is ready with their 64bit products. I did say that
he was a wise man.

That follows with what Intel's saying about 64-bit chips. "No one
will need 64-bit chips until around the end of the decade!". Somehow
I don't buy it though. Perhaps I'm not a very wise man? ;-)
By definition. ...Since Intel doesn't have one yet, it's a
"chicken" without the "egg". ;-)

So does that make AMD the "turkey" in this whole argument? :>
No more difficult than doing FPU => SSE, indeed it's a lot
simpler to deal with FX than FP. ...None of that nasty rounding
and denorm stuff. Far faster too.

Still requires some lazy programmers to get off their butts though!
I'm sure we'll see some applications switch to 64-bit code really
quickly and they will see a large benefit, but by and large I don't
expect to see a large quantity of 64-bit code until the end of the
decade at least. Too many legacy systems and programmers who don't
have the time and/or money to support two code streams.
Nope. If so, it's a *bad* design. If you know how to do long
multiplication and division, it's obvious why multiply and divide
are horrendous. Add/subtract/logicals are truly simple (linear),
by comparison.

For sure! Actually according to another message posted in this group
I was wrong, almost all the simple instructions do indeed have the
same latency in 64-bit mode as in 32-bit.
Perhaps, though I haven't seen anything official. Indeed I've
heard all sorts of conflicting opinion on all four quadrants. Me
thinks Intel is too embarrassed about the P4 to admit what
they've improved.

I'm not sure if it's entirely embarrassment or simply marketing. They
still have a LOT of Northwood chips in the market, and they might be
worried about pushing the Prescott too hard before production is fully
ramped. Of course, that doesn't seem to be much of a problem right
now, given the rather negative reception that Prescott has received so
far.

Still, you gotta figure that they did SOMETHING to improve the chips!
They did more than double the number of transistors in the thing,
hopefully they've got something to show for it!
Ok, enquiring minds want to know what the alcohol content is of
the modern Canukistan beer?! ...and have you told Red, about it?
;-)

Who? Me? Drinking beer? Never! Well, maybe one or two.. It IS
hockey playoff time here! :>
 