AMD for numerics - hype or not?

  • Thread starter: alex goldman

alex goldman

AMD marks its processors with numbers like 4000+ in addition to the
clock rate. I believe the number is supposed to indicate how fast the chip is
relative to Intel's, i.e. a 3200+ AMD chip should be as fast as a
3.2 GHz P4.

I was wondering if this is true for numerics, especially its holy grail,
linear algebra.

It's hard to find any scientific benchmarks that directly compare AMD64 and
P4 systems, but here's what I did find:

http://math-atlas.sourceforge.net/timing/

The "% peak" varies depending on bus speed, RAM, matrix size, etc., while
"PEAK" is a theoretical value specific to the CPU.

From the table, we see that

2.8 GHz P4E achieves 77% * 5.6 = 4.3 Gflops

while

2800+ AMD64's cousin (*), the 1.6 GHz Opteron, achieves 88% * 3.2 = 2.8 Gflops

The difference is quite big. It appears 2800+ AMDs are no match for 2.8 GHz
P4s.
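As a sanity check, the arithmetic behind those two figures can be written out in a few lines (Python here purely for illustration; the 2 flops/cycle peak for both chips is the figure quoted later in the thread):

```python
# Achieved Gflops = (flops/cycle * clock in GHz) * (% of peak) / 100.
# Both the P4E and the Opteron retire 2 double-precision flops per cycle.
def achieved_gflops(clock_ghz, pct_peak, flops_per_cycle=2):
    peak = flops_per_cycle * clock_ghz
    return peak * pct_peak / 100.0

print(round(achieved_gflops(2.8, 77), 1))  # P4E:     4.3
print(round(achieved_gflops(1.6, 88), 1))  # Opteron: 2.8
```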

Opinions?



(*) I would prefer to see Athlon 64s, but one has to make do with what data
is available. A 1.6 GHz Opteron was used in the timings, and I looked up its
closest **00+ cousin on
http://en.wikipedia.org/wiki/List_of_AMD_Athlon_64_microprocessors
 
AMD marks its processors with numbers like 4000+ in addition to the
clock rate. I believe the number is supposed to indicate how fast the chip is
relative to Intel's, i.e. a 3200+ AMD chip should be as fast as a
3.2 GHz P4.
True for the mainstream Athlon XP, 64, etc. The Sempron, however, is
compared to the Celeron and is rated under an entirely different benchmark
suite.
I was wondering if this is true for numerics, especially its holy grail,
linear algebra.

It's hard to find any scientific benchmarks that directly compare AMD64
and P4 systems, but here's what I did find:

http://math-atlas.sourceforge.net/timing/

The "% peak" varies depending on bus speed, RAM, matrix size, etc.,
while "PEAK" is a theoretical value specific to the CPU.

From the table, we see that

2.8 GHz P4E achieves 77% * 5.6 = 4.3 Gflops

while

2800+ AMD64's cousin (*), the 1.6 GHz Opteron, achieves 88% * 3.2 = 2.8 Gflops

The difference is quite big. It appears 2800+ AMDs are no match for
2.8 GHz P4s.

Opinions?
I suggest you go back and read these parts more closely.
"The following table gives a rough estimate of ATLAS's asymptotic DGEMM
performance as a percentage of peak for a variety of systems."

You might want to note "rough estimate".

"Note that these numbers reflect asymptotic DGEMM speed only, and having a
high percentage does not necessarily make the machine faster for real
computational tasks."

So basically, everything there is worthless.
 
Wes said:
True for the mainstream Athlon XP, 64, etc. The Sempron, however, is
compared to the Celeron and is rated under an entirely different benchmark
suite.

I suggest you go back and read these parts more closely.
"The following table gives a rough estimate of ATLAS's asymptotic DGEMM
performance as a percentage of peak for a variety of systems."

You might want to note "rough estimate".

"Note that these numbers reflect asymptotic DGEMM speed only, and having a
high percentage does not necessarily make the machine faster for real
computational tasks."

So basically, everything there is worthless.

I had also emailed the primary author of ATLAS and asked him to comment, so
hopefully any misunderstandings will be corrected.

My understanding is that "PEAK", as I wrote above, is a theoretical
maximum (2x the clock speed for Opterons and P4s, 4x the clock speed for
the G5).

However, "% peak" *times* "PEAK" is what was actually
measured /empirically/, so it's no more worthless than any other
benchmark; perhaps less so, because it compares code optimized for each
specific CPU, not generic i386 code.
 
alex goldman said:
It's hard to find any scientific benchmarks that directly compare AMD64 and
P4 systems, but here's what I did find: ....
(*) I would prefer to see Athlon 64s, but one has to make do with what data
is available. A 1.6 GHz Opteron was used in the timings, and I looked up its
closest **00+ cousin on
http://en.wikipedia.org/wiki/List_of_AMD_Athlon_64_microprocessors

I realize it introduces yet one more variable into the mix,
but there are benchmark results for Mathematica on both
P4s and Athlon 64s. The latest version of Mathematica claims
to have specific 64-bit optimizations built in and has
optimized other areas of numerically intensive work. Google can
find this for you.

I don't get paid to say that.
 
Don said:
I realize it introduces yet one more variable into the mix,
but there are benchmark results for Mathematica on both
P4s and Athlon 64s. The latest version of Mathematica claims
to have specific 64-bit optimizations built in and has
optimized other areas of numerically intensive work. Google can
find this for you.


You mean this ?

http://www2.staff.fh-vorarlberg.ac.at/~ku/karl/timings50.html

Someone needs to run multivariate regression to make sense of that data.
 
alex goldman said:
AMD marks its processors with numbers like 4000+ in addition to the
clock rate. I believe the number is supposed to indicate how fast the chip is
relative to Intel's, i.e. a 3200+ AMD chip should be as fast as a
3.2 GHz P4.

I was wondering if this is true for numerics, especially its holy grail,
linear algebra.

It's hard to find any scientific benchmarks that directly compare AMD64 and
P4 systems, but here's what I did find:

http://math-atlas.sourceforge.net/timing/

The "% peak" varies depending on bus speed, RAM, matrix size, etc., while
"PEAK" is a theoretical value specific to the CPU.

From the table, we see that

2.8 GHz P4E achieves 77% * 5.6 = 4.3 Gflops

while

2800+ AMD64's cousin (*), the 1.6 GHz Opteron, achieves 88% * 3.2 = 2.8 Gflops

The difference is quite big. It appears 2800+ AMDs are no match for 2.8 GHz
P4s.

Opinions?



(*) I would prefer to see Athlon 64s, but one has to make do with what data
is available. A 1.6 GHz Opteron was used in the timings, and I looked up its
closest **00+ cousin on
http://en.wikipedia.org/wiki/List_of_AMD_Athlon_64_microprocessors

Hi Alex

I have two HP "Pavilion" PCs at home.
One uses the 3.2 GHz Pentium 4 (540J), and the
other uses an AMD Athlon 64 3200+, which has a 2.0 GHz clock.
I also wrote a C++ linear algebra package, "ppLinear", which does most of the
linear algebra functions. Here are some timing results (milliseconds)
for the two:

                              Pentium 4   Athlon 64
SPFP matrix-matrix multiply      141         125
DPFP matrix-matrix multiply      265         280
SPFP solve Ax=b                  219         125
DPFP solve Ax=b                  438         203

Notes:
SPFP means single-precision floating point
DPFP means double-precision floating point
matrix size: 500 x 500
compiler: VC6
OS: Windows XP

Note that a matrix-matrix multiply of size 500 x 500 takes 2.5*10**8
(2N**3) flops, so for an execution time of 125 ms the system is
running at a two-gigaflop rate. Most linear algebra functions run much
slower than m-m multiply, however.
In general the AMD runs faster, but it varies a lot from function to
function.
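A quick check of the flop-rate arithmetic above (a few lines of plain Python, nothing to do with ppLinear):

```python
# 2N^3 flops for an N x N matrix-matrix multiply (one multiply plus one
# add per inner-product step), divided by the measured time.
N = 500
flops = 2 * N**3
assert flops == 250_000_000          # i.e. 2.5 * 10**8

gflops = flops / 0.125 / 1e9         # 125 ms, the Athlon 64 SPFP time
print(gflops)                        # -> 2.0
```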
Where the AMD really stands out is in compilation. It compiles the
library 30% faster than the Pentium.

regards...Bill Shortall
 
alex goldman said:
The difference is quite big. It appears 2800+ AMDs are no match for 2.8Ghz
P4s.

Opinions?

That's only for DGEM. Run Stream and you'll get other results. Run some
sparse benchmark and you'll get yet other results. You have to pick a
benchmark that has some relation to your application. Otherwise there's
no point.

V.
 
Victor said:
That's only for DGEM.

It's called DGEMM
Run Stream and you'll get other results.

Too generic a name to mean anything to me.
Run some
sparse benchmark and you'll get yet other results. You have to pick a
benchmark that has some relation to your application. Otherwise there's
no point.

Have you read the message you are replying to? I "picked" what I could find.
Incidentally, ATLAS is very much related to my application.
 
Do you want numbers for real numerical codes or linpack-type numbers?

For large matrices which overflow the cache of the microprocessor,
performance is dominated by the speed of accessing main memory and is only
weakly correlated with the processor's clock speed. Of course, this is
not the case for numerical problems small enough to fit in the cache.
You will get very different cost-versus-performance results for these
two cases, which a single performance number cannot represent.
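That cache effect is exactly what blocked (tiled) algorithms such as ATLAS's DGEMM fight: the loops are reordered over tiles so each tile's working set stays cache-resident. A toy sketch of the blocking idea, in Python for brevity (correctness only; pure Python says nothing about actual speed, and this is not ATLAS's code):

```python
def matmul_blocked(A, B, n, bs=64):
    """C = A*B for n x n matrices stored as lists of rows, computed
    in bs x bs tiles so each tile can stay resident in cache."""
    C = [[0.0] * n for _ in range(n)]
    for ii in range(0, n, bs):
        for kk in range(0, n, bs):
            for jj in range(0, n, bs):
                # Work on one (ii, kk, jj) tile at a time.
                for i in range(ii, min(ii + bs, n)):
                    for k in range(kk, min(kk + bs, n)):
                        a_ik, B_k, C_i = A[i][k], B[k], C[i]
                        for j in range(jj, min(jj + bs, n)):
                            C_i[j] += a_ik * B_k[j]
    return C

# Tiny correctness check against a hand-computed product.
A = [[1.0, 2.0], [3.0, 4.0]]
B = [[5.0, 6.0], [7.0, 8.0]]
print(matmul_blocked(A, B, 2, bs=1))  # -> [[19.0, 22.0], [43.0, 50.0]]
```

The payoff of blocking only shows up in compiled code on real hardware, which is why ATLAS tunes the tile size per CPU at build time.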

When comparing the numerical performance of computers (for my purposes),
I have found the NAS Parallel Benchmarks to be very informative, unlike
most other benchmarks such as Linpack. They compare performance over a
range of grid sizes and a range of common solvers using compiled
Fortran and C.

I have also found AMD to be helpful in providing numbers for numerical
benchmarks if they cannot be found by browsing the web.

www.spec.org gives some numbers which are not wholly useless for AMD
and Intel performance but I would not recommend basing any purchasing
decisions on them.
 
andy said:
Yes? The source code is available.

How will it help me choose whether to buy AMD or Intel?

Besides, if I had access to all sorts of systems to benchmark stuff on
myself, why would I want to run someone else's benchmarks instead of my own
programs? Sorry, but that wasn't helpful at all.
 
Run Stream and you'll get other results.
Too generic a name to mean anything to me.

Tut, tut. Ill-informed _and_ foul-mannered?

Just typing "stream" into Google will give you the proper hit at number 5
- that surprised even me. Add "McCalpin" to it and you'll have it in first
place.

Not knowing the Stream benchmark and discussing processor performance is
like discussing music and being deaf - it can be done, but it won't get you
very far.
Have you read the message you are replying to? I "picked" what I could find.
Incidentally, ATLAS is very much related to my application.

So? ATLAS surely contains code suited to both dense and sparse matrices.
They behave quite differently on modern processors. Have you read the message
you are replying to?

Jan
 
Jan said:
Tut, tut. Ill-informed _and_ foul-mannered?

Speaking of yourself, obviously.
Just typing "stream" into Google will give you the proper hit at number 5
- that surprised even me.

This is what I get at #5: http://www.apple.com/quicktime/qtv/wwdc05/

Does not seem to be remotely relevant, since Apple software does not run,
nor plans to run, on AMD. And would I need to use some sort of telepathy to even
know about "google number 5"?

When someone uses "stream" in the context of computer science, I think of

http://mitpress.mit.edu/sicp/full-text/book/book-Z-H-24.html
 
alex said:
Speaking of yourself, obviously.


This is what I get at #5: http://www.apple.com/quicktime/qtv/wwdc05/

Look at Jan's e-mail address. Consider which Google he might use. When I
do that, I do indeed get the Stream benchmark at number 5. Using "my"
Google I get it at number 3 and your QuickTime page at number 4.

Let's just say that both you and Jan are guilty of a little
parochialism, but you are guilty of not using your brain as well.

john
 
I am not sure I am wise to continue this thread but I am having my
morning coffee.

The benchmarks will help you choose between AMD and Intel if they
measure what is representative of what you will do with the machine.
You declined to answer this when I asked earlier. In my case I want to
measure numerical performance for the range of numerically intensive
research codes that would be run on the machine over its lifetime.
These are C/Fortran codes which tend not to call a large proportion of
low level hand coded routines. The NAS Parallel Benchmarks not only
provide this information but allow us to investigate when we see
interesting/anomalous behaviour in one or two figures (almost always).
Yes, it takes an hour or two to gather the information, but it is
informative and reliable in the sense that it correlates pretty well with
what we get on the machines with our range of codes. In my experience,
manufacturers are prepared to supply the figures to guide purchasing
decisions, and many of the engineering/scientific computer vendors run these
tests as part of the commissioning process if you talk to them. They
tend not to be used for promotional purposes because they are a set of
numbers rather than a single number.

Like you, I initially believed the best benchmark was to take one of our
codes and get the manufacturers to run it on their machines. In the 80s
I did this with a "real" code and was assured that the code had been
deleted afterwards. A year later, when we were buying the next machine,
not only did the original manufacturer immediately come up with
performance numbers for this code, but one of their competitors I had
not dealt with in the previous round also produced numbers for it!
After that I supplied cut-down benchmark code containing only routines
whose copyright I was sure of. With hindsight, this was probably too much
effort for too little information. I really wanted to measure the
performance of the range of codes we used at the time and the codes we
would write in the future. The NAS tests are almost certainly a better
measure of this than running one of our existing codes.
 
J.V.Ashby said:
but you are guilty of not using your brain as well.

I think I've made it abundantly clear why I don't feel like running for
google every time I hear "stream". Use *your* brain.
 
andy said:
Like you I initially believed the best benchmark was to take one of our
codes and get the manufacturers to run it on their machines.

That's not the point. There is obviously a communication problem here in the
form of your preferring typing to reading. I'm thinking of which laptop to
buy - I don't have the negotiating power to have the manufacturers run the
benchmarks of my choice on the hardware they are offering. I can only look
up existing benchmarks and results, and draw conclusions from that.
 
alex said:
I think I've made it abundantly clear why I don't feel like running
for google every time I hear "stream". Use *your* brain.

And yet the information was there for you, and easily retrievable. And
as far as I can see from your posts, you still don't have it. Whose
loss?

john
 