Pretty good explanation of x86-64 by HP

  • Thread starter: Yousuf Khan
Note that the STREAM bandwidth and lmbench latency change with every
CPU speed bump. So clearly part of the memory controller runs at the
CPU core frequency, or a related frequency, and not at the HT frequency
or the SDRAM external bus frequency.
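
For reference, the STREAM bandwidth figure comes from timing simple
loops like the "triad" kernel sketched below. This is only a minimal
illustration of what STREAM measures, not the official benchmark
(array size, timing, and reporting are simplified):

#include <stdio.h>
#include <stdlib.h>
#include <time.h>

#define N (1 << 22)   /* 4M doubles per array: large enough to miss cache */

int main(void)
{
    double *a = malloc(N * sizeof *a);
    double *b = malloc(N * sizeof *b);
    double *c = malloc(N * sizeof *c);
    if (!a || !b || !c)
        return 1;

    for (long i = 0; i < N; i++) { b[i] = 1.0; c[i] = 2.0; }

    clock_t t0 = clock();
    for (long i = 0; i < N; i++)      /* STREAM "triad": a = b + s*c */
        a[i] = b[i] + 3.0 * c[i];
    clock_t t1 = clock();

    double secs  = (double)(t1 - t0) / CLOCKS_PER_SEC;
    double bytes = 3.0 * N * sizeof(double);  /* 2 loads + 1 store per element */
    printf("triad: %.1f MB/s\n", bytes / secs / 1e6);

    free(a); free(b); free(c);
    return 0;
}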

That does *not* mean that the memory controller runs at the core speed.
It would be nuts to assume such. Would you assume the caches of the
PII run at the I/O bus speed?

"or a related frequency", i.e. based on the cpu frequency with a
constant divider.
Isn't his a rather egotistical statement?

No, it follows Usenet tradition: post only to groups that you read.

But thanks for giving me the benefit of the doubt.

-- greg
 
Another example would be making sure that people understand that when
Opteron goes dual core, unless you double the memory bandwidth
available, you effectively cut the bandwidth per core in half. This will
impact some workloads quite dramatically. Has AMD made public statements
about supporting higher local bandwidth for the dual core chip?

No public statements that I know of, but there are rumors that the
90nm Opterons, due Real Soon Now, will support DDR2 in addition to
plain old DDR. See e.g.

http://www.xbitlabs.com/news/cpu/display/20040212022200.html

By the time dual core Opterons arrive, I suspect that DDR2-800 will
also be available, thus providing twice the memory BW compared to the
current single core offerings using DDR-400.
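
Back-of-the-envelope, using theoretical peak numbers for dual-channel
memory (DDR-400 at 3.2 GB/s per channel, DDR2-800 at 6.4 GB/s per
channel; sustained figures will of course be lower):

#include <stdio.h>

int main(void)
{
    double ddr400   = 2 * 3.2;   /* dual-channel DDR-400 (PC3200): 6.4 GB/s    */
    double ddr2_800 = 2 * 6.4;   /* dual-channel DDR2-800 (PC2-6400): 12.8 GB/s */

    printf("single core, DDR-400 : %.1f GB/s per core\n", ddr400 / 1);
    printf("dual core,   DDR-400 : %.1f GB/s per core\n", ddr400 / 2);
    printf("dual core,   DDR2-800: %.1f GB/s per core\n", ddr2_800 / 2);
    return 0;
}

So going dual core on DDR-400 halves the per-core number, and DDR2-800
puts it right back where it started.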
 
What braindamaged newsreader are you using that won't let you right
click the link in the newsreader?

Clicking on the link in the newsreader, supposing I could do that, would
simply cause the link to open in a browser window. Which is exactly what
I achieved by cutting and pasting.

Maybe some newsreaders do allow right-clicking links. Such newsreaders
would probably also do dangerous and reckless things like rendering HTML
posts instead of displaying them in all their <angle bracket> glory.

This could result in having a brain-damaged computer, were I to view the
wrong post by accident.

As the posting in question was a text posting, this means that the
newsreader would have to guess at what constituted an URL, as well, with
no doubt occasional hilarious results.
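
For what it's worth, the guessing usually amounts to a pattern match of
roughly this shape; a naive sketch with POSIX regex (the text and
pattern are made up), which demonstrates one of the hilarious results
by swallowing the trailing period:

#include <stdio.h>
#include <regex.h>

int main(void)
{
    /* naive heuristic: a scheme followed by any run of non-whitespace */
    const char *pattern = "https?://[^[:space:]]+";
    const char *text = "See http://example.com/page. Isn't that nice?";

    regex_t re;
    regmatch_t m;
    if (regcomp(&re, pattern, REG_EXTENDED) != 0)
        return 1;

    if (regexec(&re, text, 1, &m, 0) == 0)
        /* prints "http://example.com/page." - trailing dot and all */
        printf("guessed URL: %.*s\n", (int)(m.rm_eo - m.rm_so),
               text + m.rm_so);

    regfree(&re);
    return 0;
}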

John Savard
http://home.ecn.ab.ca/~jsavard/index.html
 
And the conclusion was that a multi-CPU Opteron system must then be
UMA, rather than that the NUMA "optimizations" were crap?

There is a cost to treating memory as NUMA. The benefit you get in
exchange for that cost depends on how NU the MA is. The point is that
the MA on an Opteron system with 2 to 8 processors is so close to U
that, in most cases, it can be treated as effectively U.

The scaling advantage comes largely from the architecture of a single
processor. The memory controller is on the chip. The main reason this
matters is that it means that local memory accesses don't have to contend
with any other inter-CPU or I/O traffic. The other advantage comes from the
number of HT interfaces. Corresponding Intel CPUs have only a single FSB
over which all traffic must flow.

Above 8 processors, things get much more complicated. But it doesn't
seem like there's much of a (mainstream commercial) market at that scaling
level yet.
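
One way to see just how NU the MA actually is: allocate memory on a
specific node with libnuma and pointer-chase through it from a CPU
pinned to node 0. A minimal sketch (link with -lnuma; the two-node
layout and the chain stride are assumptions, and a serious test would
randomize the chain):

#include <stdio.h>
#include <time.h>
#include <numa.h>                /* libnuma */

#define N (1 << 22)              /* chain of 4M pointers (32 MB) */

/* average latency of a dependent-load chain through memory on `node` */
static double chase(int node)
{
    size_t *p = numa_alloc_onnode(N * sizeof *p, node);
    for (size_t i = 0; i < N; i++)
        p[i] = (i + 4099) % N;   /* fixed large stride to defeat caching */

    struct timespec t0, t1;
    size_t idx = 0;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (long i = 0; i < N; i++)
        idx = p[idx];            /* each load depends on the previous one */
    clock_gettime(CLOCK_MONOTONIC, &t1);

    numa_free(p, N * sizeof *p);
    double ns = (t1.tv_sec - t0.tv_sec) * 1e9 + (t1.tv_nsec - t0.tv_nsec);
    return ns / N + (idx & 0);   /* "+ (idx & 0)" just keeps idx live */
}

int main(void)
{
    if (numa_available() < 0) {
        fprintf(stderr, "no NUMA support\n");
        return 1;
    }
    numa_run_on_node(0);         /* pin ourselves to node 0 */
    printf("local  (node 0): %.1f ns/load\n", chase(0));
    printf("remote (node 1): %.1f ns/load\n", chase(1));
    return 0;
}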

DS
 
And sometimes 50%...

Sure, there will be extreme cases in everything.
I admit I'm from the HPC sector, and memory bandwidth is very important
to many applications here.

One thing that you need to keep in mind is that you represent a VERY
small minority here in terms of PC server sales. Just because it
matters to your application probably doesn't have much relevance to
the bulk of the buying public, and it almost certainly isn't going to
have implications for what the marketing people write in the trade
rags.

It's a pretty strange argument in my eyes, "If you ignore the
applications that run poorly because of property X, then it makes
sense to downplay property X." True, but not helpful if you have such
an application.

Ahh, but it's VERY helpful if you're in the marketing department! :>

In the end, the people that are going to take a performance hit due to
lack of NUMA optimizations probably already know as much and have
factored it into their buying decisions. The people who are talking
to Dell or HPaq's server sales and are thinking about an Opteron
system but are worried that this here NoooMah thingy might cause their
application to run slow most likely don't have to worry about much.
Hence SUMO.

It's all a matter of perspective.
 
Sure, there will be extreme cases in everything.


One thing that you need to keep in mind is that you represent a VERY
small minority here in terms of PC server sales. Just because it
matters to your application probably doesn't have much relevance to
the bulk of the buying public, and it almost certainly isn't going to
have implications for what the marketing people write in the trade
rags.

I think you're underestimating the size of the "workstation" market, which
will include people finding they can migrate down to PC-grade CPUs to
replace old "higher power" systems as well as people on the lower-end
fringe who may have grown their problem complexity beyond a uni-PC, or who
*could* get by with a fastish PC but like the comfort of the move up to
dual for future growth. Add them to the current established base of CAD,
engineering and modeling etc. applications and there is a decent sized
market.

There are a lot of mathematical/engineering problems out there which are
just part of everyday business computing - many *used* to be considered HPC
and are now quite routine on desktop sized boxes. In many cases,
proprietary (purchased) software is used and the algorithmic methods are
only understood fairly superficially by the user; what that user wants is
response, whether it's measured in minutes, hours or a day or more. The
software vendor thus feels responsible for supplying the best combination
of software and recommended hardware selection.

Rgds, George Macdonald

"Just because they're paranoid doesn't mean you're not psychotic" - Who, me??
 
That does *not* mean that the memory controller runs at the core speed.

"or a related frequency", i.e. based on the cpu frequency with a
constant divider.

Ok, how many "unrelated frequencies" are there in a CPU? Let's get real
here.

No, it follows Usenet tradition: post only to groups that you read.

No, that is *not* Usenet tradition. The tradition is to limit
cross-postings to on-topic newsgroups. Cross-posting is not expensive
(unless you have a brain-damaged newsreader).

But thanks for giving me the benefit of the doubt.

Cutting off your audience, particularly those who *you* have responded to,
is rude. Sorry if I've ruffled your feathers!
 
David said:
The scaling advantage comes largely from the architecture of a single
processor. The memory controller is on the chip. The main reason this
matters is that it means that local memory accesses don't have to contend
with any other inter-CPU or I/O traffic.

That's only partly true. The Opterons still talk to each other even on local
accesses (coherency tokens only, no real data transfer). This both takes
time and adds to the traffic, since such a token needs to get everywhere.

What's missing here is an "exclusive" bit in the page table, for non-coherent
pages. The OS pretty well knows (or can know) which core is accessing a
page, and for a page that's not shared, the coherency token is not
necessary.
 
In comp.arch David Schwartz said:
In typical Opteron setups (2-8 CPUs, using the Opteron's built-in
SMP hardware), the latency difference between local and remote
memory accesses is so small that the benefits of treating it as NUMA
are typically outweighed by the costs.

SPECweb99_SSL is probably atypical then (Yes, one of my favorite
benchmarks :) - the evolution of the tunes for Opteron systems on that
benchmark show the size of the Zeus tunable "cache_small_file"
increasing to 90000 bytes. That brings many more of the URLs into the
"malloc" cache of Zeus where they are replicated per Zeus instance and
in this case then per-CPU (things being bound to CPUs) "Normal"
practice is to have cache_small_file be "NBPG"/numCPU to optimize the
memory comsumption.

It all depends, of course :) Maybe that wasn't done for latency but to
cut down the bandwidth consumed. Who knows - although I am interested
in trying to find out :)
Generally, you just distribute the memory evenly, interleaved across
the nodes (if you can), to avoid overloading one memory controller
channel.

FWIW, I've noticed that Node interleave is (or seems to be, it was set
that way on the first one I saw and had no indication from the source
that it had been altered) disabled by default on the Sun V20z's.
Anyone have data on how Node interleave defaults on other
Opteron-based systems?

rick jones
 
Rick Jones said:
FWIW, I've noticed that Node interleave is (or seems to be, it was set
that way on the first one I saw and had no indication from the source
that it had been altered) disabled by default on the Sun V20z's.
Anyone have data on how Node interleave defaults on other
Opteron-based systems?

It defaults to "off" on Penguin systems, too.

scott
 
Rick Jones said:
FWIW, I've noticed that Node interleave is (or seems to be, it was set
that way on the first one I saw and had no indication from the source
that it had been altered) disabled by default on the Sun V20z's.
Anyone have data on how Node interleave defaults on other
Opteron-based systems?

As far as I know it's disabled by default on most shipping Opteron
servers. Only a few build-it-yourself dual motherboards have it
enabled by default.

For Linux use I would recommend always disabling it. The modern
kernel can do page interleaving on demand (with numactl or libnuma),
which is nearly as good, and most programs seem to just prefer
good memory latency.
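
The on-demand interleaving can be requested for a whole process from
the command line (numactl --interleave=all ./prog) or per-allocation
with libnuma; a minimal sketch of the latter (link with -lnuma; the
sizes are made up):

#include <stdio.h>
#include <numa.h>                /* libnuma */

int main(void)
{
    if (numa_available() < 0) {
        fprintf(stderr, "no NUMA support\n");
        return 1;
    }

    /* big shared table: stripe it across nodes so no one memory
       controller takes all the traffic */
    size_t big = 1UL << 30;
    double *table = numa_alloc_interleaved(big);

    /* small latency-sensitive data: keep it on the local node */
    double *priv = numa_alloc_local(1UL << 20);

    /* ... use table and priv ... */

    numa_free(priv, 1UL << 20);
    numa_free(table, big);
    return 0;
}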

-Andi
 
              before   after    gain
benchmark 1    3.71     3.03   +22 %
benchmark 2    3.76     3.29   +14 %
benchmark 3    3.78     3.26   +16 %
benchmark 4    3.79     3.45   +10 %
benchmark 5    3.92     3.89   + 1 %
benchmark 6    3.88     3.71   + 5 %

These benchmarks were run with the best Opteron compiler, so this
scaling improvement was very good to see. And it's bigger than
"usually less than 10%".

Averages out to 11 %.
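
Reading the two columns as before/after times (the post doesn't label
them, so that's my assumption), the per-benchmark gains and the 11 %
average fall out directly:

#include <stdio.h>

int main(void)
{
    double before[] = { 3.71, 3.76, 3.78, 3.79, 3.92, 3.88 };
    double after[]  = { 3.03, 3.29, 3.26, 3.45, 3.89, 3.71 };
    double sum = 0;

    for (int i = 0; i < 6; i++) {
        double gain = (before[i] / after[i] - 1.0) * 100.0;
        printf("benchmark %d: +%2.0f %%\n", i + 1, gain);
        sum += gain;
    }
    printf("average: +%.0f %%\n", sum / 6);   /* prints 11 */
    return 0;
}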

Sounds like "usually less than 10%" may be right when talking about non-scientific workloads.
 
As the posting in question was a text posting, this means that the
newsreader would have to guess at what constituted an URL, as well, with
no doubt occasional hilarious results.

Sorry, you don't make sense.
You really should get a decent newsreader.
 
Sorry, you don't make sense.
You really should get a decent newsreader.

Hmmm, I always thought Agent was fairly good. Perhaps yours can't show
headers? ...oh, another emacs bigot.
 
keith said:
Hmmm, I always thought Agent was fairly good. Perhaps yours can't show
headers?

I used Agent for some years until its limitations became irritating.
...oh, another emacs bigot.

It is a matter of using the right tool for the job.
Emacs' mail/news subsystem, Gnus, is superb.
 
Hmmm, I always thought Agent was fairly good. Perhaps yours can't show
headers? ...oh, another emacs bigot.

Well jsavard is using an *old* version of Free Agent but even the 1.93 I'm
using doesn't have a right click and "Save Link Target As.." I dunno what
the big deal is on either side here - copy/paste of a URL is always coming
up as a nuisance for file downloads, especially with the Adobe reader 6.0
being so damned slow to get started - the plugin has to load its err,
plugins to get started and then you also have to have it configured to turn
off "fast web view" to get the whole document without paging through the
bugger... all a royal PITA.

Rgds, George Macdonald

"Just because they're paranoid doesn't mean you're not psychotic" - Who, me??
 
snip
Well jsavard is using an *old* version of Free Agent but even the 1.93 I'm
using doesn't have a right click and "Save Link Target As.." I dunno what
the big deal is on either side here - copy/paste of a URL is always coming
up as a nuisance for file downloads, especially with the Adobe reader 6.0
being so damned slow to get started - the plugin has to load its err,
plugins to get started and then you also have to have it configured to turn
off "fast web view" to get the whole document without paging through the
bugger... all a royal PITA.

This is somewhat off topic, but there is a simple fix for the "plug-in"
problem. Check out the adobe reader speedup at

http://www.tnk-bootblock.co.uk/prods/misc/index.php

It takes a few seconds to run and makes a noticeable difference in the load
times from then on.
 
George said:
Well jsavard is using an *old* version of Free Agent but even the 1.93 I'm
using doesn't have a right click and "Save Link Target As.." I dunno what
the big deal is on either side here - copy/paste of a URL is always coming
up as a nuisance for file downloads, especially with the Adobe reader 6.0
being so damned slow to get started - the plugin has to load its err,
plugins to get started and then you also have to have it configured to turn
off "fast web view" to get the whole document without paging through the
bugger... all a royal PITA.

I've switched over to Thunderbird now for all mail and news (except
binary news). Agent is still the one to use for binary news. Outlook
Express is still the one to use to gather dust. :-)

Yousuf Khan
 
snip


This is somewhat off topic, but there is a simple fix for the "plug-in"
problem. Check out the adobe reader speedup at

http://www.tnk-bootblock.co.uk/prods/misc/index.php

It takes a few seconds to run and makes a noticeable difference in the load
times from then on.

Thanks - sad that we need this stuff but.........

Rgds, George Macdonald

"Just because they're paranoid doesn't mean you're not psychotic" - Who, me??
 