I think in the *real* world, that's less important. For single-threaded
benchmarks, it is an advantage. But sharing your cache is a huge
advantage for things like SPECjbb, TPC-C, or other real-world workloads
where you have data sharing between processors.
Depends on whose "real world" you're talking about. In PC applications,
the ability to share data is mainly useless; single-threaded performance
still rules the day. In servers, sure, shared data between threads is
useful.
If you look at the time it takes to acquire a lock from another
processor across the FSB (or even HT) versus from a shared cache, the
shared cache is an order of magnitude or more quicker.
In the case of HT, it's not so much acquiring the lock that takes time
as it is copying the data from the remote processor's cache over the
links into your own local processor's cache.
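
Here's a rough sketch of how you could measure that hand-off cost
yourself: two threads bounce a shared "turn" word back and forth, each
pinned to a core. This is my own toy benchmark, nothing official; it
assumes Linux + GCC, and CORE_A/CORE_B are placeholders. Run it once
with a pair of cores that shares a cache and once with a pair that
doesn't, and compare the numbers.

/* pingpong.c -- toy lock hand-off benchmark (my own sketch).
 * Assumes Linux + GCC:  gcc -O2 -pthread pingpong.c -o pingpong
 * CORE_A/CORE_B are hypothetical; substitute real core numbers.
 */
#define _GNU_SOURCE
#include <pthread.h>
#include <sched.h>
#include <stdio.h>
#include <time.h>

#define CORE_A 0
#define CORE_B 1
#define ITERS  1000000

/* The shared word both cores fight over; volatile is enough for this
 * x86-style sketch since we only need the loads/stores to happen. */
static volatile int turn;

struct arg { int core; int me; };

static void *worker(void *p)
{
    struct arg *a = p;
    cpu_set_t set;

    CPU_ZERO(&set);
    CPU_SET(a->core, &set);
    pthread_setaffinity_np(pthread_self(), sizeof set, &set);

    for (int i = 0; i < ITERS; i++) {
        while (turn != a->me)
            ;                   /* spin until the other core hands over */
        turn = 1 - a->me;       /* bounce the cache line back */
    }
    return NULL;
}

int main(void)
{
    struct arg a0 = { CORE_A, 0 }, a1 = { CORE_B, 1 };
    pthread_t t0, t1;
    struct timespec ts0, ts1;

    clock_gettime(CLOCK_MONOTONIC, &ts0);
    pthread_create(&t0, NULL, worker, &a0);
    pthread_create(&t1, NULL, worker, &a1);
    pthread_join(t0, NULL);
    pthread_join(t1, NULL);
    clock_gettime(CLOCK_MONOTONIC, &ts1);

    double ns = (ts1.tv_sec - ts0.tv_sec) * 1e9
              + (ts1.tv_nsec - ts0.tv_nsec);
    printf("%.1f ns per hand-off\n", ns / (2.0 * ITERS));
    return 0;
}

On a shared-cache pair the line never leaves the die; across the FSB
or an HT link you pay the full cache-to-cache transfer on every single
hand-off.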
Not really. There are quite a few people who think otherwise...
Different work definitions.
Sure it will. It will decrease HT/memory accesses. That's a big win.
Which is what I think I had said just below that:
I fail to see the distinction between the two situations (C2D and
Barcelona).
Barcelona's shared L3 will be used mainly for sharing data between
cores, not for "single-threaded overdrive". Since the C2D has a shared
L2 that each core accesses directly, any single core can take over
larger portions of the L2 as the need arises, increasing single-thread
performance, overdriving it if you will.
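
You can actually watch that overdrive effect with a single-threaded
pointer chase: time loads over working sets of increasing size and
look for the latency steps. A minimal sketch of my own, assuming
64-byte cache lines and POSIX clock_gettime(); on a C2D I'd expect a
single thread to keep L2-class latency right up to the full shared L2
size:

/* sweep.c -- crude working-set latency sweep (my own sketch).
 * Assumes 64-byte lines and POSIX clock_gettime():
 *   gcc -O2 sweep.c -o sweep
 */
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

#define LINE 64

/* One pointer per cache line, padded out to a full line. */
struct node { struct node *next; char pad[LINE - sizeof(void *)]; };

static volatile struct node *sink;  /* keeps the chase from being optimized out */

static double chase_ns(size_t bytes)
{
    size_t n = bytes / sizeof(struct node), i, j;
    long iters = 20 * 1000 * 1000, k;
    struct node *buf = malloc(n * sizeof *buf);
    size_t *order = malloc(n * sizeof *order);
    struct timespec t0, t1;

    /* Shuffle the visit order so the hardware prefetcher can't guess
     * the next line, then link the nodes into one big cycle. */
    for (i = 0; i < n; i++) order[i] = i;
    for (i = n - 1; i > 0; i--) {
        j = rand() % (i + 1);
        size_t t = order[i]; order[i] = order[j]; order[j] = t;
    }
    for (i = 0; i < n; i++)
        buf[order[i]].next = &buf[order[(i + 1) % n]];

    struct node *p = buf;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (k = 0; k < iters; k++)
        p = p->next;              /* each load depends on the last */
    clock_gettime(CLOCK_MONOTONIC, &t1);

    sink = p;
    free(buf); free(order);
    return ((t1.tv_sec - t0.tv_sec) * 1e9
          + (t1.tv_nsec - t0.tv_nsec)) / iters;
}

int main(void)
{
    /* Sweep from well inside L1 out past any on-die cache. */
    for (size_t kb = 16; kb <= 16 * 1024; kb *= 2)
        printf("%6zu KB: %5.1f ns/load\n", kb, chase_ns(kb * 1024));
    return 0;
}

The random visit order is only there to keep the hardware prefetcher
from hiding the misses.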
The cores in Barcelona will not be reading anything directly from the
L3; everything is first read from L3 into L1. In the AMD64
architecture, the cores only ever read directly from their own L1 or
L2. So in Barcelona a core can't simply allocate a large portion of the
L3 when it needs to increase its single-thread performance. Granted,
something like that effect can occur, but only in slow domino-effect
stages, as data overflows each lower-level cache and gets ejected into
the next higher-level cache. Eventually, after enough overflow, a
really busy AMD64 core might be able to take over the entire L3 for
itself, just as a really busy C2D core could take over the full L2 for
itself.
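
To illustrate what I mean by domino-effect stages, here's a toy model
of an exclusive victim hierarchy. The capacities are made-up and tiny,
eviction is FIFO instead of real LRU, and the three-level chain is
hard-coded, so don't read actual Barcelona behaviour into it; it just
shows one busy core's overflow trickling down into the L3 as its
working set grows:

/* victim.c -- toy model of an exclusive ("victim") cache hierarchy.
 *   gcc -O2 victim.c -o victim
 */
#include <stdio.h>
#include <string.h>

#define C1 4            /* hypothetical L1 capacity, in lines */
#define C2 16           /* hypothetical L2 capacity */
#define C3 64           /* hypothetical L3 capacity */

struct level { int line[C3]; int n, cap; };

static struct level L1 = { .cap = C1 };
static struct level L2 = { .cap = C2 };
static struct level L3 = { .cap = C3 };

static int find(struct level *c, int tag)
{
    for (int i = 0; i < c->n; i++)
        if (c->line[i] == tag) return i;
    return -1;
}

static void take(struct level *c, int i)   /* pull a line out of a level */
{
    c->line[i] = c->line[--c->n];
}

/* Insert a line; on overflow the oldest line "dominoes" into the next
 * level down (NULL meaning it falls out to main memory). */
static void insert(struct level *c, int tag, struct level *next)
{
    if (c->n == c->cap) {
        int victim = c->line[0];
        memmove(c->line, c->line + 1, --c->n * sizeof(int));
        if (next)
            insert(next, victim, next == &L2 ? &L3 : NULL);
    }
    c->line[c->n++] = tag;
}

static void access_line(int tag)
{
    int i;
    if ((i = find(&L1, tag)) >= 0) return;              /* L1 hit */
    if ((i = find(&L2, tag)) >= 0) take(&L2, i);        /* promote from L2 */
    else if ((i = find(&L3, tag)) >= 0) take(&L3, i);   /* promote from L3 */
    insert(&L1, tag, &L2);      /* everything enters via L1; victims cascade */
}

int main(void)
{
    /* One busy "core" whose working set grows by 20 lines per pass. */
    for (int pass = 1; pass <= 4; pass++) {
        for (int tag = 0; tag < pass * 20; tag++)
            access_line(tag);
        printf("pass %d: L1=%2d  L2=%2d  L3=%2d lines\n",
               pass, L1.n, L2.n, L3.n);
    }
    return 0;
}

The printout shows L3 occupancy creeping up pass by pass: no core ever
allocates L3 directly, it only ends up owning it through accumulated
evictions.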
In each case, the last level of on-die cache is shared between
multiple processors, providing a performance advantage (usually).
I don't think there is a direct linkage in AMD64 between the cores and
the L3 cache. The cores only directly control the L1 and L2; the L3
might be an automatic catch basin for data that has overflowed those
first two caches. I could be wrong on this, and there might be a direct
read pipe from the L3 to each core. In current AMD64, there is a
similar mechanism with the onboard memory controller. Data that has to
be fetched from system RAM is never read directly from the memory
controller into the core; instead it is read into the memory
controller, which then passes it on to the L1, which then passes it to
the core. The core may also occasionally read directly out of the L2,
if data is not found in the L1.
Yousuf Khan