Pentium M to become THE CPU

Nathan Bates

Pentium M has all the right ingredients for total world domination:
low power consumption, short pipeline stages, hi-performance.

Pentium M will kill its brother Pentium 4 and its bastard cousin
Athlon.
PowerPC is a Neanderthal that's nearing its end (Jobs figured that
out).
But ARM will survive due to its ultra-low power consumption and
elegance.
 
Nathan Bates said:
Pentium M has all the right ingredients for total world domination:
low power consumption, short pipeline stages, hi-performance.

FSB?

Casper
 
Pentium M has all the right ingredients for total world domination:
low power consumption, short pipeline stages, hi-performance.

Pentium M will kill its brother Pentium 4 and its bastard cousin
Athlon.
PowerPC is a Neanderthal that's nearing its end (Jobs figured that
out).
But ARM will survive due to its ultra-low power consumption and
elegance.

That seems to be a religious issue for you; I think you should
consider visiting a therapist.
 
Nathan said:
Pentium M has all the right ingredients for total world domination:
low power consumption, short pipeline stages, hi-performance.

Pentium M will kill its brother Pentium 4 and its bastard cousin
Athlon.
PowerPC is a Neanderthal that's nearing its end (Jobs figured that
out).
But ARM will survive due to its ultra-low power consumption and
elegance.

Sigh. Another troll to plonk.
 
Picks up at the same point the "Netburst" chips left off, and then pushes
higher...

No matter how far you push a Ford, you'll get a Mercury or at best a
Lincoln out of it, but still nothing even close to a BMW (OK, everyone may
have personal preferences, but MSRP speaks for itself - $50,525
Lincoln Town Car vs. $121,295 BMW 760Li - both top-trim full-size
sedans, data from Edmunds.com)

The whole point is that A64/Opteron has _no_ FSB. No matter how fast
the FSB is, it can't beat on-chip memory controller. And in SMP the
fastest Intel FSB doesn't scale up as well as Opteron's point-to-point
HT links.

NNN
 
Nathan Bates said:
Pentium M has all the right ingredients for total world domination:
low power consumption, short pipeline stages, hi-performance.

Mediocre FP performance, few available motherboards, high price, no
SMP support?

For the price of a high end Pentium M, I can get a dual core AMD where
each core has equivalent integer performance and much better FP. Sure
Pentium M is attractive for some purposes, but total world domination
is still a way off, IMO.

-k
 
The whole point is that A64/Opteron has _no_ FSB. No matter how fast
the FSB is, it can't beat on-chip memory controller. And in SMP the
fastest Intel FSB doesn't scale up as well as Opteron's point-to-point
HT links.

Indeed; the FSB is just about fast enough for one core; it becomes
a bottleneck at two cores.

Casper
 
Casper H.S. Dik said:
Indeed; the FSB is just about fast enough for one core; it becomes
a bottleneck at two cores.

It's a bit more complicated I think:

First, the memory controller, no matter whether integrated or not, is a
bottleneck for any CPU given a sufficiently fast workload. That's
simply because the DIMMs cannot keep up with the CPU.

I would say in practice for a normal desktop machine or a laptop
the limit is how much bandwidth two DIMMs can deliver.

For bandwidth, a sufficiently fast FSB could easily keep up with
these two DIMMs.
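(Back-of-envelope check; the peak figures below are assumptions I'm using
for illustration - an 800 MT/s, 64-bit FSB against two DDR-400 DIMMs in
dual channel:)

/* Back-of-envelope peak bandwidth: illustrative numbers, not quoted specs.
 * Assumes an 800 MT/s, 64-bit FSB and two DDR-400 DIMMs in dual channel. */
#include <stdio.h>

int main(void)
{
    double fsb_gbs  = 800e6 * 8 / 1e9;      /* 800 MT/s * 8 bytes = 6.4 GB/s   */
    double dimm_gbs = 2 * 400e6 * 8 / 1e9;  /* 2 channels of DDR-400 = 6.4 GB/s */

    printf("FSB peak:        %.1f GB/s\n", fsb_gbs);
    printf("2x DDR-400 peak: %.1f GB/s\n", dimm_gbs);
    return 0;
}

Both come out at 6.4 GB/s peak, so on paper a matched FSB keeps up with
two DIMMs.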

Where it mainly loses against the integrated IMC+separate link is when there
is a lot of additional IO traffic too (but that tends to be small
compared to memory traffic except perhaps for 3d).

And in latency it is slower, of course. That is the big win
of the integrated memory controller. Even that can vary though -
e.g. if the FSB has enough bandwidth and the chipset a good memory
controller it could look reasonable again under high load (compared
to idle latency).

For servers with multiple sockets, better IO and typically more DIMMs
that can deliver data in parallel it's a different chapter of course.
First sharing the FSB between multiple sockets is of course a
bottleneck, especially when the FSB isn't fast enough for even a
single dual core. And it also needs to carry additional processor
synchronization traffic. But then there is no rule that the FSB
has to be shared between multiple CPUs.

This only works for relatively small systems of course.

Given enough tweaks (higher frequency, split FSBs for multi socket
systems or even multiple cores on one socket) it might be some time
until the FSB setup runs really out of steam.

-Andi
 
Andi said:
First, the memory controller, no matter whether integrated or not, is a
bottleneck for any CPU given a sufficiently fast workload. That's
simply because the DIMMs cannot keep up with the CPU.

I would say in practice for a normal desktop machine or a laptop
the limit is how much bandwidth two DIMMs can deliver.

90%+ of the time, the problem is NOT bandwidth, but Latency. The on-die
memory controller gets rid of all of the FSB (latency adding) cycles.

In Opteron, for example, the address associated with an L2 miss can
arrive at the memory controller in less than 2ns, and data arriving at
the pins from the DIMMs can arrive back at the CPU in a similar number.

On an FSB system, the L2 miss has to get synchronized to the FSB,
travel over that bus, get registered in the memory controller, get
scheduled, and have the address driven out to the DIMMs. A similar
process occurs on the way back. But the clincher is that the memory
controller is implemented in ASIC technology (think 500 MHz) rather
than CPU technology (think 3 GHz); so every little step of memory
controller processing is correspondingly slower.
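To put rough numbers on that (a sketch only; the clock rates and stage
counts below are assumptions picked to illustrate the argument, not
measurements of any real part):

/* Rough sketch of why every FSB/chipset hop adds latency.
 * Clock rates and stage counts are assumed for illustration only. */
#include <stdio.h>

int main(void)
{
    double cpu_ns  = 1.0 / 3.0;  /* ~3 GHz CPU clock: ~0.33 ns per step    */
    double fsb_ns  = 1.0 / 0.2;  /* ~200 MHz FSB base clock: 5 ns per sync */
    double asic_ns = 1.0 / 0.5;  /* ~500 MHz chipset clock: 2 ns per step  */

    double on_die   = 6 * cpu_ns;               /* a few steps at CPU speed    */
    double off_chip = 2 * fsb_ns + 6 * asic_ns; /* bus sync + chipset pipeline */

    printf("on-die controller overhead : ~%.0f ns (one way)\n", on_die);
    printf("FSB + chipset overhead     : ~%.0f ns (one way)\n", off_chip);
    return 0;
}

A similar penalty applies on the return path, which is how the extra tens
of nanoseconds add up.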
For bandwidth, a sufficiently fast FSB could easily keep up with
these two DIMMs.

Where it mainly loses against the integrated IMC+separate link is when there
is a lot of additional IO traffic too (but that tends to be small
compared to memory traffic except perhaps for 3d).

And in latency it is slower, of course. That is the big win
of the integrated memory controller. Even that can vary though -
e.g. if the FSB has enough bandwidth and the chipset a good memory
controller it could look reasonable again under high load (compared
to idle latency).

If the processor waits at any point because DRAM data has not arrived
and the CPU has nothing left to try to do, then you are in a latency
bound situation and the FSB loses. More bandwidth does not speed up
latency bound problems.
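A minimal sketch of that distinction (my own illustration, assuming the
data set is far larger than the caches): the first loop is a dependent
pointer chase, so misses serialize and extra bandwidth buys nothing; in
the second the loads are independent, so misses overlap and bandwidth is
what matters.

/* Minimal sketch: latency-bound vs. bandwidth-bound access patterns.
 * Assumes the arrays are far larger than the last-level cache. */
#include <stddef.h>

/* Latency bound: each load's address depends on the previous load's
 * result, so cache misses cannot overlap and raw bandwidth doesn't help. */
size_t chase(const size_t *next, size_t start, long iters)
{
    size_t i = start;
    while (iters-- > 0)
        i = next[i];            /* serialized cache misses */
    return i;
}

/* Bandwidth bound: the loads are independent, so the memory system can
 * keep many misses in flight and prefetchers can run ahead. */
long stream_sum(const long *a, size_t n)
{
    long s = 0;
    for (size_t i = 0; i < n; i++)
        s += a[i];              /* misses overlap; prefetch helps */
    return s;
}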

In addition the on-die approach with the HyperTransport fabric
interconnect gives you the property that as you add CPUs, you also add
DRAM bandwidth and bisection bandwidth. A 4 Node Opteron system has ~4
times as much DRAM bandwidth as a 4 node Pentium (single) FSB system
and plenty of chip-to-chip bandwidth to route the data to where it is
needed.

Mitch
 
Indeed; the FSB is just about fast enough for one core;
it becomes a bottleneck at two cores.

I'm afraid you're a victim of a common misconception about the advantages
of a ccNUMA architecture like that of the Opteron. The Opteron's NUMA is
less scalable than it looks, because for every cacheline load from
main memory the Opteron has to broadcast a snoop message to *all* the
processors in the cc-domain (hopefully in parallel to the speculative load
from memory) to check whether one of the processors has a more recent
version of this cacheline! With a shared FSB, every processor simply snoops
the cacheline loads of the other CPUs, and a processor satisfies another
CPU's burst request for a certain cacheline before the chipset satisfies
this load; so for such cacheline transfers a shared bus can even have a
performance advantage over ccNUMA architectures like that of the Opteron.
If you want to avoid these snoop broadcasts, you would have to connect all
CPUs to a central crossbar that holds duplicate tags for every other CPU's
cache; but that's a rather expensive technology.
 
Nathan said:
Pentium M has all the right ingredients for total world domination:
low power consumption, short pipeline stages, hi-performance.

I'm still banking on the 8051 - that thing just won't go down.

Kelly
 
First, the memory controller, no matter whether integrated or not, is a
bottleneck for any CPU given a sufficiently fast workload. That's
simply because the DIMMs cannot keep up with the CPU.

Right, but you should be aware that there are two flavours of this
aspect: bandwidth and latency.
I would say in practice for a normal desktop machine or a
laptop the limit is how much bandwidth two DIMMs can deliver.

The problem is in most cases the latency; not the bandwidth.
And in latency it is slower, of course.
That is the big win of the integrated memory controller.

Yes, that's the main-advantage of an integrated memory-controller.
Even that can vary though - e.g. if the FSB has enough bandwidth and
the chipset a good memory controller it could look reasonable again
under high load (compared to idle latency).

I don't think that a chipset memory-controller can keep up with an
integrated memory-controller in terms of latency.
First sharing the FSB between multiple sockets is of course a
bottleneck, ...

That's not as obvious as one might think:
Given enough tweaks (higher frequency, split FSBs for multi socket
systems or even multiple cores on one socket) it might be some time
until the FSB setup runs really out of steam.

I think that a technology which is common in large ccNUMA multiprocessor
systems will gain importance in PC-SMP-systems in the future: duplicate
tags in the chipset or attached to the chipset.
 
90%+ of the time, the problem is NOT bandwidth, but Latency. The on-die
memory controller gets rid of all of the FSB (latency adding) cycles.

Right! On my simple Athlon-XP 1400+ with a SiS-745 chipset, loading a
cacheline until the data is available in the CPU's register takes about
320 clock-cycles!!!
In Opteron, for example, the address associated with an L2 miss can
arrive at the memory controller in less than 2ns, and data arriving at
the pins from the DIMMs can arrive back at the CPU in a similar number.

Yes, but you have to consider the speculative snoops to other CPUs in
the ccNUMA domain also!
But the clincher is that the memory controller is implemented in ASIC
technology (think 500 MHz) rather than CPU technology (think 3 GHz);

I don't believe that this is the major latency-factor here.
A 4 Node Opteron system has ~4 times as much DRAM bandwidth as a 4
node Pentium (single) FSB system and plenty of chip-to-chip bandwidth
to route the data to where it is needed.

It has about four times the store-bandwidth - but not the load-bandwidth
due to speculative snoops.
 
90%+ of the time, the problem is NOT bandwidth, but Latency. The on-die
memory controller gets rid of all of the FSB (latency adding) cycles.

No argument on that latency is important, and the IMC wins on latency.

In practice a good single-CPU P4 system with 800 MHz FSB and a good
memory controller has about twice the memory latency of an A64
(~90ns vs ~45ns [1]).

I suspect if Intel cranks up the FSB frequency to 1 GHz or more and
possibly increases the frequency of their chipsets they can get that
down. So with some improvements they might get the latency down a bit
more (let's say only a 30-40% penalty, which they might make up with
other tricks like more cache) and have comparable or better bandwidth
(if they as usual surpass AMD in faster DRAM support), so it doesn't
look too bad for the time being, at least for a single socket/dual core
system.

[1] Actually with lmbench a newer Intel dual core system reports a lower
memory latency to me than an A64 does, but I suspect their prefetch
algorithms became so good they broke lmbench ;-)
In Opteron, for example, the address associated with an L2 miss can
arrive at the memory controller in less than 2ns, and data arriving at
the pins from the DIMMs can arrive back at the CPU in a similar number.

.... if you don't have to wait for the cache probe responses of
the other CPUs.
In addition the on-die approach with the HyperTransport fabric
interconnect gives you the property that as you add CPUs, you also add
DRAM bandwidth and bisection bandwidth. A 4 Node Opteron system has ~4
times as much DRAM bandwidth as a 4 node Pentium (single) FSB system
and plenty of chip-to-chip bandwidth to route the data to where it is
needed.

Yes, for multi socket systems the Opteron NUMA setup is clearly a winner
right now.

-Andi (partly playing devil's advocate here)
 
You forgot to consider a major latency-factor: the cacheline-size. The
P4 has a stupid cacheline-size of 128 bytes (16 times the bus-width!)
in the L2- and L3-caches, whereas the P3, the Pentium-M and all Athlons
have a more reasonable cacheline-size of 64 bytes on all cache-levels.
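For a rough idea of what the longer line costs on the bus transfer alone
(a sketch assuming a 64-bit data path at 800 MT/s, ignoring DRAM timing
and critical-word-first delivery):

/* Fill time for one cacheline over a 64-bit, 800 MT/s bus.
 * Purely illustrative; ignores DRAM timing and critical-word-first. */
#include <stdio.h>

int main(void)
{
    double beat_ns = 1e9 / 800e6;   /* 1.25 ns per 8-byte transfer */

    printf(" 64-byte line: %.0f beats, %.1f ns\n",  64.0 / 8,  64.0 / 8 * beat_ns);
    printf("128-byte line: %.0f beats, %.1f ns\n", 128.0 / 8, 128.0 / 8 * beat_ns);
    return 0;
}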
 
Jens Meyer said:
You forgot to consider a major latency-factor: the cacheline-size. The
P4 has a stupid cacheline-size of 128 bytes (16 times the bus-width!)
in the L2- and L3-caches, whereas the P3, the Pentium-M and all Athlons
have a more reasonable cacheline-size of 64 bytes on all cache-levels.

A modern bus should do critical word first, so I wouldn't expect
this to be a large disadvantage (given they have enough bandwidth,
which they will probably have on a single socket system)

-Andi
 
No argument on that latency is important, and the IMC wins on latency.
In practice a good single-CPU P4 system with 800 MHz FSB and a good
memory controller has about twice the memory latency of an A64
(~90ns vs ~45ns [1]).

And that's not mainly because of the FSB, but the double cache-line
length of the P4!
[1] Actually with lmbench a newer Intel dual core system reports a
lower memory latency to me than an A64 does, but I suspect their prefetch
algorithms became so good they broke lmbench ;-)

How should a prefetching-algorithm break a memory-latency benchmark?
When we have the strongest latency-demands, i.e. when doing pointer-chasing
and thereby disabling out-of-order execution, even hardware-scouting
won't help!
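For what it's worth, a pointer-chasing measurement in the spirit of
lmbench's lat_mem_rd could look roughly like this (my own sketch; the
array size is a placeholder). Building the chain from a random cycle is
exactly what keeps a stride prefetcher from guessing the next line; with
a sequential chain the prefetcher can run ahead and the measured
"latency" collapses.

/* Sketch of an lmbench-style pointer chase: a chain built from a random
 * permutation so a stride prefetcher cannot predict the next cacheline.
 * The array size is a placeholder; make it far larger than the caches. */
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

#define N (1 << 24)   /* 16M pointers (~128 MB on 64-bit) */

int main(void)
{
    size_t *next = malloc(N * sizeof *next);
    if (!next) return 1;

    /* Build one random cycle over all N slots (Sattolo's algorithm),
     * so the chase visits every slot in an unpredictable order. */
    for (size_t i = 0; i < N; i++) next[i] = i;
    for (size_t i = N - 1; i > 0; i--) {
        size_t j = rand() % i;               /* j < i keeps it a single cycle */
        size_t t = next[i]; next[i] = next[j]; next[j] = t;
    }

    /* Chase the chain: each load depends on the previous one, so the
     * time per iteration is dominated by memory latency. */
    clock_t t0 = clock();
    size_t p = 0;
    for (long k = 0; k < N; k++) p = next[p];
    clock_t t1 = clock();

    printf("~%.1f ns per dependent load (ignore: %zu)\n",
           (t1 - t0) * 1e9 / CLOCKS_PER_SEC / N, p);

    free(next);
    return 0;
}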
 
Jens Meyer said:
You forgot to consider a major latency-factor: the cacheline-size. The
P4 has a stupid cacheline-size of 128 bytes (16 times the bus-width!)
in the L2- and L3-caches, whereas the P3, the Pentium-M and all Athlons
have a more reasonable cacheline-size of 64 bytes on all cache-levels.

An interesting claim.

Some, if not most, variants of recent Pentium 4 (smithfield ?)
have an L2 cache with a 64-byte line size.

That said, I wouldn't call a 128-byte-line L2 "stupid".
It depends on the workload, the cache size and memory bandwidth, among other things.
With 1MB or 2MB of L2, I think there are enough L2 lines (2M/128 = 16k),
although it would be a problem if the L2 were small (e.g. a 128K L2
would mean a measly 1k L2 lines if the line size is 128 bytes).
 

I'm still banking on the 8051 - that thing just won't go down.

8051? The 6502 pwnz j00. :-)

(One of these days, I'll build a controller for my beer fridges so I can
free up the Apple IIs that are currently running them (a IIGS on one and a
IIe on the other). To simplify the software-porting effort, it'll most
likely be built around a 6502, or something compatible with it. It's not
like monitoring the temperature and switching the compressor on and off
requires dual Opterons or something insane like that.)

_/_
/ v \ Scott Alfter (remove the obvious to send mail)
(IIGS( http://alfter.us/ Top-posting!
\_^_/ rm -rf /bin/laden >What's the most annoying thing on Usenet?

 