Felger Carbon
There are some very astute techies who follow this NG and who
regularly contribute to it. There are also a lot of folk who mostly
lurk and, hopefully, learn. This writeup is intended for that latter
group.
I love show & tell. Let's make a computer core: we'll need some
checkers (round), dominoes (rectangular) and a large kiddie block (a
cube).
First we place the cube on a tabletop. The whole tabletop is our die
(the chip). Now we start at the middle of one face of the cube and
build a fence, bending it around in a loop about a foot in diameter
that returns to the opposite face of the cube. This produces two
tabletop spaces - inside the fence, and outside.
The cube is our computer control center. Outside the fence, but on
the tabletop, is all the memory and I/O including the caches. Inside
the fence is our logic core. The control center exists in both
worlds, and connects them.
OK, let's take about 16 checkers and stack them up. This is our
computer execution pipeline, which does all the work. Looks like a
smokestack, doesn't it? Put it in the middle of the fenced area. Now
we need something to hold the machine state: a register set, program
and stack pointers etc. We'll represent this with a domino, which
we'll place inside the fence alongside the execution pipeline. Voila!
A CPU core!
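For those who'd rather read code than push game pieces around, here is a
minimal sketch of the same tabletop model. All the names (MachineState,
Core, pipeline_stages) and the register count of 8 are my own
assumptions for illustration; they are not taken from any real chip's
documentation.

```python
from dataclasses import dataclass, field


@dataclass
class MachineState:
    """One 'domino': the state belonging to a single thread."""
    program_counter: int = 0
    stack_pointer: int = 0
    # Register count is an arbitrary assumption for the sketch.
    registers: list = field(default_factory=lambda: [0] * 8)


@dataclass
class Core:
    """Everything inside the fence: one pipeline plus its machine state."""
    pipeline_stages: int = 16  # the stack of about 16 checkers
    contexts: list = field(default_factory=lambda: [MachineState()])


single = Core()
print(single.pipeline_stages, len(single.contexts))
```

A plain core like this one has exactly one domino in its contexts list.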
Note that this represents the 486 core and the Opteron core equally
well. BUT. It does **not** represent the P4 core. Why?
Hyperthreading.
The hyperthreaded P4 maintains 4 copies of the computer state. So, we
place three more dominoes around the execution pipe; four altogether.
Only one "domino" is active at any one time. Which domino is active
depends on which thread is being executed. At any given time, there
are three inactive dominoes. Only one thread can be executed by the
P4 at any one time.
Is this SMT, **Simultaneous** MultiThreading? Not in my book, it isn't.
What it is, is a way to give the execution pipeline something to do
(another thread) when one thread stalls due to a cache miss. This can
change the performance of the P4, and when it's used it generally
does. The change can be anywhere from about -7% to +25%. On average,
the change is positive (an improvement).
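The stall-hiding idea can be sketched as a toy cycle counter. This
follows the tabletop model exactly as described above (one pipeline, one
active domino per cycle, switch when the active thread stalls) and is
not a model of any real hardware. The function name and the parameters
miss_every and miss_penalty are made up for the illustration, and the
numbers it produces are not meant to reproduce the -7% to +25% range.

```python
def run(num_contexts, work_per_thread=100, miss_every=10, miss_penalty=20):
    """Return the cycles one pipeline needs to finish every thread's work.

    A thread completes one instruction per cycle while it is active and
    stalls for miss_penalty cycles after every miss_every instructions.
    """
    remaining = [work_per_thread] * num_contexts
    done = [0] * num_contexts      # instructions completed per thread
    ready_at = [0] * num_contexts  # first cycle each thread may run again
    cycle = 0
    while any(remaining):
        # Threads that still have work and are not waiting on a miss.
        runnable = [t for t in range(num_contexts)
                    if remaining[t] and ready_at[t] <= cycle]
        if runnable:
            t = runnable[0]        # only one thread uses the pipeline
            remaining[t] -= 1
            done[t] += 1
            if done[t] % miss_every == 0 and remaining[t]:
                ready_at[t] = cycle + 1 + miss_penalty
        cycle += 1                 # an idle cycle if nothing was runnable
    return cycle


one = run(1)
two = run(2)
print(one, two)
```

With a second domino available, the pipeline fills part of each stall
with the other thread's work, so two threads finish in fewer cycles than
running them back to back.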
THIS IS IMPORTANT: Hyperthreading (and SMT in general) is both a
hardware and a software technique. When in use, that "control center"
block becomes more generalized as it includes both the hardware _and_
the OS (software), plus the application itself.
Building a True SMT Chip
------------------------
Simplicity itself. Start with our P4 representation, with four
dominoes. Spread them apart, and add three more execution pipelines,
so that each domino has its own pipeline. This is CMP, chip
multiprocessing. It is capable of true SMT; four threads can be
executed simultaneously.
This example has four cores on-chip. AMD and now Intel have announced
dual-core chips while IBM has been there for a while. Sun has just
announced an 8-core die as a "throughput" machine.
But Wait! We're Not Done Yet!
------------------------------
Let's Hyperthread-enable each of our four cores. We place 4 dominoes
around each of the four execution stacks. That's 16 dominoes all
told. Our chip can now execute 16 threads, four of them
simultaneously. Is this an SMT chip? Definitely yes. Can all its
threads be executed simultaneously? Definitely no.
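The bookkeeping is just multiplication, but here it is spelled out. The
Chip class and its field names are hypothetical, and the "one thread per
pipeline at a time" rule is the assumption of this writeup's model.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Chip:
    cores: int              # execution pipelines (checker stacks)
    contexts_per_core: int  # machine-state copies (dominoes) per pipeline

    @property
    def resident_threads(self):
        # Every domino can hold one thread's state.
        return self.cores * self.contexts_per_core

    @property
    def simultaneous_threads(self):
        # In this model each pipeline executes one thread at a time.
        return self.cores


chip = Chip(cores=4, contexts_per_core=4)
print(chip.resident_threads, chip.simultaneous_threads)
```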
Gentlemen, I present the future. In the future, the CPU die will
contain multiple cores, and each core will hold several copies of that
core's machine state.
Have we lost anything? Yes. We've lost the ability to focus all the
chip's compute resources on a single user's single-threaded
application. Technically, this isn't much of a disadvantage; most
legacy apps run much faster than needed already. BUT: IMNSHO there's
a HUGE marketing problem here.
If you think some folk are hanging onto their old CPUs too long
already, when a new CPU would in fact run legacy apps faster, wait
until the new era when a new CPU will NOT run legacy apps any faster
than the old, obsolescent CPU!
I don't think Intel has thought this part out. ;-(