A Map to the New World

Felger Carbon

There are some very astute techies who follow this NG and who
regularly contribute to it. There's also a lot of folk who mostly
lurk and, hopefully, learn. This writeup is intended for this latter
group.


I love show & tell. Let's make a computer core: we'll need some
checkers (round), dominoes (rectangular) and a large kiddie block (a
cube).

First we place the cube on a tabletop. The whole tabletop is our die
(the chip). Now, starting at the middle of the cube, we build a fence,
bending it around so that it's about a foot in diameter and returns to
the opposite face of the cube. This produces two tabletop spaces -
inside the fence, and outside.

The cube is our computer control center. Outside the fence, but on
the tabletop, is all the memory and I/O including the caches. Inside
the fence is our logic core. The control center exists in both
worlds, and connects them.

OK, let's take about 16 checkers and stack them up. This is our
computer execution pipeline, which does all the work. Looks like a
smokestack, doesn't it? Put it in the middle of the fenced area. Now
we need something to hold the machine state: a register set, program
and stack pointers, etc. We'll represent this with a domino, which
we'll place inside the fence alongside the execution pipeline. Voila!
A CPU core!
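
If it helps to see the same toy in code, here is a rough C sketch of
it. Everything here - the names, the sizes, the fields - is made up
for illustration; real hardware is nothing this tidy.

/* Toy model of the checkers-and-dominoes core. Illustrative only. */
#include <stdint.h>

/* The "domino": one copy of the machine state. */
struct machine_state {
    uint64_t regs[16];     /* register set                     */
    uint64_t program_ctr;  /* program counter                  */
    uint64_t stack_ptr;    /* stack pointer                    */
    uint64_t flags;        /* condition codes, mode bits, etc. */
};

/* The stack of checkers: the execution pipeline, which does the work. */
struct pipeline {
    uint32_t stage[16];    /* one in-flight instruction per stage */
};

/* One core inside the fence: one pipeline plus one domino. */
struct core {
    struct pipeline      pipe;
    struct machine_state state;
};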

Note that this represents the 486 core and the Opteron core equally
well. BUT. It does **not** represent the P4 core. Why?
Hyperthreading.

The hyperthreaded P4 maintains 4 copies of the computer state. So, we
place three more dominoes around the execution pipe; four altogether.
Only one "domino" is active at any one time. Which domino is active
depends on which thread is being executed. At any given time, there
are three inactive dominoes. Only one thread can be executed by the
P4 at any one time.
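
In the same toy-C terms (still just a sketch, and THREADS_PER_CORE = 4
here only because that's the number in this example, not a statement
about the real P4's design), hyperthreading turns the single machine
state into an array, with a marker for which copy the pipeline is
serving right now:

/* Toy hyperthreaded core: several dominoes, one shared pipeline.
   Reuses the machine_state and pipeline structs sketched above. */
#define THREADS_PER_CORE 4

struct smt_core {
    struct pipeline      pipe;                    /* the shared smokestack */
    struct machine_state state[THREADS_PER_CORE]; /* the four dominoes     */
    int                  active;                  /* which domino the pipe */
                                                  /* is working for now    */
};

/* When the active thread stalls - say, on a cache miss - the hardware
   simply hands the pipeline to another copy of the state (round-robin
   in this sketch). */
void switch_on_stall(struct smt_core *c)
{
    c->active = (c->active + 1) % THREADS_PER_CORE;
}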

Is this SMT, **Simultaneous** MultiThreading? Not in my book, it isn't.

What it is is a way to give the execution pipeline something to do
(another thread) when one thread stalls due to a cache miss. This can
change the performance of the P4, and in general it does (when used).
The change can be from about -7% to +25%. On average, the change is
positive (an improvement).
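
Here's the back-of-the-envelope version of why the change lands
somewhere in that range. All the numbers below are assumptions I made
up purely to show the shape of the tradeoff:

/* Toy stall-filling arithmetic. The 60/15/28 figures are assumed. */
#include <stdio.h>

int main(void)
{
    double busy = 0.60;       /* assume the pipe does useful work on   */
                              /* 60 cycles out of every 100; the rest  */
                              /* are lost to cache-miss stalls         */

    double recovered = 0.15;  /* assume a second thread usefully fills */
                              /* 15 of those 40 idle cycles            */
    double with_ht = busy + recovered;
    printf("best case: %+.0f%%\n", 100.0 * (with_ht - busy) / busy); /* +25% */

    double per_thread = 0.28; /* assume the two threads thrash the     */
                              /* shared cache, cutting each to 28      */
                              /* useful cycles per 100                 */
    double both = 2.0 * per_thread;
    printf("worst case: %+.0f%%\n", 100.0 * (both - busy) / busy);   /* -7%  */
    return 0;
}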

THIS IS IMPORTANT: Hyperthreading (and SMT in general) is both a
hardware and a software technique. When in use, that "control center"
block becomes more generalized as it includes both the hardware _and_
the OS (software), plus the application itself.

Building a True SMT Chip
------------------------

Simplicity itself. Start with our P4 representation, with four
dominoes. Spread them apart, and add three more execution pipelines,
so that each domino has its own pipeline. This is CMP, chip
multiprocessing. It is capable of true SMT; four threads can be
executed simultaneously.

This example has four cores on-chip. AMD and now Intel have announced
dual-core chips while IBM has been there for a while. Sun has just
announced an 8-core die as a "throughput" machine.

But Wait! We're Not Done Yet!
------------------------------

Let's Hyperthread-enable each of our four cores. We place 4 dominoes
around each of the four execution stacks. That's 16 dominoes all
told. Our chip can now execute 16 threads, four of them
simultaneously. Is this an SMT chip? Definitely yes. Can all its
threads be executed simultaneously? Definitely no.
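
In the toy-C terms from earlier in this writeup, the whole chip is just
two small constants (again, this example's numbers, not anybody's
announced product):

/* Toy CMP + SMT chip, reusing smt_core and THREADS_PER_CORE from above. */
#define NUM_CORES 4

struct chip {
    struct smt_core core[NUM_CORES];
};

/* 16 resident threads (dominoes), but only one per pipeline - four in
   all - can actually be executing in any given cycle. */
enum {
    RESIDENT_THREADS     = NUM_CORES * THREADS_PER_CORE, /* 16 */
    SIMULTANEOUS_THREADS = NUM_CORES                     /*  4 */
};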

Gentlemen, I present the future. In the future, the CPU die will
contain multiple cores, and each core will hold several copies of that
core's machine state.

Have we lost anything? Yes. We've lost the ability to focus all the
chip's compute resources on a single user's single-threaded
application. Technically, this isn't much of a disadvantage; most
legacy apps run much faster than needed already. BUT: IMNSHO there's
a HUGE marketing problem here.

If you think some folk are hanging onto their old CPUs too long
already, when a new CPU would in fact run legacy apps faster, wait
until the new era when a new CPU will NOT run legacy apps any faster
than the old, obsolescent CPU!

I don't think Intel has thought this part out. ;-(
 
Felger Carbon said:
legacy apps run much faster than needed already. BUT: IMNSHO there's
a HUGE marketing problem here.

If you think some folk are hanging onto their old CPUs too long
already, when a new CPU would in fact run legacy apps faster, wait
until the new era when a new CPU will NOT run legacy apps any faster
than the old, obsolescent CPU!

hmm.... but wouldn't legacy apps still see a performance increase from
going from last year's 2-core 1.5GHz model to this year's 4-core 2GHz
model?

--
L.Angel: I'm looking for web design work.
If you need basic to med complexity webpages at affordable rates, email me :)
Standard HTML, SHTML, MySQL + PHP or ASP, Javascript.
If you really want, FrontPage & DreamWeaver too.
But keep in mind you pay extra bandwidth for their bloated code
 
The little lost angel said:
hmm.... but wouldn't legacy apps still see a performance increase from
going from last year's 2-core 1.5GHz model to this year's 4-core 2GHz
model?

Sure. Not to mention the fact that the 4-core model may well have 3MB of L2
cache.

Further, the move to needing multithreading to get the full value out of a
CPU has already allegedly happened, with the Hyperthreading-enabled models
of the P4s. With multiple-core CPUs, it's just a matter of degree.
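
To put rough numbers on the angel's example (ignoring the cache-size
difference, which of course also helps): a single-threaded legacy app
gets the clock bump and nothing else. A toy sketch:

/* Back-of-envelope speedup for a single-threaded legacy app.
   Clock figures are from the question above; core count never
   enters into it, because one thread runs on one core. */
#include <stdio.h>

int main(void)
{
    double old_ghz = 1.5;  /* last year's 2-core model */
    double new_ghz = 2.0;  /* this year's 4-core model */

    printf("legacy-app speedup: %.2fx\n", new_ghz / old_ghz); /* 1.33x */
    printf("extra cores that app can use: 0\n");
    return 0;
}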
 
The little lost angel said:
hmm.... but wouldn't legacy apps still see a performance increase from
going from last year's 2-core 1.5GHz model to this year's 4-core 2GHz
model?

The problem is that tomorrow's 2-core CPU will not run today's
single-threaded apps faster than today's 1-core CPU.

Multi-core CPUs are a huge win for almost all servers. What the hell
they'll be doing on my single-user private party desktop, I dunno.
 
Felger Carbon said:
Multi-core CPUs are a huge win for almost all servers. What the hell

If they're running bloatware. But a fileserver, newsserver
or popserver should be I/O (disk & network) limited.
A webserver with lots of scripts might wind up compute limited.
AFAIK a Google-style search engine is memory bandwidth-limited.
they'll be doing on my single-user private party desktop, I dunno.

Me neither, unless some must-have compute-intensive software
comes along. 'Til then, I like my 1999 vintage dual Celeron.

-- Robert
 
Robert said:
If they're running bloatware. But a fileserver, newsserver
or popserver should be I/O (disk & network) limited.

OLTP tends to stall a lot. Nothing much you can do about it, apparently,
except to add more pipes and to let them stall. You can haggle over how
to add the pipes, like SMT or CMP, but those are details. In the end,
adding pipes is a win, up to a point, as long as you are putting
underutilized bandwidth to work. Beyond that point, you are just
wasting transistors on pipes that will stall. Working this all out is a
paycheck for somebody. :-).

The critical resource for OLTP is bandwidth. Disk and I/O bandwidth are
shared and so are not affected by how you deploy processors, but memory
bandwidth most definitely is. If you have a single pipe and a single
memory controller, then the memory bandwidth that controller schedules
will be used in haphazard ways, and you will inevitably throw away
bandwidth. If you let multiple pipes share a single memory controller,
then, with luck, you will wind up with the maximum bandwidth utilization
possible.
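
A toy version of that arithmetic (the per-pipe figure is purely an
assumption for illustration):

/* Toy bandwidth-utilization arithmetic: n pipes sharing one memory
   controller. "b" is an assumed fraction of peak bandwidth that a
   single stall-prone OLTP pipe demands on its own. */
#include <stdio.h>

int main(void)
{
    double b = 0.35;                 /* assumed per-pipe demand  */
    for (int n = 1; n <= 4; n++) {
        double used = n * b;
        if (used > 1.0) used = 1.0;  /* the controller saturates */
        printf("%d pipe(s): %3.0f%% of peak bandwidth\n", n, 100.0 * used);
    }
    return 0;
}

Past the saturation point, the extra pipes are the wasted transistors
mentioned above - they just queue up behind the controller and stall.
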
A webserver with lots of scripts might wind up compute limited.

Ever seen any published evidence of that? Not a challenge. I'm
curious. I'd guess that you'd see spikes in processor utilization
interspersed with an idle processor and an average utilization no better
than the ~40% utilization that fully-loaded OLTP processors see, anyway.
AFAIK a Google-style search engine is memory bandwidth-limited.

When this subject came up in comp.arch, a poster claimed that Google
claimed (i.e., I can come up with a link to a comp.arch thread and
nothing more) that compute, disk, and memory were all roughly in balance
for their system. Would you expect any less from such a bright bunch?
Me neither, unless some must-have compute-intensive software
comes along. 'Til then, I like my 1999 vintage dual Celeron.

I went into a cataleptic state for ten seconds, consulted my private
oracle, and came out with the conclusion that programming styles,
compilers, or architectures would have to change so that more ordinary
applications could utilize multiple pipes, and that such a change is
inevitable. If my oracle says so, it must be true. :-).

So many things point that way that I do not see how things will be
otherwise. If we haven't reached a knee in the performance curve for a
single pipe, we're going to very soon. Graphics processors and thus
games already make abundant use of threaded programming.

I'd like to see programming for parallelism become safer and easier.
Having had that problem on the table for so long with what I see as very
little progress, I'm not optimistic. There are people for whom I have
profound respect working on this problem, but I'm not sure that they
don't underestimate it. People work at the surface with languages and
whatnot, when the real problem is that we lack even the most basic tools
for talking about actual algorithms in a formal way. People are working
at that level, too. For the most part, though, no one pays attention.

Compilers for parallelism have also seen a lot of work and went through a
period when a tremendous amount of money was pumped into them. 'Nuff said.

In the face of this record of disappointment, hardware architects have
made what I see as substantial progress on the problem while addressing
the memory latency problem. If you want something done, ask a busy
person to do it. OoO processors look through a rather large window in
an instruction stream (and it has to be large to hide memory latency)
and move forward whatever they can. As aggressive as the strategies
currently in use are, they could be even more aggressive. An OoO
superscalar processor already parallelizes opportunistically, and it is
only a short step to handing a processor a binary compiled for
single-threaded execution and watching it execute multi-threaded. It
will happen. I heard it straight from my own private oracle. :-).
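
A trivial illustration of the kind of parallelism that's already
sitting in single-threaded code, waiting for the hardware to find it:

/* Single-threaded source with plenty of implicit parallelism. Each
   iteration touches only a[i] and b[i], so an out-of-order core can
   keep several iterations (and their cache misses) in flight inside
   its window with no help at all from the programmer. */
void scale(double *a, const double *b, int n)
{
    for (int i = 0; i < n; i++)
        a[i] = 2.0 * b[i] + 1.0;
}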

RM
 
Robert Myers said:
I went into a cataleptic state for ten seconds, consulted
my private oracle, and came out with the conclusion that
programming styles, compilers, or architectures would
have to change so that more ordinary applications could
utilize multiple pipes, and that such a change is
inevitable. If my oracle says so, it must be true. :-).

Gosh, I hope you make a full recovery, Robert. How else will I learn
what the future holds? ;-)

But in the past - starting a microsecond ago - all my software was
single-threaded. Past software is legacy software. Future software
is not legacy software. I stated (didn't I?) that the future
dual-core CPUs would not improve the performance of (single-threaded)
**legacy** applications.

In other words, Robert, you changed the subject from legacy software
(that's the stuff we already own) to future software (which none of us
own). This means to benefit from future dual-core CPUs we will also
have to buy new software? Hmm. New hardware **and** new software?
Sounds like we'll all have to throw out what we have right now, today,
and buy all new stuff, both hardware and software.

I suggested this might be a huge marketing problem, as I recall.

I'm sure glad you're around. I would never have come to the
conclusion that we're all gonna trash what we have now! ;-) ;-)
 
Felger said:
Gosh, I hope you make a full recovery, Robert.

Even the most optimistic of my friends gave up on that thought long ago.
How else will I learn what the future holds? ;-)

I wouldn't be missed. No shortage of fortune tellers on Usenet. ;-).
But in the past - starting a microsecond ago - all my software was
single-threaded. Past software is legacy software. Future software
is not legacy software. I stated (didn't I?) that the future
dual-core CPUs would not improve the performance of (single-threaded)
**legacy** applications.

In other words, Robert, you changed the subject from legacy software
(that's the stuff we already own) to future software (which none of us
own).

The implication being that the software we own can't be used on
radically different hardware. Always an incorrect conclusion for Linux
users who can just recompile. Transmeta has some ideas of its own, and
the most likely line of development I see for aggressively scheduled SMT
cores wouldn't need new software, either.
This means to benefit from future dual-core CPUs we will also
have to buy new software? Hmm. New hardware **and** new software?
Sounds like we'll all have to throw out what we have right now, today,
and buy all new stuff, both hardware and software.

Ah, but you see, onboard scheduling hardware already evokes parallelism
from a nominally single-threaded instruction stream. The parallelism
frequently is there, whether you are accustomed to diagramming it that
way (or whatever mental way you have of thinking of parallel processes)
or not. On-board scheduling hardware, among other things, discovers and
implements streaming parallelism, although not always without a struggle.

On-board scheduling hardware could initiate new threads where there is
exploitable parallelism. If you're not _sure_ the parallelism is there,
you can speculate, often successfully. In the most naive of strategies,
you just pick places to jump into the instruction stream and start a new
thread.
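
To make that concrete, here's roughly the split such a scheme might
make, written out by hand with ordinary threads. The point is that the
transformation is mechanical - an aggressive enough scheduler could do
the equivalent to the binary without the programmer or the compiler
ever writing it:

/* Hand-written version of the split a speculative-threading scheme
   might make on its own: the top half of the loop is handed to a
   second hardware thread, since the two halves never touch the same
   elements. */
#include <pthread.h>

struct half { double *a; const double *b; int lo, hi; };

static void *do_half(void *p)
{
    struct half *h = p;
    for (int i = h->lo; i < h->hi; i++)
        h->a[i] = 2.0 * h->b[i] + 1.0;
    return 0;
}

void scale_split(double *a, const double *b, int n)
{
    pthread_t t;
    struct half top = { a, b, n / 2, n };   /* the speculative "new thread" */
    struct half bot = { a, b, 0, n / 2 };   /* the original stream          */

    pthread_create(&t, 0, do_half, &top);
    do_half(&bot);
    pthread_join(t, 0);
}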

One place to read about this kind of stuff is Andy Glew's home page.
It's also been a subject, one way or another, of a fair number of my
posts to comp.arch. Dick Wilmot has a particularly aggressive scheme
that he calls data surfing. Since I don't know any better, I think of
Andy Glew as the leading proponent of this particular set of tactics.

If you google comp.arch for "dusty decks" over the last year, you will
find more than one thread talking about feeding, er, legacy software to
a hungry multi-threaded monster.
I suggested this might be a huge marketing problem, as I recall.

If Intel decides that persuading people that "legacy" software is
something they don't need or want is in their best interest, they'll
find a way to market it. I would suspect that things are a bit tense
between the two halves of the Wintel monopoly just about now.

As it is, I don't think Intel will need any spectacular marketing ploys,
because my most likely scenario is that hardware will manage to
accommodate legacy software, anyway.
I'm sure glad you're around. I would never have come to the
conclusion that we're all gonna trash what we have now! ;-) ;-)

It's always nice to feel appreciated. Thank you. :-).

I would suspect that most performance-sensitive "legacy" applications,
meaning really Windows applications, have been gradually retuned and
recompiled from source as it is. You think you don't own software that
knows about MMX, SSE, SSE2, and the vagaries of the Pentium 4?
If you don't, you haven't bought software in a long time.

RM
 
Felger Carbon said:
The problem is that tomorrow's 2-core CPU will not run today's
single-threaded apps faster than today's 1-core CPU.

Multi-core CPUs are a huge win for almost all servers. What the hell
they'll be doing on my single-user private party desktop, I dunno.


Geez, Felg. How many times do I have to tell you that WinBlows
isn't all there is to computing! Sometimes people have multiple
things going on at once! ;-)

SMP is Billy's best chance to actually have a multitasking OS!
....although others figured out how to do it on a UP long ago.
 