Let's wind the clock back to 1970 (perhaps 1960). We had the
same issues then. Caches weren't invented in 1990, as you
apparently believe.
What in that post would lead you to believe that I thought cache was
invented in 1990? The (implied) context was the world of
microprocessors, and the particular point of discussion was the
importance of _cache_ misses.
CACHE misses, Keith. That's what we were talking about. Could anyone
grasp how catastrophic CACHE misses would become for performance and
how it would be handled in the long run. C-A-C-H-E misses.
To clear up any possible confusion about how well this issue was (not)
understood as late as the mid-90's, I went looking for references to
the term "compulsory cache miss". Among other things, I turned up
http://www.complang.tuwien.ac.at/anton/memory-wall.html
by Anton Ertl, a frequent and respected contributor to comp.arch. In
that document, he is taking aim at a famous paper, "Hitting the Memory
Wall: Implications of the Obvious" by Wulf et al.
The belief at the time was that computing time would eventually be
dominated by compulsory cache misses, and Ertl's main beef was that
"Hitting the Memory Wall" made unwarranted assumptions about
compulsory cache misses. Even Prof. Ertl missed the point.
In order to make it completely, utterly, unalterably, unarguably,
transparently clear just *how* poorly the issue was understood at the
time, I am going to make an extended quote from the famous Wulf paper:
<begin quote>
To get a handle on the answers, consider an old friend, the equation
for the average time to access memory, where t_c and t_m are the
cache and DRAM access times and p is the probability of a cache hit:

    t_avg = p*t_c + (1-p)*t_m

We want to look at how the average access time changes with
technology, so we'll make some conservative assumptions; as you'll
see, the specific values won't change the basic conclusion of this
note, namely that we are going to hit a wall in the improvement of
system performance unless something basic changes.

First let's assume that the cache speed matches that of the
processor, and specifically that it scales with the processor speed.
This is certainly true for on-chip cache, and allows us to easily
normalize all our results in terms of instruction cycle times
(essentially saying t_c = 1 cpu cycle). Second, assume that the cache
is perfect. That is, the cache never has a conflict or capacity miss;
the only misses are the compulsory ones. Thus (1-p) is just the
probability of accessing a location that has never been referenced
before (one can quibble and adjust this for line size, but this won't
affect the conclusion, so we won't make the argument more complicated
than necessary).

Now, although (1-p) is small, it isn't zero. Therefore as t_c and t_m
diverge, t_avg will grow and system performance will degrade. In
fact, it will hit a wall.
<end quote>
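
Just to spell out the arithmetic (my own illustrative numbers, not
Wulf's): hold t_c at 1 cycle and the compulsory miss rate at 1%, and
let t_m grow as processor and DRAM speeds diverge. A few lines of C
show the trend:

/* Back-of-the-envelope numbers for the quoted equation; the miss
   rate and the t_m values are assumptions for illustration only. */
#include <stdio.h>

int main(void)
{
    const double p   = 0.99;   /* probability of a cache hit (assumed) */
    const double t_c = 1.0;    /* cache access time, in cpu cycles     */
    const double t_m_values[] = { 10.0, 50.0, 100.0, 200.0, 400.0 };
    const int n = sizeof(t_m_values) / sizeof(t_m_values[0]);

    for (int i = 0; i < n; i++) {
        double t_m   = t_m_values[i];
        double t_avg = p * t_c + (1.0 - p) * t_m;
        printf("t_m = %4.0f cycles  ->  t_avg = %5.2f cycles\n",
               t_m, t_avg);
    }
    return 0;
}

With t_m at 400 cycles, t_avg is already about 5 cycles, roughly five
times the perfect-cache figure, and it keeps growing linearly with
t_m. That is the wall the paper is describing.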
Simple, obvious, easy to state, easy to understand, and WRONG.
Today's computers hide many so-called compulsory misses by finding
something else for the processor to do while waiting for the needed
data to become available. What makes OoO so powerful is that it can
hide even compulsory misses, and people just didn't get it, even by
the mid-nineties. People were still thinking of cache in terms of
data re-use.
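
A rough way to see the point from the software side (my sketch,
nothing from the paper): what an OoO machine can hide depends on the
dependence structure of the code, not on whether the miss is a first
reference. Both loops below touch every element exactly once, so
every miss is "compulsory"; only the second one really hurts.

/* Sketch only.  Independent accesses: a miss on a[i] does not block
   the loads of a[i+1], a[i+2], ..., so an OoO core can run ahead and
   overlap many outstanding misses with the additions. */
#include <stddef.h>

long sum_array(const long *a, long n)
{
    long sum = 0;
    for (long i = 0; i < n; i++)
        sum += a[i];
    return sum;
}

/* Dependent accesses: the address of the next load is not known
   until the current load completes, so every first-reference miss
   stalls the machine no matter how aggressively it reorders. */
struct node { struct node *next; long value; };

long sum_list(const struct node *p)
{
    long sum = 0;
    for (; p != NULL; p = p->next)
        sum += p->value;
    return sum;
}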
No one whose primary experience was on a Cray-1 type machine would
have made such a mistake, because there was no cache to miss, and
memory access latency was significant (eight cycles, if I recall
correctly without checking). Cray-type machines had been hiding most
so-called compulsory misses ever since the first one went into
production, and with in-order execution, by the simple expedient of
pipelining.
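
In source-level terms (my sketch, not a description of any real Cray
code generator), hiding the latency with in-order execution amounts
to issuing each load before its result is needed and doing
already-loaded work in the gap:

/* Software-pipelined copy-and-scale: the load of b[i+1] is issued
   while the arithmetic on b[i] is still in flight, so a pipelined
   in-order machine overlaps memory latency with useful work. */
void scale(double *a, const double *b, double k, long n)
{
    if (n <= 0)
        return;

    double cur = b[0];              /* first load issued one step early */
    for (long i = 0; i < n - 1; i++) {
        double next = b[i + 1];     /* next load starts now ...         */
        a[i] = k * cur;             /* ... while we use this value      */
        cur = next;
    }
    a[n - 1] = k * cur;
}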
The term compulsory cache miss can still be found in much more recent
references, e.g.,
http://www.extremetech.com/article2/0,3973,34539,00.asp
but I have no idea why people keep talking about compulsory misses,
because the concept has turned out not to be all that important.
The misses that count are not first reference misses, or compulsory
misses, but first reference misses that are made too late to avoid
stalling the pipeline. For a processor like Itanium, whether you can
make memory requests early enough to avoid stalling the pipeline
depends on how predictable the code is. For an OoO processor, whether
you can make requests early enough to avoid stalling the pipeline
depends on lots of things, including how aggressive you want to be in
speculation and how much on-die circuitry you are willing to commit to
instruction juggling.
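
For what it's worth, the same "early enough" idea shows up in
software as well. Here is a sketch (mine, and the prefetch distance
is a made-up tuning knob, not a recommendation) using the GCC/Clang
__builtin_prefetch builtin on a predictable streaming loop: the
request for a[i+D] goes out D iterations before the data is needed,
so the first-reference miss is resolved before it can stall anything.

/* Sketch: software prefetch issued a fixed distance ahead of use.
   The distance below is an arbitrary example value. */
#define PREFETCH_DISTANCE 64

long sum_with_prefetch(const long *a, long n)
{
    long sum = 0;
    for (long i = 0; i < n; i++) {
        if (i + PREFETCH_DISTANCE < n)
            __builtin_prefetch(&a[i + PREFETCH_DISTANCE], 0, 1);
        sum += a[i];
    }
    return sum;
}

On a real OoO core a hardware prefetcher would likely catch a simple
stride like this anyway; the point is only that "early enough" is the
property that matters, not "never referenced before".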
The designers of Itanium bet that on-die scheduling would take too
many transistors and too much power to work well. They bet wrong, but
to say that the issues were well understood when they put their money
down is simply to ignore history.
RM