I think most of the performance gains we are seeing with the
How would you tell the difference?
Easily. Most software uses int, not "long long" or __uint64, so the
additional optimization opportunities the compiler could possibly gain from
the availability of wider registers are limited. The opportunities to
organize code so that it has fewer dependencies on previous instructions,
without reads/writes to memory, are more numerous, even if the L1 cache is
'fast', and even if the code is dynamically translated to the 'micro-ISA'
(for lack of a better term coming to mind for what is being done these days)
that the ALUs internally process.
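As a rough sketch of the kind of reorganization I mean (my own toy
illustration, not measured code), summing an array with several independent
accumulators lets the extra x86_64 registers hold all the partial sums, so
the adds no longer form one long dependency chain:

/* Hedged sketch: four independent accumulators.  With the 8
 * general-purpose registers of plain x86 the compiler is more likely to
 * spill; with the 16 of x86_64 the partial sums can all stay in
 * registers, and the four add chains are independent of each other. */
#include <stddef.h>

long sum4(const int *a, size_t n)
{
    long s0 = 0, s1 = 0, s2 = 0, s3 = 0;
    size_t i;

    for (i = 0; i + 4 <= n; i += 4) {
        s0 += a[i + 0];   /* four chains the ALUs can overlap */
        s1 += a[i + 1];
        s2 += a[i + 2];
        s3 += a[i + 3];
    }
    for (; i < n; ++i)    /* leftover elements */
        s0 += a[i];

    return s0 + s1 + s2 + s3;
}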
Average speed differences from one batch of tests run on AMD64 (x86_64 in
GCC terminology) showed an average 15% speed increase from mere
recompilation with early versions of GCC for x86_64. As time goes by, the
optimizations the compiler can employ are probably going to increase, but
that's what the early tests show. I can't recall the link; could be I picked
it up from slashdot.org a while ago. Could remember wrong.
I don't have an AMD64 at this time, but I _do_ develop for the MIPS IV ISA
and x86 at this time and have some little practical knowledge to base the
above educated guess on (yup, I don't claim it's the Truth or that I have
the ultimate clue, but I base the opinion on previous experience in
programming and micro-architecture in general and the effect it has on C/C++
code generation -- yes, I do check compiler output and adjust the source
code accordingly for the segments of code which are important, which are not
very many, but it still puts me on this particular job now and then even as
of 2004).
The across-the-board winner, practically all problems, practically all
coding styles, has almost got to be reduced latency. In the same
category as increasing the size of the cache, only much better,
because you don't have to work so hard to suck stuff into the cache.
L2 cache avoids very expensive reads from main memory, but as long as the
working set fits into L2 anyway, the way L1 is implemented can give a 1-2
clock cycle improvement in latency for VERY *TIGHT* inner loops. The P4's
latest incarnation does appear to favour more (space) over less (latency),
though.
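To give a concrete (if contrived) sketch of the kind of tight inner loop
where those couple of cycles show up -- my own illustration, not a
benchmark -- plain pointer chasing serializes every iteration on one L1
load-to-use latency:

#include <stddef.h>

struct node {
    struct node *next;
    int payload;
};

/* Each step must wait for the previous load before the next address is
 * even known, so the loop runs at roughly one L1 latency per node. */
int walk(const struct node *n)
{
    int sum = 0;
    while (n != NULL) {
        sum += n->payload;
        n = n->next;
    }
    return sum;
}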
Cache efficiency is not merely a matter of fitting the working set into the
cache; for a dataset which DOESN'T fit, there are still all-important
questions about how the cache is used, and the cache DEFINITELY DOES NOT
guarantee an across-the-board win for all coding styles. Merely having a
cache does help on average; that has been demonstrated, and that's why there
IS cache on modern systems. But using it in specific ways goes the extra
mile. First, how many ways the cache has affects how many working sets the
code can use at one time without a performance dive. Secondly, the pattern
of storing data can have a substantial effect on AVERAGE performance.
FATMAP2 is a classic document on the issue, showing how the storage pattern
can improve performance by an average of 50% (and even more). In fact,
tiling is a VERY COMMON pattern for storing texels in modern 3D
accelerators. It's not black magic or a mystery why this is being done. The
same technique benefits applications written for a general-purpose CPU.
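A rough sketch of the tiling idea (my own toy version, not lifted from
FATMAP2; it assumes the width is a multiple of the tile size):

#include <stddef.h>

#define TILE 8   /* 8x8 texels per tile */

/* Row-major: vertically neighbouring texels are a whole row apart. */
static size_t idx_linear(size_t x, size_t y, size_t w)
{
    return y * w + x;
}

/* Tiled: the 64 texels of one 8x8 block are contiguous, so a small 2D
 * neighbourhood touches a handful of cache lines instead of one line
 * per row. */
static size_t idx_tiled(size_t x, size_t y, size_t w)
{
    size_t tiles_per_row = w / TILE;
    size_t tile  = (y / TILE) * tiles_per_row + (x / TILE);
    size_t inner = (y % TILE) * TILE + (x % TILE);
    return tile * TILE * TILE + inner;
}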
More register names. My guess is that the biggest benefit is to
compiler writers and hand coders.
Hand coders are a dying breed, but coders who do think a bit about what they
want the computer to do, rather than what they want an abstract virtual
language to do, are sometimes what separates a good implementation from a
bad one. Some program "C++", some program machine code USING C++, even if
they use templates, partial specialization, namespaces, inheritance,
etc. etc. A subtle difference in theory, but it can lead to more efficient
code in practice. However, dwelling too much on performance issues ALL THE
TIME is a waste of time, and therefore a waste of money and brainpower, life
and all that naturally follows from it. Whenever given the choice, I'd write
the clear and easy-to-read solution rather than an obfuscated one, even if
the obfuscated one were 50% faster. The chances are that the clear code is
easier to refactor, and easier to switch to an algorithm which gives ORDER
OF MAGNITUDE better performance in the long haul anyway, even if I wrote
"slow code". But that doesn't mean I'd go writing code I know to be crappy
just because it gets the job done; the point I had in mind was that
experience leads to automatically writing things the 'right way'..
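A toy example of what I mean by that order-of-magnitude switch (my own
illustration, hypothetical function names): the clear version below is
trivial to read and trivial to refactor, and swapping a nested O(n^2) scan
for qsort() plus one adjacent comparison is a win that no amount of
micro-tuning of the slow loop would ever catch up with.

#include <stdlib.h>
#include <stddef.h>

static int cmp_int(const void *a, const void *b)
{
    int x = *(const int *)a, y = *(const int *)b;
    return (x > y) - (x < y);
}

/* Does the array contain any duplicate value?  O(n log n). */
int has_duplicate(int *v, size_t n)
{
    qsort(v, n, sizeof v[0], cmp_int);   /* clear, library-backed */
    for (size_t i = 1; i < n; ++i)
        if (v[i] == v[i - 1])
            return 1;
    return 0;
}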
Experience in programming is proficiency in applying the 'patterns' you've
learned to the practical problems you are solving at the time. Programming
is problem solving and pattern matching at the same time. I don't care if
someone disagrees; this is just my opinion. And I am on drugs and
unemployed, so perhaps no one should take my advice after all -- see where
it would get you?
But I hope the regulars here got a good chuckle out of this, and even more
satisfaction in claiming what a clueless git I was. This one's on me.
Enjoy.