Terje Mathisen said:
Rather the opposite: The extra GPRs is a relatively small improvement,
but as you said, it made a lot of sense to do it at the same time.
I recently helped a friend with perfomance optimization of some string
searching code. They were using strstr, which was acceptably fast, but
needed a case insensitive version and the case insensitive versions of
strstr were much slower. The platform where this was discovered was
Win32 using VC++ libraries, but Linux, Solaris, AIX, HP-UX, etc. are
also targets for the product.
I suggested using Boyer-Moore-Gosper (BMG) since the search string is
applied to a very large amount of text. A fairly straight forward BMG
implementation (including case insensitivity) in C is ~3 times faster
than strstr on PowerPC and SPARC for the test cases they use. On PIII
class hardware it is faster than strstr by maybe 50%. On a P4 it is a
little slower than strstr.
The Microsoft strstr is a tight hand coded implementation of a fairly
simple algorithm: fast loop to find the first character, quick test
for the second character jumping to strcmp if success, back to first
loop if failure.
The BMG code had ridiculous average stride. Like 17 characters for
this test corpus. But the inner loop did not fit in an x86 register
file. (It looked like I might be able to make it fit if I hand coded
it and used the 8-bit half registers, and took over BP and SP, etc.
But it wasn't obvious it could be done and at that point, they were
happy enough with the speedup it did give so...)
It turns out that the P4 core can run the first character search loop
very fast. (I recall two characters per cycle, but I could be
misremembering. It was definitely 1/cycle.) The BMG code, despite
theoretically doing a lot less work, stalls a a bunch. A good part of
this is that it has a lot of memory accesses that are dependent on
loads of values being kept in memory because there are not enough
registers. 16 registers is enough to hold the whole thing. (I plan on
redoing the performance comparison on AMD64 soon.)
The point is this: having "enough" registers provides a certain
robustness to an architecture. It allows register based calling
conventions and avoids a certain amount of "strange" inversions in the
performance of algorithms. (Where the constant factors are way out of
whack compared to other systems.) As a practical day to day matter, it
makes working with the architecture easier. (Low-level debugging is
easier when items are in registers as well.)
I expect there are power savings to be had avoiding spill/fill
traffic, but I'll leave that to someone with more knowledge of the
hardware issues.
AMD64 is a well executed piece of practical computer architecture
work. Much more in touch with market issues than IPF ever has been or
ever will be.
-Z-