Zalman said:
I recently helped a friend with perfomance optimization of some string
searching code. They were using strstr, which was acceptably fast, but
needed a case insensitive version and the case insensitive versions of
strstr were much slower. The platform where this was discovered was
Win32 using VC++ libraries, but Linux, Solaris, AIX, HP-UX, etc. are
also targets for the product.
I suggested using Boyer-Moore-Gosper (BMG) since the search string is
applied to a very large amount of text. A fairly straight forward BMG
implementation (including case insensitivity) in C is ~3 times faster
than strstr on PowerPC and SPARC for the test cases they use. On PIII
class hardware it is faster than strstr by maybe 50%. On a P4 it is a
little slower than strstr.
The Microsoft strstr is a tight hand coded implementation of a fairly
simple algorithm: fast loop to find the first character, quick test
for the second character jumping to strcmp if success, back to first
loop if failure.
The BMG code had ridiculous average stride. Like 17 characters for
this test corpus. But the inner loop did not fit in an x86 register
file. (It looked like I might be able to make it fit if I hand coded
it and used the 8-bit half registers, and took over BP and SP, etc.
But it wasn't obvious it could be done and at that point, they were
happy enough with the speedup it did give so...)
OK, this is fun!
I wrote an 8-bit, case-insensitive, BM search many years ago, and I
found it relatively easy to make it fit in 6-7 regs, even in the 16-bit
days.
Let's see... darn, I've lost the source code. :-(
It was something like this (with 486 timings):
; SI -> source string
; DI -> Search record, containing among other things the
; Skip distance and case conversion lookup tables, as well
; as a (possibly) monocased target string
next4:
mov bl,[si] ; 1 Load next char to test
mov bl,[di+bx] ; 1+1(AGI) Convert to skip distance
add si,bx ; 1 Increment source ptr
mov bl,[si] ; 2 Load next char to test
mov bl,[di+bx] ; 1+1(AGI) Convert to skip distance
add si,bx ; 1 Increment source ptr
mov bl,[si] ; 2 Load next char to test
mov bl,[di+bx] ; 1+1(AGI) Convert to skip distance
add si,bx ; 1 Increment source ptr
mov bl,[si] ; 2 Load next char to test
mov bl,[di+bx] ; 1+1(AGI) Convert to skip distance
add si,bx ; 1 Increment source ptr
test bx,bx ; 1 Did we get a match?
jnz next4 ; 3 No, so try again
; Check for
; Possible match, check remaining chars...
At this point I did have to spill/fill one register, it might have been
faster to do an other explicit test against the second-to-last target
character first.
Something like 10 years ago I figured out how to make it work for
case-insensitive 16-bit strings. There are two key ideas to make this work:
1) Use a _very_ simple hash table for the skip distance lookup, i.e. XOR
the top 6 bits into the lower 10, and use the resulting 0..1023 value as
an index into the skip table. For each location in that table, there
will be 64 collisions. Record the shortest possible skip distance for
any of these. This is normally still very close to the maximum skip
distance!
2) During setup, generate two versions of the target string, one
lowercased and the other uppercased. During a check for a possible match
you should, for each iteration of the match verification loop, test
against both versions of the string.
This avoids a possibly _very_ expensive 16-bit toupper() call for each
char loaded!
It turns out that the P4 core can run the first character search loop
very fast. (I recall two characters per cycle, but I could be
misremembering. It was definitely 1/cycle.) The BMG code, despite
1/cycle? Hmmm!
With one char/iteration, that's impossible:
next:
mov al,[esi] ; 1
inc esi ; 1
cmp al,bl ; 2
jne next ; 2
Unrolling makes it better:
next4:
movzx eax,word ptr [esi] ; Load 16 bits
cmp bl,al
je first_match
cmp bl,ah
je second_match
movzx eax,word ptr [esi+2] ; Load next 16 bits
add esi,4
cmp bl,al
je third_match
cmp bl,ah
jne next4
fourth_match:
Even better would be to load 32 bits and check all four characters at
the same time:
next4:
mov eax,[esi]
add esi,4
cmp bl,al
je first_match
cmp bl,ah
je second_match
shr eax,16
cmp bl,al
je third_match
cmp bl,ah
jne next4
fourth_match:
A case-insensitve version of the same kind of loop could also be written:
next:
mov al,[esi]
inc esi
cmp al,bl ; tolower(first_char)
je match
cmp al,bh ; toupper(first_char)
jne next
match:
mov al,[esi]
cmp al,dl ; tolower(second_char)
je match_pair
cmp al,dh
jne next
match_pair:
I have a feeling that a regular, case-sensitive, strstr() could work
quite well on x86 by using/abusing the ability to do misaligned loads
quite fast. I.e. only when straddling cache lines do you get a real penalty:
next4:
cmp ebx,[esi]
je match4_0
cmp ebx,[esi+1]
je match4_1
cmp ebx,[esi+2]
je match4_2
cmp ebx,[esi+3]
add esi,4
jne next4
match4_3:
sub esi,3
match4_2:
inc esi
match4_1:
inc esi
match4_0:
; ESI -> place where the first four characters all match,
; check the rest!
theoretically doing a lot less work, stalls a a bunch. A good part of
this is that it has a lot of memory accesses that are dependent on
loads of values being kept in memory because there are not enough
registers. 16 registers is enough to hold the whole thing. (I plan on
redoing the performance comparison on AMD64 soon.)
The inner loop will always fit in registers, but the data-dependent skip
distances means that all the loads suffer AGI stalls.
The point is this: having "enough" registers provides a certain
robustness to an architecture. It allows register based calling
conventions and avoids a certain amount of "strange" inversions in the
performance of algorithms. (Where the constant factors are way out of
whack compared to other systems.) As a practical day to day matter, it
makes working with the architecture easier. (Low-level debugging is
easier when items are in registers as well.)
I expect there are power savings to be had avoiding spill/fill
traffic, but I'll leave that to someone with more knowledge of the
hardware issues.
AMD64 is a well executed piece of practical computer architecture
work. Much more in touch with market issues than IPF ever has been or
ever will be.
That is something we can agree on!
(Or as I told our Dell representative this spring, by coincidence the
day before Intel caved in and announced their x86-64 version: "Dell has
to deliver 64-bit x86 product within six months or so, or you'll be
history.")
A C(++) version of the 16-bit (possibly) case-insensitive BM:
do {
do {
c = [src];
c = hash(c); // #define hash(c) (c & 1023)
skip = skiptable[c];
src += skip;
} while (skip);
// Assume we placed a sentinel at the end!
if (src >= source_end) return NULL;
// Since the skip table test can have false positives, we
// must check the entire string here!
s = src;
l = target_len;
tu = target_upper_end;
tl = target_lower_end;
do {
c = [s--];
if (c != *tu) && (c != *tl) break;
tu--; tl--;
} while (--l);
if (! l) return (s+1);
src++;
} while (src < source_end);
When doing a case-sensitive search, you simply allocate a single buffer
for the target string, and set both tu and tl to point to the last
character in it.
Terje
PS. When the buffer to be searched can be shared or read-only, then it
is faster to make two versions of the code above, with the first using
an unrolled inner loop that checks that it stays within (UNROLL_FACTOR *
MAX_STRIDE) of the end, and the second version doing an explicit check
on each iteration.