Scott said:
The reverse. The GPU can't do loop unrolling, since it controls the
entire iteration through the matrix being processed (it's implied
looping, to be precise). It was the AMD64 for which I had to do the
manual unrolling.
gcc is not your friend.
More than 10 years ago, when I was still a student, one of the PhD
students in the numerics department ran a matrix-multiply competition for
the PA-RISC CPUs we had on our workstations. He estimated that 30 MFLOPs
would be possible, even though a naive C loop got less than 1 MFLOP, and
the HP Fortran compiler with a built-in "extremely fast" matrix
multiplication got no more than 10 MFLOPs.
After doing some experiments, I did indeed get 30 MFLOPs out of the
thing, by doing several levels of blocking. The inner loop kept a small
submatrix accumulator (as much as would fit; I think I got 5x5 into the
registers), so that several rows and columns could be multiplied together
in one go (saving a lot of loads and stores). The next blocking level was
the (quite large) cache of the PA-RISC machine, i.e. subareas of both
matrices were multiplied together.
I never got around to making the matrix multiplication routine
general-purpose (the benchmark version could only multiply 512x512
matrices), but today, this sort of blocking is state of the art in high
performance numerical libraries. GCC isn't your friend, because loop
unrolling here is really the wrong approach. The inner loop I used just
did all the multiplications for the 5x5 submatrix, and no further
unrolling was necessary.