gcc binary performance on amd64

  • Thread starter: Kenneth Massey

Kenneth Massey

Here is a post I made to the g++ newsgroup. It has been reported back that the performance
lag doesn't happen on Pentium architectures. Maybe somebody here can try it on their AMD64.

------------

I was noticing significantly worse performance in some of my C++ code compiled with gcc 3.4.3
as compared to gcc 3.3.4. I have boiled it down to one relatively short program that illustrates
the problem. It seems to be an issue of excessive cache misses in certain pointer lookup
operations in gcc 3.4.3 binaries. BTW, are there any tools to actually count cache misses?

If anyone has a few minutes to compile and run the following code, I would be interested in
knowing if you experience the same problem. I'm running an AMD64 Athlon 3200 with 1024KB cache.
I compiled with

g++ -O3 -Wall -march=k8

Compiled with gcc 3.3.4 average run time: 2.0 seconds
Compiled with gcc 3.4.3 average run time: 2.9 seconds

I've noticed even more dramatic differences in larger programs that actually do something.
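
For anyone reproducing the numbers: 2.0 s versus 2.9 s works out to roughly 45% longer, which sits inside the 33 to 50% range quoted in the code comment. Below is a minimal sketch of how an average run time and a relative slowdown could be computed. The helper names `mean_cpu_seconds` and `percent_slower` are made up for illustration and are not part of the original test program.

```cpp
#include <cassert>
#include <ctime>

// Hypothetical helper: mean CPU seconds per call of `work`, averaged over `runs` calls.
// Uses clock(), the same timer as the test program, so it measures CPU time, not wall time.
template <typename F>
double mean_cpu_seconds(F work, int runs) {
    assert(runs > 0);
    std::clock_t start = std::clock();
    for (int r = 0; r < runs; ++r)
        work();
    std::clock_t stop = std::clock();
    return static_cast<double>(stop - start) / CLOCKS_PER_SEC / runs;
}

// Hypothetical helper: how much longer `slow` is than `fast`, as a percentage of `fast`.
double percent_slower(double fast, double slow) {
    assert(fast > 0.0);
    return (slow - fast) / fast * 100.0;
}
```

Calling `percent_slower(2.0, 2.9)` gives about 45. Averaging over several runs helps separate a real compiler difference from run-to-run noise.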

I would be interested in answers to the following questions:

1) Is this observed only on AMD64, or also on x86?
2) How does gcc 4.0.0 do?
3) Are there compiler options that would improve performance? (None that I've tried did.)
4) What changed between gcc 3.3 and 3.4 to cause this?

If you have any spare time, I think this is an interesting example, and worth the effort for
someone to figure out. I'm afraid my compiler expertise is not sufficient, so I am asking for
some help. Thanks.



Code:

// run time is anywhere from 33 to 50 % longer when compiled with gcc 3.4.3 compared to 3.3.4
// compiled with g++ -O3 -Wall -march=k8 (same performance lag observed with -O2)
//
// Objects are created in a hierarchy of classes.
// When referenced, it seems that the pointer lookups
// must cause more cache misses in gcc 3.4.3 binaries.

#include <stdio.h>
#include <time.h>   // clock(), clock_t, CLOCKS_PER_SEC
#include <vector>

class mytype_A {
public:
    int id;
    mytype_A() : id(0) {}
};

class mytype_B {
public:
    mytype_A* A;
    mytype_B(mytype_A* p) : A(p) {}
};

class mytype_C {
public:
    mytype_B* B;
    mytype_C(mytype_B* p) : B(p) {}
};


class mytype_D {
public:
    // mytype_C* C[2]; // less performance difference if we use simple arrays
    std::vector<mytype_C*> C;
    int junk[3]; // affects performance (must cause cache misses)

public:
    mytype_D(mytype_A* a0, mytype_A* a1) {
        // C[0] = new mytype_C(new mytype_B(a0));
        // C[1] = new mytype_C(new mytype_B(a0));
        C.push_back(new mytype_C(new mytype_B(a0)));
        C.push_back(new mytype_C(new mytype_B(a0)));
    }
};



int main() {
    const int k = 5000;   // run-time not linear in k
    mytype_A* A[k + 1];   // k+1 entries: the loops below touch A[0] through A[k]
    mytype_D* D[k];
    for (int i = 0; i <= k; i++)
        A[i] = new mytype_A();
    for (int i = 0; i < k; i++)
        D[i] = new mytype_D(A[i], A[k - i]); // intentionally make some pointers farther apart

    clock_t before = clock();

    int k0 = 0;
    for (int i = 0; i < k; i++) {
        k0 = 0;
        for (int j = 0; j < k; j++) { // run through list of D's, and reference pointers
            mytype_D* d = D[j];
            if (d->C[0]->B->A->id) k0++;
            if (d->C[1]->B->A->id) k0++;
        }
    }
    printf("%d\n", k0); // don't allow compiler to optimize away k0

    printf("time: %f\n", (double)(clock() - before) / CLOCKS_PER_SEC);

    return 0;
}
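
The comments in the program guess that the slowdown comes from cache misses and note that the `junk[3]` member changes the timing. One rough way to probe that guess, without counting misses directly, is to look at object sizes and at how far apart heap allocations land. The sketch below is only that: the struct names, the helper names, and the 64-byte line size are assumptions for illustration, and nothing here measures actual cache behaviour.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical stand-ins for mytype_D without and with the junk member.
struct D_without_junk {
    std::vector<void*> C;
};
struct D_with_junk {
    std::vector<void*> C;
    int junk[3];
};

// Assumed cache line size in bytes; a typical value, not queried from the CPU.
const std::size_t kAssumedLineBytes = 64;

// Number of whole objects of the given size that fit in one assumed cache line.
std::size_t objects_per_line(std::size_t object_bytes) {
    assert(object_bytes > 0);
    return kAssumedLineBytes / object_bytes;
}

// Absolute distance in bytes between two addresses.
std::size_t byte_distance(const void* a, const void* b) {
    std::uintptr_t x = reinterpret_cast<std::uintptr_t>(a);
    std::uintptr_t y = reinterpret_cast<std::uintptr_t>(b);
    return static_cast<std::size_t>(x > y ? x - y : y - x);
}

// Count neighbouring pointers that are at least one assumed line apart.
std::size_t count_far_neighbours(const std::vector<const void*>& ptrs) {
    std::size_t far = 0;
    for (std::size_t i = 1; i < ptrs.size(); ++i)
        if (byte_distance(ptrs[i - 1], ptrs[i]) >= kAssumedLineBytes)
            ++far;
    return far;
}
```

Feeding the `D[j]` pointers (or the `C[0]->B` pointers) from each compiler's binary into `count_far_neighbours` and comparing the counts would show whether the two builds lay objects out differently in memory, which is one possible explanation and not a confirmed cause.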
 
You do know that gcc 4.0.0 has been released and that it has a much
better optimizer.

http://gcc.gnu.org/gcc-4.0/changes.html

General Optimizer Improvements

* The tree ssa branch has been merged. This merge has brought in a
  completely new optimization framework based on a higher level
  intermediate representation than the existing RTL representation.
  Numerous new code transformations based on the new framework are
  available in GCC 4.0, including:
    o Scalar replacement of aggregates
    o Constant propagation
    o Value range propagation
    o Partial redundancy elimination
    o Load and store motion
    o Strength reduction
    o Dead store elimination
    o Dead and unreachable code elimination
    o Autovectorization
    o Loop interchange
    o Tail recursion by accumulation

Many of these passes outperform their counterparts from previous
GCC releases.
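
To give a feel for what one of the listed passes does, here is a hand-written, source-level illustration of "tail recursion by accumulation". It is not compiler output, and the function names are invented for this example; it only shows the general shape of the rewrite, in which a multiply that is pending after the recursive call gets folded into an accumulator so the recursion can become a loop.

```cpp
#include <cassert>

// Recursive form: the multiply happens after the recursive call returns,
// so the call is not in tail position as written.
unsigned long factorial_recursive(unsigned n) {
    if (n <= 1) return 1;
    return n * factorial_recursive(n - 1);
}

// Roughly the shape an optimizer aims for: the pending multiply is carried
// in an accumulator and the recursion is replaced by a loop.
unsigned long factorial_accumulated(unsigned n) {
    unsigned long acc = 1;
    while (n > 1) {
        acc *= n;
        --n;
    }
    return acc;
}
```

Both functions return the same values; the second form avoids growing the call stack. Whether a given GCC version actually performs this rewrite on a given function depends on the optimization level and the code.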
 
Theoretically. It seems to do pretty well so far on PPC, but
early benchmarks on AMD x86-64 and Intel show that gcc 3.4.x
is faster. I'm sure that will change later.
 