Kenneth Massey
Here is a post I made to the g++ newsgroup, and it has been reported back that the performance
lag doesn't happen on the Pentium architectures. Maybe somebody here can try it on their AMD64.
------------
I was noticing significantly worse performance in some of my C++ codes compiled with gcc 3.4.3
as compared to gcc 3.3.4. I have boiled it down to one relatively short program that illustrates the problem.
It seems to be an issue of excessive cache misses in certain pointer lookup operations in gcc
3.4.3 binaries. BTW, are there any tools to actually count cache misses?
If anyone has a few minutes to compile and run the following code, I would be interested in
knowing if you experience the same problems. I'm running an AMD64 Athlon 3200 with 1024 KB of cache. I
compiled with
g++ -O3 -Wall -march=k8
Compiled with gcc 3.3.4 average run time: 2.0 seconds
Compiled with gcc 3.4.3 average run time: 2.9 seconds
I've noticed even more dramatic differences in larger codes that actually do something.
I would be interested in answering the following questions:
1) is this observed only on AMD64, or also x86 ?
2) how does gcc 4.0.0 do?
3) are there compiler options that would improve performance? (none that I've tried did)
4) what changed between gcc 3.3 and 3.4 to cause this?
If you have any spare time, I think this is an interesting example, and worth the effort for
someone to figure out. I'm afraid my compiler expertise is not sufficient, so I am asking for
some help. Thanks.
Code:
// run time is anywhere from 33 to 50% longer when compiled with gcc 3.4.3 compared to 3.3.4
// compiled with g++ -O3 -Wall -march=k8 (same performance lag observed with -O2)
//
// Objects are created in a hierarchy of classes.
// When referenced, it seems that the pointer lookups
// must cause more cache misses in gcc 3.4.3 binaries.
#include <stdio.h>
#include <time.h> // clock(), clock_t, CLOCKS_PER_SEC
#include <vector>

class mytype_A {
public:
  int id;
  mytype_A():id(0) {}
};

class mytype_B {
public:
  mytype_A* A;
  mytype_B(mytype_A* p):A(p) {}
};

class mytype_C {
public:
  mytype_B* B;
  mytype_C(mytype_B* p):B(p) {}
};

class mytype_D {
public:
  // mytype_C* C[2]; // less performance difference if we use simple arrays
  std::vector<mytype_C*> C;
  int junk[3]; // affects performance (must cause cache misses)
public:
  // note: a1 is never used; both entries are built from a0
  mytype_D(mytype_A* a0, mytype_A* a1) {
    // C[0] = new mytype_C(new mytype_B(a0));
    // C[1] = new mytype_C(new mytype_B(a0));
    C.push_back(new mytype_C(new mytype_B(a0)));
    C.push_back(new mytype_C(new mytype_B(a0)));
  }
};

int main() {
  int k = 5000; // run-time not linear in k
  mytype_A* A[k+1]; // k+1 entries, because the loops below touch A[k]
  mytype_D* D[k];
  for (int i=0;i<=k;i++)
    A[i] = new mytype_A();
  for (int i=0;i<k;i++)
    D[i] = new mytype_D(A[i],A[k-i]); // intentionally make some pointers farther apart

  clock_t before = clock();
  int k0 = 0;
  for (int i=0;i<k;i++) {
    k0 = 0;
    for (int j=0;j<k;j++) { // run through list of D's, and reference pointers
      mytype_D* d = D[j];
      if (d->C[0]->B->A->id) k0++;
      if (d->C[1]->B->A->id) k0++;
    }
  }
  printf("%d\n",k0); // don't allow compiler to optimize away k0
  printf("time: %f\n",(double)(clock()-before)/CLOCKS_PER_SEC);
  return 0;
}