Bern McCarty
I have run an experiment to try to learn some things about floating point
performance in managed C++. I am using Visual Studio
2003. I was hoping to get a feel for whether it would make sense to punch
out from managed code to native code (I was using IJW) in order to do some
amount of floating point work and, if so, roughly how much floating point
work it takes before that becomes worthwhile.
To attempt to do this I made a program that applies a 3x3 matrix to an array
of 3D points (all doubles here, folks). The program
contains a function that applies 10 different matrices to the same test data
set of 5,000,000 3D points. It does this by invoking
another workhorse function that does the actual floating point operations.
That function takes an input array of 3D points, an
output array of 3D points, a point count, and the matrix to use. There are
no __gc types in this program. It's just pointers and
structs and native arrays. The outer test function looks like this:
void test_applyMatrixToDPoints(TestData *tdP, int ptsPerMultiply)
{
    int jIterations = tdP->pointCnt / ptsPerMultiply;

    for (int i = 0; i < tdP->matrixCnt; ++i)
    {
        for (int j = 0; j < jIterations; ++j)
        {
            // managed-to-native transitions happen here in V2
            DMatrix3d_multiplyDPoint3dArray(tdP->matrices + i,
                                            &tdP->outPts[j * ptsPerMultiply],
                                            &tdP->inPts[j * ptsPerMultiply],
                                            ptsPerMultiply);
        }
    }
}
The program calls the above routine 8 times and records the time elapsed
during each call. On the first call the above function
calls the workhorse function only once for each of the 10 matrices. In
other words, it applies a matrix to all of the 5,000,000
points in the test data set with a single call to the other workhorse
function. In the next call to the above function it passes
only 50,000 points per call to the other routine, then 5,000, then 500, et
cetera, until we get all of the way down to 5, and then finally 1, where
there is a function call to DMatrix3d_multiplyDPoint3dArray() for each and
every one of the 5,000,000 3D points in the test data set.
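In case it helps to picture the driver, here is roughly what it looks like
(a sketch, not my exact code: runAllGranularities is a made-up name, the
crude clock() timing stands in for whatever timer the real harness uses, and
the middle entries of the granularity list are only my guess at how the 8
runs were filled in):

#include <cstdio>
#include <ctime>

void runAllGranularities(TestData *tdP)
{
    // Points passed per call to DMatrix3d_multiplyDPoint3dArray for each of
    // the 8 timed runs (middle entries guessed; see note above).
    const int granularities[] = { 5000000, 50000, 5000, 500, 50, 10, 5, 1 };
    const int runCount = sizeof(granularities) / sizeof(granularities[0]);

    for (int g = 0; g < runCount; ++g)
    {
        clock_t start = clock();
        test_applyMatrixToDPoints(tdP, granularities[g]);
        double seconds = double(clock() - start) / CLOCKS_PER_SEC;
        printf("%8d points per call: %.3f s\n", granularities[g], seconds);
    }
}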
I was hoping someone could help interpret the results. At first I made 3
versions of this program. In all 3 of these versions
the DMatrix3d_multiplyDPoint3dArray function was in a geometry.dll and the
rest of the code was in my test.exe. The 3 versions
were merely different combinations of native versus IL for the two
executables:
         test.exe    geometry.dll (contains workhorse function)
         --------    -------------------------------------------
    v1)  native      native
    v2)  managed     native
    v3)  managed     managed
Here are the results. All numbers are elapsed time in seconds for calls to
the outer function described.
Native->Native:
0.953
0.968
0.968
0.953
0.968
0.952
1.093
1.39
Final run is 146% of first run.
Final run is 127% of previous run.
Managed->Native:
0.968
0.968
0.968
0.969
0.968
0.968
1.124
1.952
Final run is 202% of first run.
Final run is 174% of previous run.
Managed->Managed:
0.984
1.016
0.985
1
1
1.032
1.516
4.469
Final run is 454% of first run.
Final run is 295% of previous run.
This surprised me in two ways. First, I thought that for version 2 the
penalty imposed by managed->native transitions would be
worse. It's there; you can see performance drop off more as the call
granularity becomes very fine toward the end, but it isn't as bad as I might
have guessed it would be. More surprising was that the managed->managed
version, which didn't have any managed->native transitions slowing it down
at all, dropped off far worse!
The early calls to the test function compare very
closely between versions 2 and 3, suggesting that the raw floating point
performance of the managed versus native workhorse
function is quite similar. So this seemed to point the finger at function
call overhead. Is function call overhead simply higher for managed code than
for native? On a hunch I decided to make a fourth version of the program
that was also
managed->managed but which eliminated the inter-assembly call. Instead I
just linked everything from geometry.dll right into
test.exe. It made a big difference. The results are below. Is there some
security/stack-walking stuff going on in the inter-DLL
case maybe? Or does it really make sense that managed, inter-assembly calls
are that much slower than the equivalent
intra-assembly call? Explanations welcomed. The inter-assembly version
takes 217% of the time that the intra-assembly version
takes on the final call when the call granularity is fine. That seems
awfully harsh.
Managed->Managed (one big test.exe):
1
0.999
0.984
1.015
0.984
1.015
1.093
2.061
Final run is 206% of first run.
Final run is 189% of previous run.
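As a rough back-of-envelope figure (my arithmetic, from the numbers above):
at the finest granularity the workhorse function is called 10 x 5,000,000 =
50,000,000 times, so the gap between the inter-assembly and intra-assembly
managed runs (4.469 - 2.061 = 2.408 seconds) works out to roughly 48
nanoseconds of extra cost per call, on top of the roughly 20 nanoseconds per
call that the intra-assembly managed case itself adds over its own
coarse-granularity baseline of about 1 second.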
Even with the improvement yielded by eliminating the inter-assembly calls,
the relative performance of the version that has to make managed->native
transitions versus the all-managed version is difficult for me to
comprehend. What is it about managed->managed function call overhead that
makes it seem even worse than managed->native function call overhead?
I tried to make sure that page faults weren't affecting my test runs, and
the results I got were very consistent from run to run.
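The page-fault precaution amounts to touching the data before timing
anything, along these lines (a sketch of the idea only; touchTestData is a
made-up name):

void touchTestData(TestData *tdP)
{
    // Read every input point and write every output point once so the pages
    // are already resident before the measured runs start.
    volatile double sink = 0.0;
    for (int i = 0; i < tdP->pointCnt; ++i)
    {
        sink = sink + tdP->inPts[i].x;   // fault in the input pages
        tdP->outPts[i].x = 0.0;          // fault in the output pages
    }
}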
Bern McCarty
Bentley Systems, Inc.
P.S. For the curious, here is what DMatrix3d_multiplyDPoint3dArray looks
like. There are no function calls made and it is all compiled into IL.
void DMatrix3d_multiplyDPoint3dArray
(
const DMatrix3d *pMatrix,
DPoint3d        *pResult,
const DPoint3d  *pPoint,
int              numPoint
)
{
    int       i;
    double    x, y, z;
    DPoint3d *pResultPoint;

    // Apply the 3x3 matrix to each input point and store the result.
    for (i = 0, pResultPoint = pResult;
         i < numPoint;
         i++, pResultPoint++
         )
    {
        x = pPoint[i].x;
        y = pPoint[i].y;
        z = pPoint[i].z;
        pResultPoint->x = pMatrix->column[0].x * x
                        + pMatrix->column[1].x * y
                        + pMatrix->column[2].x * z;
        pResultPoint->y = pMatrix->column[0].y * x
                        + pMatrix->column[1].y * y
                        + pMatrix->column[2].y * z;
        pResultPoint->z = pMatrix->column[0].z * x
                        + pMatrix->column[1].z * y
                        + pMatrix->column[2].z * z;
    }
}