Unwinding a loop: why can't .NET add 1 + 2 + 3 ... ?

  • Thread starter Thread starter Mountain Bikn' Guy
  • Start date Start date
Since you asked...

I could use some help with CodeDOM. Here is how I am generating
these loops:

//
//TODO: generate this entire expression using CodeDOM classes:
StringBuilder formulaString = new StringBuilder();
formulaString.Append("dataPoints[row] = 0");
if (columns != null)
{
formulaString.Capacity = 40 * (columns.Count + 2);
for (int j = 0; j < columns.Count; ++j)
{
if ( ((MyColumnType)columns[j]).Role == RoleType.Independent)
{
formulaString.Append(System.Environment.NewLine);
formulaString.Append("+ coefficients[" + j + "] * table[row][" +
j + "] "); }
}
}
formulaString.Append(";");
//

The rest of the stuff (namespace, class, outer for loop, return
statement, etc.) is all generated purely without code snippet
expressions (with 1 exception -- see bug below). Anyone want to
show me how to do the above expression without code snippets?
(Note the need to break the lines to keep them below the 2046
char limit.)

The one other issue I am having trouble with is this:

compute.ReturnType = new CodeTypeReference("public float[]");
//suggested workaround for bug below
compute.Attributes |= MemberAttributes.Public;//has a bug

Is there an easy, language neutral way to work around the
MemberAttributes.Public bug when I return an array?

Dave,

[I don't mean to come across as arrogant in the post below. This is
just my long-winded way of saying that there may be a much simpler
solution to your problem.]

I've read most of the messages in the threads that deal with the
problems you're runing into. I'm not going to pretend that I know
what all of your requirements are, or the reason why you're writing
this code (learning exercise or production code?). But I get the
feeling that you may be trying to pound a round peg through a square
hole.

Lately I've become a big fan of Occam's Razor
(http://pespmc1.vub.ac.be/OCCAMRAZ.html). Dynamically generating and
compiling code is usually the slowest and most complex way to
implement a solution. It's almost always my last choice. When
designing a solution to a complex and/or large problem, my solution
generally comes from a list like this (in order of simplicity):

1. structured (or non-polymorphic OOP code) code w/ a fixed data
structure
2. same code as #1, but w/ a dynamic data structure
3. polymorphic OOP and/or interfaced code w/ a fixed data structure
4. same code as #3, but w/ a dynamic data structure
5. database solution (SQL)
6. potentially very complex code and data (writing a custom
compiler, lengthy regular expressions, using CodeDOM, etc.)

Another way of looking at this list:

1. simple code and data.
2. simple code and moderately complex data.
3. moderately complex code and simple data.
4. moderately complex code and data.
5. range of simple-to-complex code and data, but has an external
dependency on a database engine. May also have performance
drawbacks.
6. very complex code and data.

Too often I've learned (the hard way) that a complex solution to a
seemingly simple problem is often an indicator of a design flaw.
Using lists like this has helped me catch design errors like that and
fix them before they infect the rest of the app.

Just my $0.02...

Chris.
 
I really don't want to make this thread any longer than it already is but I
found Chris' comments (above) interesting. And I thought I would add my own
$0.02.

If the nature of this app is such that every bit of performance is needed,
to the extent that even loops have to be unrolled (something that's usually
found in game engines but not business apps), I would question the use of
..NET. Using C++ (non-managed) you will shave a whole layer which will yield
further performance advantages if well done.

- Santiago

Mountain Bikn' Guy said:
Since you asked...

I could use some help with CodeDOM. Here is how I am generating these loops:

//
//TODO: generate this entire expression using CodeDOM classes:
StringBuilder formulaString = new StringBuilder();
formulaString.Append("dataPoints[row] = 0");
if (columns != null)
{
formulaString.Capacity = 40 * (columns.Count + 2);
for (int j = 0; j < columns.Count; ++j)
{
if ( ((MyColumnType)columns[j]).Role == RoleType.Independent)
{
formulaString.Append(System.Environment.NewLine);
formulaString.Append("+ coefficients[" + j + "] * table[row][" + j + "] ");
}
}
}
formulaString.Append(";");
//

The rest of the stuff (namespace, class, outer for loop, return statement,
etc.) is all generated purely without code snippet expressions (with 1
exception -- see bug below). Anyone want to show me how to do the above
expression without code snippets? (Note the need to break the lines to keep
them below the 2046 char limit.)

The one other issue I am having trouble with is this:

compute.ReturnType = new CodeTypeReference("public float[]"); //suggested
workaround for bug below
compute.Attributes |= MemberAttributes.Public;//has a bug

Is there an easy, language neutral way to work around the
MemberAttributes.Public bug when I return an array?

TIA,
Dave



anonymouse said:
Did you make these unrolled loops manually or did you use a CodeDom to
generate them?

Curious.

Why do it the hard way:D
thought
in
up
a
series of terms and it produces the correct result.

// sum numbers with a loop
public int DoSumLooping(int iterations)
{
int result = 0;
for(int i = 1;i <=iterations;i++)
{
result += i;
}
return result;
}

Now translate this into a specific solution that doesn't use looping (and
use the same value for the number of iterations the loop performs). This
code returns an incorrect result. The method consists entirely of a very
straightforward code statement, but in this case .NET adds the numbers
incorrectly.
public double ComputeSum( )
{
// Brute force sum method
// For iterations == 10000
double sum = 0+ 1+ 2+ 3+ 4+ 5+ 6+ 7+ 8+ 9+ 10+ 11+ 12+ 13+ 14+ 15+
...
+ 9997+ 9998+ 9999+ 10000;
return sum;
}
The above method returns an incorrect result with any number of terms
above
about 200. It will correctly add 1 + 2 + ... + 200, but it will NOT
correctly add 1 + 2 + ... + 1000.

I have just run across this, and I have not yet researched the possible
reasons for this behavior. It may be a known issue related to either stack
size or the length of a code line, but to my knowledge it hasn't been
discussed in any of the "popular" literature on C# and .NET. I need to
write
code like this, so if anyone has already encountered this issue, please
advise me.


Here's another example that also creates problems, but of a somewhat
different nature. Take the following code and translate it into a
specific,
non-looping method and try to execute it using reflection. It fails.

public double LoopToCompute()
{
double sumOfProducts = 0;
double grandTotal = 0;
for (int i = 0; i < maxRows; ++i)
{
for (int j = 0; j < maxCols; ++j)
{
sumOfProducts += coeff[j] * table[j];
}
a_point = sumOfProducts;
grandTotal += sumOfProducts;
sumOfProducts = 0;
}
return grandTotal;
}//LoopToCompute

The above code works -- but it's equivalent code with loops unrolled
(shown
below) doesn't work unless the maxRows is set very small. For small
values,
the 2 methods (above and below) produce identical results. There is
nothing
"wrong" with the code in that sense. It's similar to the above situation.
If
the "size" of the code statement or the number of code statements is too
large, .NET fails. In this case (using reflection) it doesn't return the
incorrect result, as the first example did. In this case, reflection calls
it an invalid program and refuses to run it (but only when the value of
maxRows is above about 250). The reason for this is probably
straightforward. However, I have the need to make statements like
this
for
performance reasons so I need a work-around. Any suggestions are
appreciated! All comments are appreciated.

public double DoBruteForceCompute()
{
double bruteForceSum = 0;

point1=coeff1*table[0][0] +coeff2*table[0][1] +coeff3*table[0][2]
+coeff4*table[0][3] +coeff5*table[0][4] +coeff6*table[0][5]
+coeff7*table[0][6] +coeff8*table[0][7] +coeff9*table[0][8]
+coeff10*table[0][9] +coeff11*table[0][10] +coeff12*table[0][11]
+coeff13*table[0][12] +coeff14*table[0][13] +coeff15*table[0][14]
+coeff16*table[0][15] +coeff17*table[0][16] +coeff18*table[0][17]
+coeff19*table[0][18] +coeff20*table[0][19] +coeff21*table[0][20]
+coeff22*table[0][21] +coeff23*table[0][22] +coeff24*table[0][23]
+coeff25*table[0][24] +coeff26*table[0][25] +coeff27*table[0][26]
+coeff28*table[0][27] +coeff29*table[0][28] +coeff30*table[0][29]
+coeff31*table[0][30] +coeff32*table[0][31] +coeff33*table[0][32]
+coeff34*table[0][33] +coeff35*table[0][34] ;

point2=coeff1*table[1][0] +coeff2*table[1][1] +coeff3*table[1][2]
+coeff4*table[1][3] +coeff5*table[1][4] +coeff6*table[1][5]
+coeff7*table[1][6] +coeff8*table[1][7] +coeff9*table[1][8]
+coeff10*table[1][9] +coeff11*table[1][10] +coeff12*table[1][11]
+coeff13*table[1][12] +coeff14*table[1][13] +coeff15*table[1][14]
+coeff16*table[1][15] +coeff17*table[1][16] +coeff18*table[1][17]
+coeff19*table[1][18] +coeff20*table[1][19] +coeff21*table[1][20]
+coeff22*table[1][21] +coeff23*table[1][22] +coeff24*table[1][23]
+coeff25*table[1][24] +coeff26*table[1][25] +coeff27*table[1][26]
+coeff28*table[1][27] +coeff29*table[1][28] +coeff30*table[1][29]
+coeff31*table[1][30] +coeff32*table[1][31] +coeff33*table[1][32]
+coeff34*table[1][33] +coeff35*table[1][34] ;


[...]

point500=coeff1*table[499][0] +coeff2*table[499][1] +coeff3*table[499][2]
+coeff4*table[499][3] +coeff5*table[499][4] +coeff6*table[499][5]
+coeff7*table[499][6] +coeff8*table[499][7] +coeff9*table[499][8]
+coeff10*table[499][9] +coeff11*table[499][10] +coeff12*table[499][11]
+coeff13*table[499][12] +coeff14*table[499][13] +coeff15*table[499][14]
+coeff16*table[499][15] +coeff17*table[499][16] +coeff18*table[499][17]
+coeff19*table[499][18] +coeff20*table[499][19] +coeff21*table[499][20]
+coeff22*table[499][21] +coeff23*table[499][22] +coeff24*table[499][23]
+coeff25*table[499][24] +coeff26*table[499][25] +coeff27*table[499][26]
+coeff28*table[499][27] +coeff29*table[499][28] +coeff30*table[499][29]
+coeff31*table[499][30] +coeff32*table[499][31] +coeff33*table[499][32]
+coeff34*table[499][33] +coeff35*table[499][34] ;

bruteForceSum =
point1 +
point2 + ... +

point499 +
point500
;

return bruteForceSum;

}//DoBruteForceCompute

 
sample program sent!
Thanks for your interest!
Dave

Eric Gunnerson said:
Comments inline

--
Eric Gunnerson

Visit the C# product team at http://www.csharp.net
Eric's blog is at http://blogs.gotdotnet.com/ericgu/

This posting is provided "AS IS" with no warranties, and confers no rights.
Mountain Bikn' Guy said:
In the event anyone wishes to review this thread now or later, I thought I
would provide some conclusions that resulted from everyone's input and from
my reading:

1. unrolling loops often doesn't provide a performance gain on large (real
life size?) problems. In my experience, the unrolled loop was 5-10 times
SLOWER than the simple nested loop that I used as a reference. I found the
following quote in Eric Gunnerson's book (2nd ed). He said (about unrolled
loops), a "function is so big it doesn't fit into cache and therefore gets
slower performance."

Jon added the following info:
I suspect he means the processor's instruction
cache - if it's a small loop, the processor (not JIT, note!) can decode
the instructions once and keep the microcode available for further
iterations.

I do mean the processor cache, but it's not necessarily just an instruction
effect.

Jan Gray's excellent article on managed code performance has a lot more
detail on this:

http://msdn.microsoft.com/library/?url=/library/en-us/dndotnet/html/fastmanagedcode.asp
2. VS.NET has a 2048 char line limit in code files. In my experience, when
using runtime code generation and compiling, this can cause unusual trouble
if one isn't aware of it. I was able to compile and run an application that
was created from a code file with a line longer than this limit. The only
error I encountered was that the mathematical expression in the executing
program returned incorrect results! (The fix was simply inserting line
breaks into the code file.)

3. VS.NET has some compiler bugs that will result in the following error
message:
Additional information: Common Language Runtime detected an invalid program.

This appears to happen on complex mathematical expressions. However, in my
case the expression itself was not that complex. It was just long (but each
line wasn't too long). When I reached about 1000 subexpressions in a simple
sum of products expression, I encountered this compiler bug. Search on
Google Groups using the error message, and you'll see that others have
encountered it (frequently) when dealing with complex mathematical
expressions.

To my knowledge, there is no solution for this. Anyone from Microsoft care
to comment?

I haven't heard of this before. Can you send me a program that demonstrates
the problem?
(e-mail address removed)
That about sums up (no pun intended) what I've learned so far. HTH.
Dave


Mountain Bikn' Guy said:
Take some standard code such as shown below. It simply loops to add up a
series of terms and it produces the correct result.

// sum numbers with a loop
public int DoSumLooping(int iterations)
{
int result = 0;
for(int i = 1;i <=iterations;i++)
{
result += i;
}
return result;
}

Now translate this into a specific solution that doesn't use looping (and
use the same value for the number of iterations the loop performs). This
code returns an incorrect result. The method consists entirely of a very
straightforward code statement, but in this case .NET adds the numbers
incorrectly.
public double ComputeSum( )
{
// Brute force sum method
// For iterations == 10000
double sum = 0+ 1+ 2+ 3+ 4+ 5+ 6+ 7+ 8+ 9+ 10+ 11+ 12+ 13+ 14+
15+
...
+ 9997+ 9998+ 9999+ 10000;
return sum;
}
The above method returns an incorrect result with any number of terms above
about 200. It will correctly add 1 + 2 + ... + 200, but it will NOT
correctly add 1 + 2 + ... + 1000.

I have just run across this, and I have not yet researched the possible
reasons for this behavior. It may be a known issue related to either stack
size or the length of a code line, but to my knowledge it hasn't been
discussed in any of the "popular" literature on C# and .NET. I need to write
code like this, so if anyone has already encountered this issue, please
advise me.


Here's another example that also creates problems, but of a somewhat
different nature. Take the following code and translate it into a specific,
non-looping method and try to execute it using reflection. It fails.

public double LoopToCompute()
{
double sumOfProducts = 0;
double grandTotal = 0;
for (int i = 0; i < maxRows; ++i)
{
for (int j = 0; j < maxCols; ++j)
{
sumOfProducts += coeff[j] * table[j];
}
a_point = sumOfProducts;
grandTotal += sumOfProducts;
sumOfProducts = 0;
}
return grandTotal;
}//LoopToCompute

The above code works -- but it's equivalent code with loops unrolled (shown
below) doesn't work unless the maxRows is set very small. For small values,
the 2 methods (above and below) produce identical results. There is nothing
"wrong" with the code in that sense. It's similar to the above
situation.
If
the "size" of the code statement or the number of code statements is too
large, .NET fails. In this case (using reflection) it doesn't return the
incorrect result, as the first example did. In this case, reflection calls
it an invalid program and refuses to run it (but only when the value of
maxRows is above about 250). The reason for this is probably
straightforward. However, I have the need to make statements like this for
performance reasons so I need a work-around. Any suggestions are
appreciated! All comments are appreciated.

public double DoBruteForceCompute()
{
double bruteForceSum = 0;

point1=coeff1*table[0][0] +coeff2*table[0][1] +coeff3*table[0][2]
+coeff4*table[0][3] +coeff5*table[0][4] +coeff6*table[0][5]
+coeff7*table[0][6] +coeff8*table[0][7] +coeff9*table[0][8]
+coeff10*table[0][9] +coeff11*table[0][10] +coeff12*table[0][11]
+coeff13*table[0][12] +coeff14*table[0][13] +coeff15*table[0][14]
+coeff16*table[0][15] +coeff17*table[0][16] +coeff18*table[0][17]
+coeff19*table[0][18] +coeff20*table[0][19] +coeff21*table[0][20]
+coeff22*table[0][21] +coeff23*table[0][22] +coeff24*table[0][23]
+coeff25*table[0][24] +coeff26*table[0][25] +coeff27*table[0][26]
+coeff28*table[0][27] +coeff29*table[0][28] +coeff30*table[0][29]
+coeff31*table[0][30] +coeff32*table[0][31] +coeff33*table[0][32]
+coeff34*table[0][33] +coeff35*table[0][34] ;

point2=coeff1*table[1][0] +coeff2*table[1][1] +coeff3*table[1][2]
+coeff4*table[1][3] +coeff5*table[1][4] +coeff6*table[1][5]
+coeff7*table[1][6] +coeff8*table[1][7] +coeff9*table[1][8]
+coeff10*table[1][9] +coeff11*table[1][10] +coeff12*table[1][11]
+coeff13*table[1][12] +coeff14*table[1][13] +coeff15*table[1][14]
+coeff16*table[1][15] +coeff17*table[1][16] +coeff18*table[1][17]
+coeff19*table[1][18] +coeff20*table[1][19] +coeff21*table[1][20]
+coeff22*table[1][21] +coeff23*table[1][22] +coeff24*table[1][23]
+coeff25*table[1][24] +coeff26*table[1][25] +coeff27*table[1][26]
+coeff28*table[1][27] +coeff29*table[1][28] +coeff30*table[1][29]
+coeff31*table[1][30] +coeff32*table[1][31] +coeff33*table[1][32]
+coeff34*table[1][33] +coeff35*table[1][34] ;


[...]

point500=coeff1*table[499][0] +coeff2*table[499][1] +coeff3*table[499][2]
+coeff4*table[499][3] +coeff5*table[499][4] +coeff6*table[499][5]
+coeff7*table[499][6] +coeff8*table[499][7] +coeff9*table[499][8]
+coeff10*table[499][9] +coeff11*table[499][10] +coeff12*table[499][11]
+coeff13*table[499][12] +coeff14*table[499][13] +coeff15*table[499][14]
+coeff16*table[499][15] +coeff17*table[499][16] +coeff18*table[499][17]
+coeff19*table[499][18] +coeff20*table[499][19] +coeff21*table[499][20]
+coeff22*table[499][21] +coeff23*table[499][22] +coeff24*table[499][23]
+coeff25*table[499][24] +coeff26*table[499][25] +coeff27*table[499][26]
+coeff28*table[499][27] +coeff29*table[499][28] +coeff30*table[499][29]
+coeff31*table[499][30] +coeff32*table[499][31] +coeff33*table[499][32]
+coeff34*table[499][33] +coeff35*table[499][34] ;

bruteForceSum =
point1 +
point2 + ... +

point499 +
point500
;

return bruteForceSum;

}//DoBruteForceCompute

 
My results so far indicate that C# can be just as fast as C++ on
mathematical expressions. (If I need to, I use pointers.) I don't see any
limitations that indicate C# can't do the job. It's a great language, IMO.

Santiago said:
I really don't want to make this thread any longer than it already is but I
found Chris' comments (above) interesting. And I thought I would add my own
$0.02.

If the nature of this app is such that every bit of performance is needed,
to the extent that even loops have to be unrolled (something that's usually
found in game engines but not business apps), I would question the use of
.NET. Using C++ (non-managed) you will shave a whole layer which will yield
further performance advantages if well done.

- Santiago

Mountain Bikn' Guy said:
Since you asked...

I could use some help with CodeDOM. Here is how I am generating these loops:

//
//TODO: generate this entire expression using CodeDOM classes:
StringBuilder formulaString = new StringBuilder();
formulaString.Append("dataPoints[row] = 0");
if (columns != null)
{
formulaString.Capacity = 40 * (columns.Count + 2);
for (int j = 0; j < columns.Count; ++j)
{
if ( ((MyColumnType)columns[j]).Role == RoleType.Independent)
{
formulaString.Append(System.Environment.NewLine);
formulaString.Append("+ coefficients[" + j + "] * table[row][" + j + "] ");
}
}
}
formulaString.Append(";");
//

The rest of the stuff (namespace, class, outer for loop, return statement,
etc.) is all generated purely without code snippet expressions (with 1
exception -- see bug below). Anyone want to show me how to do the above
expression without code snippets? (Note the need to break the lines to keep
them below the 2046 char limit.)

The one other issue I am having trouble with is this:

compute.ReturnType = new CodeTypeReference("public float[]"); //suggested
workaround for bug below
compute.Attributes |= MemberAttributes.Public;//has a bug

Is there an easy, language neutral way to work around the
MemberAttributes.Public bug when I return an array?

TIA,
Dave



anonymouse said:
Did you make these unrolled loops manually or did you use a CodeDom to
generate them?

Curious.

Why do it the hard way:D


In the event anyone wishes to review this thread now or later, I
thought
I
would provide some conclusions that resulted from everyone's input and
from
my reading:

1. unrolling loops often doesn't provide a performance gain on large (real
life size?) problems. In my experience, the unrolled loop was 5-10 times
SLOWER than the simple nested loop that I used as a reference. I
found
the
following quote in Eric Gunnerson's book (2nd ed). He said (about unrolled
loops), a "function is so big it doesn't fit into cache and
therefore
gets
slower performance."

Jon added the following info:
I suspect he means the processor's instruction
cache - if it's a small loop, the processor (not JIT, note!) can decode
the instructions once and keep the microcode available for further
iterations.

2. VS.NET has a 2048 char line limit in code files. In my
experience,
when
using runtime code generation and compiling, this can cause unusual
trouble
if one isn't aware of it. I was able to compile and run an application
that
was created from a code file with a line longer than this limit. The only
error I encountered was that the mathematical expression in the executing
program returned incorrect results! (The fix was simply inserting line
breaks into the code file.)

3. VS.NET has some compiler bugs that will result in the following error
message:
Additional information: Common Language Runtime detected an invalid
program.

This appears to happen on complex mathematical expressions. However,
in
my
case the expression itself was not that complex. It was just long (but
each
line wasn't too long). When I reached about 1000 subexpressions in a
simple
sum of products expression, I encountered this compiler bug. Search on
Google Groups using the error message, and you'll see that others have
encountered it (frequently) when dealing with complex mathematical
expressions.

To my knowledge, there is no solution for this. Anyone from
Microsoft
care
to comment?

That about sums up (no pun intended) what I've learned so far. HTH.
Dave


Take some standard code such as shown below. It simply loops to
add
up performs).
This a
very 14+
15+
need
to
write
code like this, so if anyone has already encountered this issue, please
advise me.


Here's another example that also creates problems, but of a somewhat
different nature. Take the following code and translate it into a
specific,
non-looping method and try to execute it using reflection. It fails.

public double LoopToCompute()
{
double sumOfProducts = 0;
double grandTotal = 0;
for (int i = 0; i < maxRows; ++i)
{
for (int j = 0; j < maxCols; ++j)
{
sumOfProducts += coeff[j] * table[j];
}
a_point = sumOfProducts;
grandTotal += sumOfProducts;
sumOfProducts = 0;
}
return grandTotal;
}//LoopToCompute

The above code works -- but it's equivalent code with loops unrolled
(shown
below) doesn't work unless the maxRows is set very small. For small
values,
the 2 methods (above and below) produce identical results. There is
nothing
"wrong" with the code in that sense. It's similar to the above
situation.
If
the "size" of the code statement or the number of code statements
is
too
large, .NET fails. In this case (using reflection) it doesn't
return
the
incorrect result, as the first example did. In this case, reflection
calls
it an invalid program and refuses to run it (but only when the
value
of
maxRows is above about 250). The reason for this is probably
straightforward. However, I have the need to make statements like this
for
performance reasons so I need a work-around. Any suggestions are
appreciated! All comments are appreciated.

public double DoBruteForceCompute()
{
double bruteForceSum = 0;

point1=coeff1*table[0][0] +coeff2*table[0][1] +coeff3*table[0][2]
+coeff4*table[0][3] +coeff5*table[0][4] +coeff6*table[0][5]
+coeff7*table[0][6] +coeff8*table[0][7] +coeff9*table[0][8]
+coeff10*table[0][9] +coeff11*table[0][10] +coeff12*table[0][11]
+coeff13*table[0][12] +coeff14*table[0][13] +coeff15*table[0][14]
+coeff16*table[0][15] +coeff17*table[0][16] +coeff18*table[0][17]
+coeff19*table[0][18] +coeff20*table[0][19] +coeff21*table[0][20]
+coeff22*table[0][21] +coeff23*table[0][22] +coeff24*table[0][23]
+coeff25*table[0][24] +coeff26*table[0][25] +coeff27*table[0][26]
+coeff28*table[0][27] +coeff29*table[0][28] +coeff30*table[0][29]
+coeff31*table[0][30] +coeff32*table[0][31] +coeff33*table[0][32]
+coeff34*table[0][33] +coeff35*table[0][34] ;

point2=coeff1*table[1][0] +coeff2*table[1][1] +coeff3*table[1][2]
+coeff4*table[1][3] +coeff5*table[1][4] +coeff6*table[1][5]
+coeff7*table[1][6] +coeff8*table[1][7] +coeff9*table[1][8]
+coeff10*table[1][9] +coeff11*table[1][10] +coeff12*table[1][11]
+coeff13*table[1][12] +coeff14*table[1][13] +coeff15*table[1][14]
+coeff16*table[1][15] +coeff17*table[1][16] +coeff18*table[1][17]
+coeff19*table[1][18] +coeff20*table[1][19] +coeff21*table[1][20]
+coeff22*table[1][21] +coeff23*table[1][22] +coeff24*table[1][23]
+coeff25*table[1][24] +coeff26*table[1][25] +coeff27*table[1][26]
+coeff28*table[1][27] +coeff29*table[1][28] +coeff30*table[1][29]
+coeff31*table[1][30] +coeff32*table[1][31] +coeff33*table[1][32]
+coeff34*table[1][33] +coeff35*table[1][34] ;


[...]

point500=coeff1*table[499][0] +coeff2*table[499][1]
+coeff3*table[499][2]
+coeff4*table[499][3] +coeff5*table[499][4] +coeff6*table[499][5]
+coeff7*table[499][6] +coeff8*table[499][7] +coeff9*table[499][8]
+coeff10*table[499][9] +coeff11*table[499][10] +coeff12*table[499][11]
+coeff13*table[499][12] +coeff14*table[499][13] +coeff15*table[499][14]
+coeff16*table[499][15] +coeff17*table[499][16] +coeff18*table[499][17]
+coeff19*table[499][18] +coeff20*table[499][19] +coeff21*table[499][20]
+coeff22*table[499][21] +coeff23*table[499][22] +coeff24*table[499][23]
+coeff25*table[499][24] +coeff26*table[499][25] +coeff27*table[499][26]
+coeff28*table[499][27] +coeff29*table[499][28] +coeff30*table[499][29]
+coeff31*table[499][30] +coeff32*table[499][31] +coeff33*table[499][32]
+coeff34*table[499][33] +coeff35*table[499][34] ;

bruteForceSum =
point1 +
point2 + ... +

point499 +
point500
;

return bruteForceSum;

}//DoBruteForceCompute

 
so long as you want to be dependent on m$oft patents and copyrights
and libraries that are not open ( unless you use mono ).


Mountain Bikn' Guy said:
My results so far indicate that C# can be just as fast as C++ on
mathematical expressions. (If I need to, I use pointers.) I don't see any
limitations that indicate C# can't do the job. It's a great language, IMO.

Santiago said:
I really don't want to make this thread any longer than it already is
but
I
found Chris' comments (above) interesting. And I thought I would add my own
$0.02.

If the nature of this app is such that every bit of performance is needed,
to the extent that even loops have to be unrolled (something that's usually
found in game engines but not business apps), I would question the use of
.NET. Using C++ (non-managed) you will shave a whole layer which will yield
further performance advantages if well done.

- Santiago

Mountain Bikn' Guy said:
Since you asked...

I could use some help with CodeDOM. Here is how I am generating these loops:

//
//TODO: generate this entire expression using CodeDOM classes:
StringBuilder formulaString = new StringBuilder();
formulaString.Append("dataPoints[row] = 0");
if (columns != null)
{
formulaString.Capacity = 40 * (columns.Count + 2);
for (int j = 0; j < columns.Count; ++j)
{
if ( ((MyColumnType)columns[j]).Role == RoleType.Independent)
{
formulaString.Append(System.Environment.NewLine);
formulaString.Append("+ coefficients[" + j + "] * table[row][" + j +
"]
");
}
}
}
formulaString.Append(";");
//

The rest of the stuff (namespace, class, outer for loop, return statement,
etc.) is all generated purely without code snippet expressions (with 1
exception -- see bug below). Anyone want to show me how to do the above
expression without code snippets? (Note the need to break the lines to keep
them below the 2046 char limit.)

The one other issue I am having trouble with is this:

compute.ReturnType = new CodeTypeReference("public float[]"); //suggested
workaround for bug below
compute.Attributes |= MemberAttributes.Public;//has a bug

Is there an easy, language neutral way to work around the
MemberAttributes.Public bug when I return an array?

TIA,
Dave



Did you make these unrolled loops manually or did you use a CodeDom to
generate them?

Curious.

Why do it the hard way:D


In the event anyone wishes to review this thread now or later, I thought
I
would provide some conclusions that resulted from everyone's input and
from
my reading:

1. unrolling loops often doesn't provide a performance gain on large
(real
life size?) problems. In my experience, the unrolled loop was 5-10 times
SLOWER than the simple nested loop that I used as a reference. I found
the
following quote in Eric Gunnerson's book (2nd ed). He said (about
unrolled
loops), a "function is so big it doesn't fit into cache and therefore
gets
slower performance."

Jon added the following info:
I suspect he means the processor's instruction
cache - if it's a small loop, the processor (not JIT, note!) can decode
the instructions once and keep the microcode available for further
iterations.

2. VS.NET has a 2048 char line limit in code files. In my experience,
when
using runtime code generation and compiling, this can cause unusual
trouble
if one isn't aware of it. I was able to compile and run an application
that
was created from a code file with a line longer than this limit. The
only
error I encountered was that the mathematical expression in the
executing
program returned incorrect results! (The fix was simply inserting line
breaks into the code file.)

3. VS.NET has some compiler bugs that will result in the following error
message:
Additional information: Common Language Runtime detected an invalid
program.

This appears to happen on complex mathematical expressions.
However,
in
my
case the expression itself was not that complex. It was just long (but
each
line wasn't too long). When I reached about 1000 subexpressions in a
simple
sum of products expression, I encountered this compiler bug.
Search
of
a
very
straightforward code statement, but in this case .NET adds the numbers
incorrectly.
public double ComputeSum( )
{
// Brute force sum method
// For iterations == 10000
double sum = 0+ 1+ 2+ 3+ 4+ 5+ 6+ 7+ 8+ 9+ 10+ 11+ 12+ 13+ 14+
15+
...
+ 9997+ 9998+ 9999+ 10000;
return sum;
}
The above method returns an incorrect result with any number of terms
above
about 200. It will correctly add 1 + 2 + ... + 200, but it will NOT
correctly add 1 + 2 + ... + 1000.

I have just run across this, and I have not yet researched the
possible
reasons for this behavior. It may be a known issue related to either
stack
size or the length of a code line, but to my knowledge it hasn't been
discussed in any of the "popular" literature on C# and .NET. I
need
to
write
code like this, so if anyone has already encountered this issue,
please
advise me.


Here's another example that also creates problems, but of a somewhat
different nature. Take the following code and translate it into a
specific,
non-looping method and try to execute it using reflection. It fails.

public double LoopToCompute()
{
double sumOfProducts = 0;
double grandTotal = 0;
for (int i = 0; i < maxRows; ++i)
{
for (int j = 0; j < maxCols; ++j)
{
sumOfProducts += coeff[j] * table[j];
}
a_point = sumOfProducts;
grandTotal += sumOfProducts;
sumOfProducts = 0;
}
return grandTotal;
}//LoopToCompute

The above code works -- but it's equivalent code with loops unrolled
(shown
below) doesn't work unless the maxRows is set very small. For small
values,
the 2 methods (above and below) produce identical results. There is
nothing
"wrong" with the code in that sense. It's similar to the above
situation.
If
the "size" of the code statement or the number of code
statements
is
too
large, .NET fails. In this case (using reflection) it doesn't return
the
incorrect result, as the first example did. In this case, reflection
calls
it an invalid program and refuses to run it (but only when the value
of
maxRows is above about 250). The reason for this is probably
straightforward. However, I have the need to make statements
like
this
for
performance reasons so I need a work-around. Any suggestions are
appreciated! All comments are appreciated.

public double DoBruteForceCompute()
{
double bruteForceSum = 0;

point1=coeff1*table[0][0] +coeff2*table[0][1] +coeff3*table[0][2]
+coeff4*table[0][3] +coeff5*table[0][4] +coeff6*table[0][5]
+coeff7*table[0][6] +coeff8*table[0][7] +coeff9*table[0][8]
+coeff10*table[0][9] +coeff11*table[0][10] +coeff12*table[0][11]
+coeff13*table[0][12] +coeff14*table[0][13] +coeff15*table[0][14]
+coeff16*table[0][15] +coeff17*table[0][16] +coeff18*table[0][17]
+coeff19*table[0][18] +coeff20*table[0][19] +coeff21*table[0][20]
+coeff22*table[0][21] +coeff23*table[0][22] +coeff24*table[0][23]
+coeff25*table[0][24] +coeff26*table[0][25] +coeff27*table[0][26]
+coeff28*table[0][27] +coeff29*table[0][28] +coeff30*table[0][29]
+coeff31*table[0][30] +coeff32*table[0][31] +coeff33*table[0][32]
+coeff34*table[0][33] +coeff35*table[0][34] ;

point2=coeff1*table[1][0] +coeff2*table[1][1] +coeff3*table[1][2]
+coeff4*table[1][3] +coeff5*table[1][4] +coeff6*table[1][5]
+coeff7*table[1][6] +coeff8*table[1][7] +coeff9*table[1][8]
+coeff10*table[1][9] +coeff11*table[1][10] +coeff12*table[1][11]
+coeff13*table[1][12] +coeff14*table[1][13] +coeff15*table[1][14]
+coeff16*table[1][15] +coeff17*table[1][16] +coeff18*table[1][17]
+coeff19*table[1][18] +coeff20*table[1][19] +coeff21*table[1][20]
+coeff22*table[1][21] +coeff23*table[1][22] +coeff24*table[1][23]
+coeff25*table[1][24] +coeff26*table[1][25] +coeff27*table[1][26]
+coeff28*table[1][27] +coeff29*table[1][28] +coeff30*table[1][29]
+coeff31*table[1][30] +coeff32*table[1][31] +coeff33*table[1][32]
+coeff34*table[1][33] +coeff35*table[1][34] ;


[...]

point500=coeff1*table[499][0] +coeff2*table[499][1]
+coeff3*table[499][2]
+coeff4*table[499][3] +coeff5*table[499][4] +coeff6*table[499][5]
+coeff7*table[499][6] +coeff8*table[499][7] +coeff9*table[499][8]
+coeff10*table[499][9] +coeff11*table[499][10] +coeff12*table[499][11]
+coeff13*table[499][12] +coeff14*table[499][13]
+coeff15*table[499][14]
+coeff16*table[499][15] +coeff17*table[499][16]
+coeff18*table[499][17]
+coeff19*table[499][18] +coeff20*table[499][19]
+coeff21*table[499][20]
+coeff22*table[499][21] +coeff23*table[499][22]
+coeff24*table[499][23]
+coeff25*table[499][24] +coeff26*table[499][25]
+coeff27*table[499][26]
+coeff28*table[499][27] +coeff29*table[499][28]
+coeff30*table[499][29]
+coeff31*table[499][30] +coeff32*table[499][31]
+coeff33*table[499][32]
+coeff34*table[499][33] +coeff35*table[499][34] ;

bruteForceSum =
point1 +
point2 + ... +

point499 +
point500
;

return bruteForceSum;

}//DoBruteForceCompute
 
inline

--
David Notario
Software Design Engineer, CLR JIT Compiler
http://devdiary.xplsv.com



Mountain Bikn' Guy said:
In the event anyone wishes to review this thread now or later, I thought I
would provide some conclusions that resulted from everyone's input and from
my reading:

1. unrolling loops often doesn't provide a performance gain on large (real
life size?) problems. In my experience, the unrolled loop was 5-10 times
SLOWER than the simple nested loop that I used as a reference. I found the
following quote in Eric Gunnerson's book (2nd ed). He said (about unrolled
loops), a "function is so big it doesn't fit into cache and therefore gets
slower performance."

Jon added the following info:
I suspect he means the processor's instruction
cache - if it's a small loop, the processor (not JIT, note!) can decode
the instructions once and keep the microcode available for further
iterations.

Unrolling should be performed by the compiler. It only makes sense for
extremely tight loops, where loop overhead matters (loop overhead is
typically 2 cycles, the cost of 'inc, cmp, jcc'. Unrolling more than 2-4
times has been something that has not been really that great since 386/486.
(any modern processor has branch prediction)
2. VS.NET has a 2048 char line limit in code files. In my experience, when
using runtime code generation and compiling, this can cause unusual trouble
if one isn't aware of it. I was able to compile and run an application that
was created from a code file with a line longer than this limit. The only
error I encountered was that the mathematical expression in the executing
program returned incorrect results! (The fix was simply inserting line
breaks into the code file.)

This would be a bug, attach a repro and send to the C# guys.
3. VS.NET has some compiler bugs that will result in the following error
message:
Additional information: Common Language Runtime detected an invalid program.

This appears to happen on complex mathematical expressions. However, in my
case the expression itself was not that complex. It was just long (but each
line wasn't too long). When I reached about 1000 subexpressions in a simple
sum of products expression, I encountered this compiler bug. Search on
Google Groups using the error message, and you'll see that others have
encountered it (frequently) when dealing with complex mathematical
expressions.

To my knowledge, there is no solution for this. Anyone from Microsoft care
to comment?

The JIT has a limit on local variables. You can get there in 2 ways, having
a lot of local variables in your code (goes over the 65534 IL limit) or the
compiler generating temps for evaluating expressions. Human written code
usually never hits these limits, although we do see them once in a while in
machine generated code (as nobody is really 'typing' down all that code).
When the JIT 'overflows', it will return an Invalid Program exception. The
only advice I can currently give you at the moment is write smaller
functions. I suspect you are hitting the temp limit.
 
Many people have commented on why the big statement might be hard for the
compiler to digest so I'm not going to add to that. But I thought I should
talk about loop unrolling a bit.

Remember the purpose of loop unrolling is to reduce the cost of testing the
loop variables on each iteration and the associated control flow. That's
all the overhead there is.

Generally when you unroll a loop you take several iterations and put it
directly into the body. Hopefully you can do this easily because you know
(for instance) that the number of iterations is always a multiple of 10, or
something like that (there's of course other ways too).

OK, so far all is goodness and joy. Here comes the but.

In the world of modern processors, bigger code is often slower code. At
some point the savings that you gained by having less control flow are not
sufficient to overcome the costs of extra cache misses due to having to load
in more code for the bigger (unrolled) algorithm.

So, the sweet spot is in between, you need to unroll enough to reduce the
cost (remember if you only unroll 10 iterations that's still at 90%
reduction in control flow cost, you'll have to unroll 100 iterations to get
to 99% and then 1000 to get to 99.9% -- diminishing returns if ever I saw
them) however you mustn't bloat the code to the point where you're taking an
undue number of cache misses in the course of running the program. I find
it unlikely that more than 10 unrolls would ever really be worth it, but of
course you'd have to measure to be sure for your case.

--
This posting is provided "AS IS" with no warranties, and confers no rights.

Rico Mariani
CLR Performance Architect




Mountain Bikn' Guy said:
Take some standard code such as shown below. It simply loops to add up a
series of terms and it produces the correct result.

// sum numbers with a loop
public int DoSumLooping(int iterations)
{
int result = 0;
for(int i = 1;i <=iterations;i++)
{
result += i;
}
return result;
}

Now translate this into a specific solution that doesn't use looping (and
use the same value for the number of iterations the loop performs). This
code returns an incorrect result. The method consists entirely of a very
straightforward code statement, but in this case .NET adds the numbers
incorrectly.
public double ComputeSum( )
{
// Brute force sum method
// For iterations == 10000
double sum = 0+ 1+ 2+ 3+ 4+ 5+ 6+ 7+ 8+ 9+ 10+ 11+ 12+ 13+ 14+ 15+ ....
+ 9997+ 9998+ 9999+ 10000;
return sum;
}
The above method returns an incorrect result with any number of terms above
about 200. It will correctly add 1 + 2 + ... + 200, but it will NOT
correctly add 1 + 2 + ... + 1000.

I have just run across this, and I have not yet researched the possible
reasons for this behavior. It may be a known issue related to either stack
size or the length of a code line, but to my knowledge it hasn't been
discussed in any of the "popular" literature on C# and .NET. I need to write
code like this, so if anyone has already encountered this issue, please
advise me.


Here's another example that also creates problems, but of a somewhat
different nature. Take the following code and translate it into a specific,
non-looping method and try to execute it using reflection. It fails.

public double LoopToCompute()
{
double sumOfProducts = 0;
double grandTotal = 0;
for (int i = 0; i < maxRows; ++i)
{
for (int j = 0; j < maxCols; ++j)
{
sumOfProducts += coeff[j] * table[j];
}
a_point = sumOfProducts;
grandTotal += sumOfProducts;
sumOfProducts = 0;
}
return grandTotal;
}//LoopToCompute

The above code works -- but it's equivalent code with loops unrolled (shown
below) doesn't work unless the maxRows is set very small. For small values,
the 2 methods (above and below) produce identical results. There is nothing
"wrong" with the code in that sense. It's similar to the above situation. If
the "size" of the code statement or the number of code statements is too
large, .NET fails. In this case (using reflection) it doesn't return the
incorrect result, as the first example did. In this case, reflection calls
it an invalid program and refuses to run it (but only when the value of
maxRows is above about 250). The reason for this is probably
straightforward. However, I have the need to make statements like this for
performance reasons so I need a work-around. Any suggestions are
appreciated! All comments are appreciated.

public double DoBruteForceCompute()
{
double bruteForceSum = 0;

point1=coeff1*table[0][0] +coeff2*table[0][1] +coeff3*table[0][2]
+coeff4*table[0][3] +coeff5*table[0][4] +coeff6*table[0][5]
+coeff7*table[0][6] +coeff8*table[0][7] +coeff9*table[0][8]
+coeff10*table[0][9] +coeff11*table[0][10] +coeff12*table[0][11]
+coeff13*table[0][12] +coeff14*table[0][13] +coeff15*table[0][14]
+coeff16*table[0][15] +coeff17*table[0][16] +coeff18*table[0][17]
+coeff19*table[0][18] +coeff20*table[0][19] +coeff21*table[0][20]
+coeff22*table[0][21] +coeff23*table[0][22] +coeff24*table[0][23]
+coeff25*table[0][24] +coeff26*table[0][25] +coeff27*table[0][26]
+coeff28*table[0][27] +coeff29*table[0][28] +coeff30*table[0][29]
+coeff31*table[0][30] +coeff32*table[0][31] +coeff33*table[0][32]
+coeff34*table[0][33] +coeff35*table[0][34] ;

point2=coeff1*table[1][0] +coeff2*table[1][1] +coeff3*table[1][2]
+coeff4*table[1][3] +coeff5*table[1][4] +coeff6*table[1][5]
+coeff7*table[1][6] +coeff8*table[1][7] +coeff9*table[1][8]
+coeff10*table[1][9] +coeff11*table[1][10] +coeff12*table[1][11]
+coeff13*table[1][12] +coeff14*table[1][13] +coeff15*table[1][14]
+coeff16*table[1][15] +coeff17*table[1][16] +coeff18*table[1][17]
+coeff19*table[1][18] +coeff20*table[1][19] +coeff21*table[1][20]
+coeff22*table[1][21] +coeff23*table[1][22] +coeff24*table[1][23]
+coeff25*table[1][24] +coeff26*table[1][25] +coeff27*table[1][26]
+coeff28*table[1][27] +coeff29*table[1][28] +coeff30*table[1][29]
+coeff31*table[1][30] +coeff32*table[1][31] +coeff33*table[1][32]
+coeff34*table[1][33] +coeff35*table[1][34] ;


[...]

point500=coeff1*table[499][0] +coeff2*table[499][1] +coeff3*table[499][2]
+coeff4*table[499][3] +coeff5*table[499][4] +coeff6*table[499][5]
+coeff7*table[499][6] +coeff8*table[499][7] +coeff9*table[499][8]
+coeff10*table[499][9] +coeff11*table[499][10] +coeff12*table[499][11]
+coeff13*table[499][12] +coeff14*table[499][13] +coeff15*table[499][14]
+coeff16*table[499][15] +coeff17*table[499][16] +coeff18*table[499][17]
+coeff19*table[499][18] +coeff20*table[499][19] +coeff21*table[499][20]
+coeff22*table[499][21] +coeff23*table[499][22] +coeff24*table[499][23]
+coeff25*table[499][24] +coeff26*table[499][25] +coeff27*table[499][26]
+coeff28*table[499][27] +coeff29*table[499][28] +coeff30*table[499][29]
+coeff31*table[499][30] +coeff32*table[499][31] +coeff33*table[499][32]
+coeff34*table[499][33] +coeff35*table[499][34] ;

bruteForceSum =
point1 +
point2 + ... +

point499 +
point500
;

return bruteForceSum;

}//DoBruteForceCompute
 
I'm going to add one person's actual experience into these posts about the
theoretical advantages of loop unrolling on modern processors. In the case
of a nested loop (2 for-loops) each iterating hundreds or thousands of
times, here is what I found:
1. unrolling both loops resulted in slower performance -- sometimes it was
significantly slower. It was the slowest of all variations I tested.
2. unrolling just the inner loop -- even when this resulted in a LOT of code
statements -- was the fastest of all options. I compared it against the
normal loop and several other variations. The unrolled inner loop had
hundreds of statements (typically about 500 or more). Yet, it was still the
fastest of all variations I could come up with. On average, it was 50% to
100% faster than the standard nested loop code.

In the end, I took a solution that was unusable because it was so slow (8
minutes+ to react to a single number change) to a solution that responds
virtually instantly (a fraction of a second now). Many things were part of
this -- unrolling the loop was just one of many.

BTW, I did the testing on a Xeon cpu machine.

Rico Mariani said:
Many people have commented on why the big statement might be hard for the
compiler to digest so I'm not going to add to that. But I thought I should
talk about loop unrolling a bit.

Remember the purpose of loop unrolling is to reduce the cost of testing the
loop variables on each iteration and the associated control flow. That's
all the overhead there is.

Generally when you unroll a loop you take several iterations and put it
directly into the body. Hopefully you can do this easily because you know
(for instance) that the number of iterations is always a multiple of 10, or
something like that (there's of course other ways too).

OK, so far all is goodness and joy. Here comes the but.

In the world of modern processors, bigger code is often slower code. At
some point the savings that you gained by having less control flow are not
sufficient to overcome the costs of extra cache misses due to having to load
in more code for the bigger (unrolled) algorithm.

So, the sweet spot is in between, you need to unroll enough to reduce the
cost (remember if you only unroll 10 iterations that's still at 90%
reduction in control flow cost, you'll have to unroll 100 iterations to get
to 99% and then 1000 to get to 99.9% -- diminishing returns if ever I saw
them) however you mustn't bloat the code to the point where you're taking an
undue number of cache misses in the course of running the program. I find
it unlikely that more than 10 unrolls would ever really be worth it, but of
course you'd have to measure to be sure for your case.

--
This posting is provided "AS IS" with no warranties, and confers no rights.

Rico Mariani
CLR Performance Architect




Mountain Bikn' Guy said:
Take some standard code such as shown below. It simply loops to add up a
series of terms and it produces the correct result.

// sum numbers with a loop
public int DoSumLooping(int iterations)
{
int result = 0;
for(int i = 1;i <=iterations;i++)
{
result += i;
}
return result;
}

Now translate this into a specific solution that doesn't use looping (and
use the same value for the number of iterations the loop performs). This
code returns an incorrect result. The method consists entirely of a very
straightforward code statement, but in this case .NET adds the numbers
incorrectly.
public double ComputeSum( )
{
// Brute force sum method
// For iterations == 10000
double sum = 0+ 1+ 2+ 3+ 4+ 5+ 6+ 7+ 8+ 9+ 10+ 11+ 12+ 13+ 14+ 15+ ...
+ 9997+ 9998+ 9999+ 10000;
return sum;
}
The above method returns an incorrect result with any number of terms above
about 200. It will correctly add 1 + 2 + ... + 200, but it will NOT
correctly add 1 + 2 + ... + 1000.

I have just run across this, and I have not yet researched the possible
reasons for this behavior. It may be a known issue related to either stack
size or the length of a code line, but to my knowledge it hasn't been
discussed in any of the "popular" literature on C# and .NET. I need to write
code like this, so if anyone has already encountered this issue, please
advise me.


Here's another example that also creates problems, but of a somewhat
different nature. Take the following code and translate it into a specific,
non-looping method and try to execute it using reflection. It fails.

public double LoopToCompute()
{
double sumOfProducts = 0;
double grandTotal = 0;
for (int i = 0; i < maxRows; ++i)
{
for (int j = 0; j < maxCols; ++j)
{
sumOfProducts += coeff[j] * table[j];
}
a_point = sumOfProducts;
grandTotal += sumOfProducts;
sumOfProducts = 0;
}
return grandTotal;
}//LoopToCompute

The above code works -- but it's equivalent code with loops unrolled (shown
below) doesn't work unless the maxRows is set very small. For small values,
the 2 methods (above and below) produce identical results. There is nothing
"wrong" with the code in that sense. It's similar to the above
situation.
If
the "size" of the code statement or the number of code statements is too
large, .NET fails. In this case (using reflection) it doesn't return the
incorrect result, as the first example did. In this case, reflection calls
it an invalid program and refuses to run it (but only when the value of
maxRows is above about 250). The reason for this is probably
straightforward. However, I have the need to make statements like this for
performance reasons so I need a work-around. Any suggestions are
appreciated! All comments are appreciated.

public double DoBruteForceCompute()
{
double bruteForceSum = 0;

point1=coeff1*table[0][0] +coeff2*table[0][1] +coeff3*table[0][2]
+coeff4*table[0][3] +coeff5*table[0][4] +coeff6*table[0][5]
+coeff7*table[0][6] +coeff8*table[0][7] +coeff9*table[0][8]
+coeff10*table[0][9] +coeff11*table[0][10] +coeff12*table[0][11]
+coeff13*table[0][12] +coeff14*table[0][13] +coeff15*table[0][14]
+coeff16*table[0][15] +coeff17*table[0][16] +coeff18*table[0][17]
+coeff19*table[0][18] +coeff20*table[0][19] +coeff21*table[0][20]
+coeff22*table[0][21] +coeff23*table[0][22] +coeff24*table[0][23]
+coeff25*table[0][24] +coeff26*table[0][25] +coeff27*table[0][26]
+coeff28*table[0][27] +coeff29*table[0][28] +coeff30*table[0][29]
+coeff31*table[0][30] +coeff32*table[0][31] +coeff33*table[0][32]
+coeff34*table[0][33] +coeff35*table[0][34] ;

point2=coeff1*table[1][0] +coeff2*table[1][1] +coeff3*table[1][2]
+coeff4*table[1][3] +coeff5*table[1][4] +coeff6*table[1][5]
+coeff7*table[1][6] +coeff8*table[1][7] +coeff9*table[1][8]
+coeff10*table[1][9] +coeff11*table[1][10] +coeff12*table[1][11]
+coeff13*table[1][12] +coeff14*table[1][13] +coeff15*table[1][14]
+coeff16*table[1][15] +coeff17*table[1][16] +coeff18*table[1][17]
+coeff19*table[1][18] +coeff20*table[1][19] +coeff21*table[1][20]
+coeff22*table[1][21] +coeff23*table[1][22] +coeff24*table[1][23]
+coeff25*table[1][24] +coeff26*table[1][25] +coeff27*table[1][26]
+coeff28*table[1][27] +coeff29*table[1][28] +coeff30*table[1][29]
+coeff31*table[1][30] +coeff32*table[1][31] +coeff33*table[1][32]
+coeff34*table[1][33] +coeff35*table[1][34] ;


[...]

point500=coeff1*table[499][0] +coeff2*table[499][1] +coeff3*table[499][2]
+coeff4*table[499][3] +coeff5*table[499][4] +coeff6*table[499][5]
+coeff7*table[499][6] +coeff8*table[499][7] +coeff9*table[499][8]
+coeff10*table[499][9] +coeff11*table[499][10] +coeff12*table[499][11]
+coeff13*table[499][12] +coeff14*table[499][13] +coeff15*table[499][14]
+coeff16*table[499][15] +coeff17*table[499][16] +coeff18*table[499][17]
+coeff19*table[499][18] +coeff20*table[499][19] +coeff21*table[499][20]
+coeff22*table[499][21] +coeff23*table[499][22] +coeff24*table[499][23]
+coeff25*table[499][24] +coeff26*table[499][25] +coeff27*table[499][26]
+coeff28*table[499][27] +coeff29*table[499][28] +coeff30*table[499][29]
+coeff31*table[499][30] +coeff32*table[499][31] +coeff33*table[499][32]
+coeff34*table[499][33] +coeff35*table[499][34] ;

bruteForceSum =
point1 +
point2 + ... +

point499 +
point500
;

return bruteForceSum;

}//DoBruteForceCompute
 
What amount of loop unrolling does the JIT perform? Does ngen perform any
more optimisations or is it just the JIT?

--


-----------
Got TidBits?
Get it here: www.networkip.net/tidbits
Mountain Bikn' Guy said:
I'm going to add one person's actual experience into these posts about the
theoretical advantages of loop unrolling on modern processors. In the case
of a nested loop (2 for-loops) each iterating hundreds or thousands of
times, here is what I found:
1. unrolling both loops resulted in slower performance -- sometimes it was
significantly slower. It was the slowest of all variations I tested.
2. unrolling just the inner loop -- even when this resulted in a LOT of code
statements -- was the fastest of all options. I compared it against the
normal loop and several other variations. The unrolled inner loop had
hundreds of statements (typically about 500 or more). Yet, it was still the
fastest of all variations I could come up with. On average, it was 50% to
100% faster than the standard nested loop code.

In the end, I took a solution that was unusable because it was so slow (8
minutes+ to react to a single number change) to a solution that responds
virtually instantly (a fraction of a second now). Many things were part of
this -- unrolling the loop was just one of many.

BTW, I did the testing on a Xeon cpu machine.

Rico Mariani said:
Many people have commented on why the big statement might be hard for the
compiler to digest so I'm not going to add to that. But I thought I should
talk about loop unrolling a bit.

Remember the purpose of loop unrolling is to reduce the cost of testing the
loop variables on each iteration and the associated control flow. That's
all the overhead there is.

Generally when you unroll a loop you take several iterations and put it
directly into the body. Hopefully you can do this easily because you know
(for instance) that the number of iterations is always a multiple of 10, or
something like that (there's of course other ways too).

OK, so far all is goodness and joy. Here comes the but.

In the world of modern processors, bigger code is often slower code. At
some point the savings that you gained by having less control flow are not
sufficient to overcome the costs of extra cache misses due to having to load
in more code for the bigger (unrolled) algorithm.

So, the sweet spot is in between, you need to unroll enough to reduce the
cost (remember if you only unroll 10 iterations that's still at 90%
reduction in control flow cost, you'll have to unroll 100 iterations to get
to 99% and then 1000 to get to 99.9% -- diminishing returns if ever I saw
them) however you mustn't bloat the code to the point where you're
taking
an
undue number of cache misses in the course of running the program. I find
it unlikely that more than 10 unrolls would ever really be worth it, but of
course you'd have to measure to be sure for your case.

--
This posting is provided "AS IS" with no warranties, and confers no rights.

Rico Mariani
CLR Performance Architect




Mountain Bikn' Guy said:
Take some standard code such as shown below. It simply loops to add up a
series of terms and it produces the correct result.

// sum numbers with a loop
public int DoSumLooping(int iterations)
{
int result = 0;
for(int i = 1;i <=iterations;i++)
{
result += i;
}
return result;
}

Now translate this into a specific solution that doesn't use looping (and
use the same value for the number of iterations the loop performs). This
code returns an incorrect result. The method consists entirely of a very
straightforward code statement, but in this case .NET adds the numbers
incorrectly.
public double ComputeSum( )
{
// Brute force sum method
// For iterations == 10000
double sum = 0+ 1+ 2+ 3+ 4+ 5+ 6+ 7+ 8+ 9+ 10+ 11+ 12+ 13+ 14+
15+
...
+ 9997+ 9998+ 9999+ 10000;
return sum;
}
The above method returns an incorrect result with any number of terms above
about 200. It will correctly add 1 + 2 + ... + 200, but it will NOT
correctly add 1 + 2 + ... + 1000.

I have just run across this, and I have not yet researched the possible
reasons for this behavior. It may be a known issue related to either stack
size or the length of a code line, but to my knowledge it hasn't been
discussed in any of the "popular" literature on C# and .NET. I need to write
code like this, so if anyone has already encountered this issue, please
advise me.


Here's another example that also creates problems, but of a somewhat
different nature. Take the following code and translate it into a specific,
non-looping method and try to execute it using reflection. It fails.

public double LoopToCompute()
{
double sumOfProducts = 0;
double grandTotal = 0;
for (int i = 0; i < maxRows; ++i)
{
for (int j = 0; j < maxCols; ++j)
{
sumOfProducts += coeff[j] * table[j];
}
a_point = sumOfProducts;
grandTotal += sumOfProducts;
sumOfProducts = 0;
}
return grandTotal;
}//LoopToCompute

The above code works -- but it's equivalent code with loops unrolled (shown
below) doesn't work unless the maxRows is set very small. For small values,
the 2 methods (above and below) produce identical results. There is nothing
"wrong" with the code in that sense. It's similar to the above
situation.
If
the "size" of the code statement or the number of code statements is too
large, .NET fails. In this case (using reflection) it doesn't return the
incorrect result, as the first example did. In this case, reflection calls
it an invalid program and refuses to run it (but only when the value of
maxRows is above about 250). The reason for this is probably
straightforward. However, I have the need to make statements like this for
performance reasons so I need a work-around. Any suggestions are
appreciated! All comments are appreciated.

public double DoBruteForceCompute()
{
double bruteForceSum = 0;

point1=coeff1*table[0][0] +coeff2*table[0][1] +coeff3*table[0][2]
+coeff4*table[0][3] +coeff5*table[0][4] +coeff6*table[0][5]
+coeff7*table[0][6] +coeff8*table[0][7] +coeff9*table[0][8]
+coeff10*table[0][9] +coeff11*table[0][10] +coeff12*table[0][11]
+coeff13*table[0][12] +coeff14*table[0][13] +coeff15*table[0][14]
+coeff16*table[0][15] +coeff17*table[0][16] +coeff18*table[0][17]
+coeff19*table[0][18] +coeff20*table[0][19] +coeff21*table[0][20]
+coeff22*table[0][21] +coeff23*table[0][22] +coeff24*table[0][23]
+coeff25*table[0][24] +coeff26*table[0][25] +coeff27*table[0][26]
+coeff28*table[0][27] +coeff29*table[0][28] +coeff30*table[0][29]
+coeff31*table[0][30] +coeff32*table[0][31] +coeff33*table[0][32]
+coeff34*table[0][33] +coeff35*table[0][34] ;

point2=coeff1*table[1][0] +coeff2*table[1][1] +coeff3*table[1][2]
+coeff4*table[1][3] +coeff5*table[1][4] +coeff6*table[1][5]
+coeff7*table[1][6] +coeff8*table[1][7] +coeff9*table[1][8]
+coeff10*table[1][9] +coeff11*table[1][10] +coeff12*table[1][11]
+coeff13*table[1][12] +coeff14*table[1][13] +coeff15*table[1][14]
+coeff16*table[1][15] +coeff17*table[1][16] +coeff18*table[1][17]
+coeff19*table[1][18] +coeff20*table[1][19] +coeff21*table[1][20]
+coeff22*table[1][21] +coeff23*table[1][22] +coeff24*table[1][23]
+coeff25*table[1][24] +coeff26*table[1][25] +coeff27*table[1][26]
+coeff28*table[1][27] +coeff29*table[1][28] +coeff30*table[1][29]
+coeff31*table[1][30] +coeff32*table[1][31] +coeff33*table[1][32]
+coeff34*table[1][33] +coeff35*table[1][34] ;


[...]

point500=coeff1*table[499][0] +coeff2*table[499][1] +coeff3*table[499][2]
+coeff4*table[499][3] +coeff5*table[499][4] +coeff6*table[499][5]
+coeff7*table[499][6] +coeff8*table[499][7] +coeff9*table[499][8]
+coeff10*table[499][9] +coeff11*table[499][10] +coeff12*table[499][11]
+coeff13*table[499][12] +coeff14*table[499][13] +coeff15*table[499][14]
+coeff16*table[499][15] +coeff17*table[499][16] +coeff18*table[499][17]
+coeff19*table[499][18] +coeff20*table[499][19] +coeff21*table[499][20]
+coeff22*table[499][21] +coeff23*table[499][22] +coeff24*table[499][23]
+coeff25*table[499][24] +coeff26*table[499][25] +coeff27*table[499][26]
+coeff28*table[499][27] +coeff29*table[499][28] +coeff30*table[499][29]
+coeff31*table[499][30] +coeff32*table[499][31] +coeff33*table[499][32]
+coeff34*table[499][33] +coeff35*table[499][34] ;

bruteForceSum =
point1 +
point2 + ... +

point499 +
point500
;

return bruteForceSum;

}//DoBruteForceCompute

 
Ignore this idiot. He is spoofing my email account.



--


-----------
Got TidBits?
Get it here: www.networkip.net/tidbits
Alvin Bruney said:
What amount of loop unrolling does the JIT perform? Does ngen perform any
more optimisations or is it just the JIT?

--


-----------
Got TidBits?
Get it here: www.networkip.net/tidbits
Mountain Bikn' Guy said:
I'm going to add one person's actual experience into these posts about the
theoretical advantages of loop unrolling on modern processors. In the case
of a nested loop (2 for-loops) each iterating hundreds or thousands of
times, here is what I found:
1. unrolling both loops resulted in slower performance -- sometimes it was
significantly slower. It was the slowest of all variations I tested.
2. unrolling just the inner loop -- even when this resulted in a LOT of code
statements -- was the fastest of all options. I compared it against the
normal loop and several other variations. The unrolled inner loop had
hundreds of statements (typically about 500 or more). Yet, it was still the
fastest of all variations I could come up with. On average, it was 50% to
100% faster than the standard nested loop code.

In the end, I took a solution that was unusable because it was so slow (8
minutes+ to react to a single number change) to a solution that responds
virtually instantly (a fraction of a second now). Many things were part of
this -- unrolling the loop was just one of many.

BTW, I did the testing on a Xeon cpu machine.

testing
the 10,
or to
load to
get taking but
of
up
a
series of terms and it produces the correct result.

// sum numbers with a loop
public int DoSumLooping(int iterations)
{
int result = 0;
for(int i = 1;i <=iterations;i++)
{
result += i;
}
return result;
}

Now translate this into a specific solution that doesn't use looping (and
use the same value for the number of iterations the loop performs). This
code returns an incorrect result. The method consists entirely of a very
straightforward code statement, but in this case .NET adds the numbers
incorrectly.
public double ComputeSum( )
{
// Brute force sum method
// For iterations == 10000
double sum = 0+ 1+ 2+ 3+ 4+ 5+ 6+ 7+ 8+ 9+ 10+ 11+ 12+ 13+ 14+ 15+
...
+ 9997+ 9998+ 9999+ 10000;
return sum;
}
The above method returns an incorrect result with any number of terms
above
about 200. It will correctly add 1 + 2 + ... + 200, but it will NOT
correctly add 1 + 2 + ... + 1000.

I have just run across this, and I have not yet researched the possible
reasons for this behavior. It may be a known issue related to either stack
size or the length of a code line, but to my knowledge it hasn't been
discussed in any of the "popular" literature on C# and .NET. I need to
write
code like this, so if anyone has already encountered this issue, please
advise me.


Here's another example that also creates problems, but of a somewhat
different nature. Take the following code and translate it into a
specific,
non-looping method and try to execute it using reflection. It fails.

public double LoopToCompute()
{
double sumOfProducts = 0;
double grandTotal = 0;
for (int i = 0; i < maxRows; ++i)
{
for (int j = 0; j < maxCols; ++j)
{
sumOfProducts += coeff[j] * table[j];
}
a_point = sumOfProducts;
grandTotal += sumOfProducts;
sumOfProducts = 0;
}
return grandTotal;
}//LoopToCompute

The above code works -- but it's equivalent code with loops unrolled
(shown
below) doesn't work unless the maxRows is set very small. For small
values,
the 2 methods (above and below) produce identical results. There is
nothing
"wrong" with the code in that sense. It's similar to the above situation.
If
the "size" of the code statement or the number of code statements is too
large, .NET fails. In this case (using reflection) it doesn't return the
incorrect result, as the first example did. In this case, reflection calls
it an invalid program and refuses to run it (but only when the value of
maxRows is above about 250). The reason for this is probably
straightforward. However, I have the need to make statements like
this
for
performance reasons so I need a work-around. Any suggestions are
appreciated! All comments are appreciated.

public double DoBruteForceCompute()
{
double bruteForceSum = 0;

point1=coeff1*table[0][0] +coeff2*table[0][1] +coeff3*table[0][2]
+coeff4*table[0][3] +coeff5*table[0][4] +coeff6*table[0][5]
+coeff7*table[0][6] +coeff8*table[0][7] +coeff9*table[0][8]
+coeff10*table[0][9] +coeff11*table[0][10] +coeff12*table[0][11]
+coeff13*table[0][12] +coeff14*table[0][13] +coeff15*table[0][14]
+coeff16*table[0][15] +coeff17*table[0][16] +coeff18*table[0][17]
+coeff19*table[0][18] +coeff20*table[0][19] +coeff21*table[0][20]
+coeff22*table[0][21] +coeff23*table[0][22] +coeff24*table[0][23]
+coeff25*table[0][24] +coeff26*table[0][25] +coeff27*table[0][26]
+coeff28*table[0][27] +coeff29*table[0][28] +coeff30*table[0][29]
+coeff31*table[0][30] +coeff32*table[0][31] +coeff33*table[0][32]
+coeff34*table[0][33] +coeff35*table[0][34] ;

point2=coeff1*table[1][0] +coeff2*table[1][1] +coeff3*table[1][2]
+coeff4*table[1][3] +coeff5*table[1][4] +coeff6*table[1][5]
+coeff7*table[1][6] +coeff8*table[1][7] +coeff9*table[1][8]
+coeff10*table[1][9] +coeff11*table[1][10] +coeff12*table[1][11]
+coeff13*table[1][12] +coeff14*table[1][13] +coeff15*table[1][14]
+coeff16*table[1][15] +coeff17*table[1][16] +coeff18*table[1][17]
+coeff19*table[1][18] +coeff20*table[1][19] +coeff21*table[1][20]
+coeff22*table[1][21] +coeff23*table[1][22] +coeff24*table[1][23]
+coeff25*table[1][24] +coeff26*table[1][25] +coeff27*table[1][26]
+coeff28*table[1][27] +coeff29*table[1][28] +coeff30*table[1][29]
+coeff31*table[1][30] +coeff32*table[1][31] +coeff33*table[1][32]
+coeff34*table[1][33] +coeff35*table[1][34] ;


[...]

point500=coeff1*table[499][0] +coeff2*table[499][1] +coeff3*table[499][2]
+coeff4*table[499][3] +coeff5*table[499][4] +coeff6*table[499][5]
+coeff7*table[499][6] +coeff8*table[499][7] +coeff9*table[499][8]
+coeff10*table[499][9] +coeff11*table[499][10] +coeff12*table[499][11]
+coeff13*table[499][12] +coeff14*table[499][13] +coeff15*table[499][14]
+coeff16*table[499][15] +coeff17*table[499][16] +coeff18*table[499][17]
+coeff19*table[499][18] +coeff20*table[499][19] +coeff21*table[499][20]
+coeff22*table[499][21] +coeff23*table[499][22] +coeff24*table[499][23]
+coeff25*table[499][24] +coeff26*table[499][25] +coeff27*table[499][26]
+coeff28*table[499][27] +coeff29*table[499][28] +coeff30*table[499][29]
+coeff31*table[499][30] +coeff32*table[499][31] +coeff33*table[499][32]
+coeff34*table[499][33] +coeff35*table[499][34] ;

bruteForceSum =
point1 +
point2 + ... +

point499 +
point500
;

return bruteForceSum;

}//DoBruteForceCompute

 
Yes Ignore that idiot, like he thinks he owns the internet :D

--


-----------
Got TidBits?
Get it here: www.networkip.net/tidbits
Alvin Bruney said:
Ignore this idiot. He is spoofing my email account.



--


-----------
Got TidBits?
Get it here: www.networkip.net/tidbits
Alvin Bruney said:
What amount of loop unrolling does the JIT perform? Does ngen perform any
more optimisations or is it just the JIT?

--


-----------
Got TidBits?
Get it here: www.networkip.net/tidbits
of
code still
the
part
of for
the you
know are
not reduce
the I
saw I
find
add
up performs).
This a
very 14+
15+
need
to
write
code like this, so if anyone has already encountered this issue, please
advise me.


Here's another example that also creates problems, but of a somewhat
different nature. Take the following code and translate it into a
specific,
non-looping method and try to execute it using reflection. It fails.

public double LoopToCompute()
{
double sumOfProducts = 0;
double grandTotal = 0;
for (int i = 0; i < maxRows; ++i)
{
for (int j = 0; j < maxCols; ++j)
{
sumOfProducts += coeff[j] * table[j];
}
a_point = sumOfProducts;
grandTotal += sumOfProducts;
sumOfProducts = 0;
}
return grandTotal;
}//LoopToCompute

The above code works -- but it's equivalent code with loops unrolled
(shown
below) doesn't work unless the maxRows is set very small. For small
values,
the 2 methods (above and below) produce identical results. There is
nothing
"wrong" with the code in that sense. It's similar to the above
situation.
If
the "size" of the code statement or the number of code statements
is
too
large, .NET fails. In this case (using reflection) it doesn't
return
the
incorrect result, as the first example did. In this case, reflection
calls
it an invalid program and refuses to run it (but only when the
value
of
maxRows is above about 250). The reason for this is probably
straightforward. However, I have the need to make statements like this
for
performance reasons so I need a work-around. Any suggestions are
appreciated! All comments are appreciated.

public double DoBruteForceCompute()
{
double bruteForceSum = 0;

point1=coeff1*table[0][0] +coeff2*table[0][1] +coeff3*table[0][2]
+coeff4*table[0][3] +coeff5*table[0][4] +coeff6*table[0][5]
+coeff7*table[0][6] +coeff8*table[0][7] +coeff9*table[0][8]
+coeff10*table[0][9] +coeff11*table[0][10] +coeff12*table[0][11]
+coeff13*table[0][12] +coeff14*table[0][13] +coeff15*table[0][14]
+coeff16*table[0][15] +coeff17*table[0][16] +coeff18*table[0][17]
+coeff19*table[0][18] +coeff20*table[0][19] +coeff21*table[0][20]
+coeff22*table[0][21] +coeff23*table[0][22] +coeff24*table[0][23]
+coeff25*table[0][24] +coeff26*table[0][25] +coeff27*table[0][26]
+coeff28*table[0][27] +coeff29*table[0][28] +coeff30*table[0][29]
+coeff31*table[0][30] +coeff32*table[0][31] +coeff33*table[0][32]
+coeff34*table[0][33] +coeff35*table[0][34] ;

point2=coeff1*table[1][0] +coeff2*table[1][1] +coeff3*table[1][2]
+coeff4*table[1][3] +coeff5*table[1][4] +coeff6*table[1][5]
+coeff7*table[1][6] +coeff8*table[1][7] +coeff9*table[1][8]
+coeff10*table[1][9] +coeff11*table[1][10] +coeff12*table[1][11]
+coeff13*table[1][12] +coeff14*table[1][13] +coeff15*table[1][14]
+coeff16*table[1][15] +coeff17*table[1][16] +coeff18*table[1][17]
+coeff19*table[1][18] +coeff20*table[1][19] +coeff21*table[1][20]
+coeff22*table[1][21] +coeff23*table[1][22] +coeff24*table[1][23]
+coeff25*table[1][24] +coeff26*table[1][25] +coeff27*table[1][26]
+coeff28*table[1][27] +coeff29*table[1][28] +coeff30*table[1][29]
+coeff31*table[1][30] +coeff32*table[1][31] +coeff33*table[1][32]
+coeff34*table[1][33] +coeff35*table[1][34] ;


[...]

point500=coeff1*table[499][0] +coeff2*table[499][1]
+coeff3*table[499][2]
+coeff4*table[499][3] +coeff5*table[499][4] +coeff6*table[499][5]
+coeff7*table[499][6] +coeff8*table[499][7] +coeff9*table[499][8]
+coeff10*table[499][9] +coeff11*table[499][10] +coeff12*table[499][11]
+coeff13*table[499][12] +coeff14*table[499][13] +coeff15*table[499][14]
+coeff16*table[499][15] +coeff17*table[499][16] +coeff18*table[499][17]
+coeff19*table[499][18] +coeff20*table[499][19] +coeff21*table[499][20]
+coeff22*table[499][21] +coeff23*table[499][22] +coeff24*table[499][23]
+coeff25*table[499][24] +coeff26*table[499][25] +coeff27*table[499][26]
+coeff28*table[499][27] +coeff29*table[499][28] +coeff30*table[499][29]
+coeff31*table[499][30] +coeff32*table[499][31] +coeff33*table[499][32]
+coeff34*table[499][33] +coeff35*table[499][34] ;

bruteForceSum =
point1 +
point2 + ... +

point499 +
point500
;

return bruteForceSum;

}//DoBruteForceCompute

 
Back
Top