C# (float) cast is costly for speed if not used appropriately

Arnie

Folks,

We ran into a pretty significant performance penalty when casting floats.
We've identified a code workaround that we wanted to pass along, but we were
also wondering if others had experience with this and if there is a better
solution.

-jeff


.....


I'd like to share our findings regarding the C# (float) cast.

While converting double to float, we found several slowdowns.
We realized the C# (float) cast can be costly if not used appropriately.

------------------------------------------------------------
Slow cases
------------------------------------------------------------
(A)
private void someMath(float[] input, float[] output)
{
    int length = input.Length;
    for (int i = 0; i < length; i++)
    {
        output[i] = (float)Math.Log10(input[i]); // <--- inline (float) cast is slow!
    }
}

(B)
private void Copy(double[] input, float[] output)
{
    int length = input.Length;
    for (int i = 0; i < length; i++)
    {
        output[i] = (float)input[i]; // <--- inline (float) cast is slow!
    }
}

In these examples, the "inline" (float) casts are executed on the same line as
other operations, such as Math.Log10() or a simple fetch from the input array.

These are slow, even in a Release build.
(A): It takes 3 to 6% longer than the double[] case. ;-)
(B): It takes twice(!) as long as the double[] case. ;-)

As I understand it, and according to articles on the Net, the slowdown comes
from writing the intermediate value back to memory, as follows. The extra trips
are costly.


(A) CPU/FPU   +--> fetch --> Math.Log10 --+     +--> (float) --+
              |                           |     |              |
              |                           |     |              |
              |                           V     |              V
    memory  input                  written back to heap     output
                                   Extra memory access!

(B) CPU/FPU   +--> fetch --+     +--> (float) --+
              |            |     |              |
              |            |     |              |
              |            V     |              V
    memory  input   written back to heap     output
                    Extra memory access!

------------------------------------------------------------
Fast cases
------------------------------------------------------------

To avoid the extra memory access, we can use a temporary variable to hold
the intermediate data.
The temporary variable is allocated in a CPU register, so the speed stays
high.

(C)
private void someMath(float[] input, float[] output)
{
    int length = input.Length;
    for (int i = 0; i < length; i++)
    {
        double tmp = Math.Log10(input[i]); // <-- store in a temporary variable in a CPU register
        output[i] = (float)tmp;            // <-- then (float) cast. Fast!
    }
}

(D)
private void Copy(double[] input, float[] output)
{
    int length = input.Length;
    for (int i = 0; i < length; i++)
    {
        double tmp = input[i];  // <-- store in a temporary variable in a CPU register
        output[i] = (float)tmp; // <-- then (float) cast. Fast!
    }
}

In these improved versions, the intermediate data is not written back to
memory.
The improved versions are actually slightly faster than the double[] case.
(C): 1% faster than the double[] case. :)
(D): 3% faster than the double[] case. :)

(C) CPU/FPU   +--> fetch --> Math.Log10 --> stays in -----> (float) --+
              |                             CPU register              |
              |                             Fast!                     |
              |                                                       V
    memory  input                                                  output

(D) CPU/FPU   +--> fetch --> stays in -----> (float) --+
              |              CPU register              |
              |              Fast!                     |
              |                                        V
    memory  input                                   output


OK, this is what we found from benchmarking and googling.

The same applies to ArraySegment<float> arrays,
because the issue relates to the float values in the array, not the
array itself.

You could say this is a .NET compiler optimization issue.
If you know of optimization flags or anything else that can fix this on the
compiler side, please let us know.
That would be a great help!
(By the way, a simple Release build does not help.)

Otherwise, we will need to optimize our code by hand using the temporary
variable technique shown in the examples.
And we have many instances of this kind of "inline" cast in our code.
 
Arnie said:
We ran into a pretty significant performance penalty when casting floats.

To be honest, it doesn't really sound that significant to me. Read
on...

We've identified a code workaround that we wanted to pass along, but we were
also wondering if others had experience with this and if there is a better
solution.

I'd like to share our findings regarding the C# (float) cast.

While converting double to float, we found several slowdowns.
We realized the C# (float) cast can be costly if not used appropriately.

As I understand it, and according to articles on the Net, the slowdown comes
from writing the intermediate value back to memory, as follows. The extra trips
are costly.

I see no reason to believe that there's an extra value written to the
*heap* (rather than the stack), and no reason why the JIT shouldn't use
a register for the intermediate value without an explicit local
variable.

<snip>

I have included a short but complete program below which uses an array
of a million elements and iterates each method a thousand times. Here
are the results on my laptop:

Log10Fast: 64489ms
Log10Slow: 70420ms
CopyFast: 3841ms
CopySlow: 4070ms

So your optimisation improves things by about 10% for the Log10 case
and about 5% for the Copy case.
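
For anyone who wants to reproduce numbers like these, here is a minimal sketch of a timing harness. It is not the original program from this post: the names CastBenchmark, CopyMethod and TimeMs are made up for illustration, and only the copy case is shown. The array size and iteration count follow the description above (a million elements, a thousand iterations).

```csharp
using System;
using System.Diagnostics;

// Hypothetical delegate type so both copy variants can share one timing helper.
public delegate void CopyMethod(double[] input, float[] output);

public static class CastBenchmark
{
    // Cast written inline with the array read, as in case (B).
    public static void CopySlow(double[] input, float[] output)
    {
        int length = input.Length;
        for (int i = 0; i < length; i++)
        {
            output[i] = (float)input[i];
        }
    }

    // Value copied to a local first, then cast, as in case (D).
    public static void CopyFast(double[] input, float[] output)
    {
        int length = input.Length;
        for (int i = 0; i < length; i++)
        {
            double tmp = input[i];
            output[i] = (float)tmp;
        }
    }

    // Calls the method once untimed so JIT compilation is excluded,
    // then returns the elapsed milliseconds for the timed iterations.
    public static long TimeMs(CopyMethod method, double[] input, float[] output, int iterations)
    {
        method(input, output);
        Stopwatch watch = Stopwatch.StartNew();
        for (int n = 0; n < iterations; n++)
        {
            method(input, output);
        }
        watch.Stop();
        return watch.ElapsedMilliseconds;
    }

    public static void Main()
    {
        double[] input = new double[1000000];
        float[] output = new float[input.Length];
        for (int i = 0; i < input.Length; i++)
        {
            input[i] = i + 0.5;
        }

        Console.WriteLine("CopyFast: {0}ms", TimeMs(CopyFast, input, output, 1000));
        Console.WriteLine("CopySlow: {0}ms", TimeMs(CopySlow, input, output, 1000));
    }
}
```

Build it in Release mode and run it outside the debugger; with a debugger attached the JIT optimisations are normally switched off, and the timings then say little about production behaviour.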

Otherwise, we will need to optimize our code by hand using the temporary
variable technique shown in the examples.
And we have many instances of this kind of "inline" cast in our code.

And have you any reason to believe that's *actually* the bottleneck in
your code? Do you regularly convert a billion floats and care about
200ms of performance loss?

I don't understand why the results are as they are (it would be worth
looking at the JITted, optimised code to find out) - but even so, I
certainly wouldn't start micro-optimising all over the place. Find out
where the *actual* bottleneck in your code is, and consider reducing
readability/simplicity for the sake of performance just in the most
significant parts. Don't start doing it all over the place, which
sounds like the course of action you're considering at the moment.
 
Thanks for the feedback Jon.

This is a mature system where they are "wringing" out the last bit of
performance.

It is a scientific test instrument (a spectrum analyzer), so they are acquiring
and converting extremely large chunks of data (waveforms). Some runs can
acquire as much as 500MB of data at a time.

So they have "progressed" to the point where they are looking at the right
optimization spots in their code. Casting from double[] is indeed 2x as slow
without the optimization, quite different from the 5-10% case you
demonstrated. Again, the only thing they changed was assigning a local
variable, hence their curiosity about what the C#/JIT compiler is doing.

I think our premise is that, given this single change, there would be no
performance difference if the compiler were taking
advantage of every reasonable performance optimization.

Time to look at the IL and see what is going on.

-jeff
 
Arnie said:
This is a mature system where they are "wringing" out the last bit of
performance.

Hmm... it still sounds dubious to me. I doubt you'll really see
performance benefits which are significant in the context of the whole
app. Mind you, it sounds like you're not seeing the same behaviour as
me to start with, so hey...

It is a scientific test instrument (a spectrum analyzer), so they are acquiring
and converting extremely large chunks of data (waveforms). Some runs can
acquire as much as 500MB of data at a time.

500MB isn't that much though, in the context of the tests I was doing -
it was using a billion points of data, which would be 8GB. The copy was
then only taking 4 seconds, and making the change only shaved off a
very small amount.

So they have "progressed" to the point where they are looking at the right
optimization spots in their code. Casting from double[] is indeed 2x as slow
without the optimization, quite different from the 5-10% case you
demonstrated.

So can you give a short but complete program which *does* demonstrate
the 2x difference?

Just as a thought, which CLR are you using? I'm on the 2.0, on x86. If
you're using 1.0, 1.1, or 2.0 on x64, that could account for some
differences.
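
A quick way to answer that question is to print the runtime version and the process bitness from inside the application itself. A minimal sketch, with RuntimeInfo and Describe as made-up names:

```csharp
using System;

public static class RuntimeInfo
{
    // Environment.Version reports the version of the running CLR.
    // IntPtr.Size is 4 in a 32-bit process and 8 in a 64-bit one.
    public static string Describe()
    {
        string bitness = IntPtr.Size == 8 ? "64-bit" : "32-bit";
        return "CLR " + Environment.Version + ", " + bitness + " process";
    }

    public static void Main()
    {
        Console.WriteLine(Describe());
    }
}
```

Running this on both machines being compared shows whether the same JIT is actually in play before any timings are compared.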

Again, the only thing they changed was assigning a local
variable, hence their curiosity about what the C#/JIT compiler is doing.

I think our premise is that, given this single change, there would be no
performance difference if the compiler were taking
advantage of every reasonable performance optimization.

Time to look at the IL and see what is going on.

The IL doesn't show much. It's the optimised assembly you need to be
looking at, really. cordbg is your friend - but don't forget to tell it
to perform JIT optimisations. SOS may help too. I don't envy you...
 