byte[] to byte*... What is the fastest way?

  • Thread starter Thread starter ThunderMusic
  • Start date Start date
ThunderMusic said:
Ben Voigt said:
r norman said:
On Thu, 23 Aug 2007 15:36:18 -0400, "ThunderMusic"

Hi,
The subject says it all... I want to use a byte[] and use it as byte* so
I
can increment the pointer to iterate through it.

What is the fastest way of doing so in C#?


I have recently started with C#, converting C++ applications that must
communicate with external equipment using protocols with rigidly fixed
byte sequences. Initially I was terribly frustrated abandoning my
precious pointer manipulation and access to byte arrays. But I
developed a bunch of utilities to create managed code byte[] arrays
from pieces and extract appropriate values from them and can now avoid
all unmanaged code. Look into Encoding.Convert for converting between
ASCII byte sequences and Unicode, MemoryStream.Write and ToArray along
with BinaryWriter.Write for creating byte[] data and the BitConverter
set of methods to extract values from byte[].

I keep telling myself that staying entirely within managed code is
worth all that effort, but I am still early in the process. Some old
timers here can probably fill you in on much more detail, including
the virtues of doing so.

Staying in managed code may be worth the effort, converting to C# is not.
The C++ compiler still makes better code than the JIT can.

hi,
That's an idea... I'll look into it... probably that some methods should
be more suited to C++ because I'm seeking a bit of performance for key
features in my app.. I never think about managed C++, but it's sure to be
a good idea... I don't want to make a C++ library for only one method
tought...

Thanks a lot

ThunderMusic


Managed C# is just as fast as C++/CLI in this case, the C++/CLI compiler
(C++/CLI to IL) doesn't "unroll" the inner loop (as illustrated by Ben), at
present there is no single managed compiler that performs "loop unrolling".
The JIT compiler (common for all managed languages) doesn't care to unroll
either.

Choosing the right algorithm is key for better performance not the (managed)
language choice.
Just to give you an idea, I optimized the algorithm in the following
function using C# with pointers (fixed/unsafe = non verifiable code).

private const int BASE = 65521; /* largest prime smaller than 65536 */
private const int NMAX = 5552;
//using an unroll factor of 16,
// compile with /o+
private static uint Adler(byte[] databytes)
{
uint adler = 1;
uint s1 = adler & 0xffff;
uint s2 = (adler >> 16) & 0xffff;
uint k;

if (databytes == null) return 1;
uint len = (uint)databytes.Length;
unsafe
{
fixed (byte* fixeddataPtr = &databytes[0])
{
byte* ptr = fixeddataPtr;
while (len > 0)
{
k = len < NMAX ? len : NMAX;
len -= k;
while (k >= 16)
{
s1 += ptr[0];
s2 += s1;
s1 += ptr[1];
s2 += s1;
s1 += ptr[2];
s2 += s1;
s1 += ptr[3];
s2 += s1;
s1 += ptr[4];
s2 += s1;
s1 += ptr[5];
s2 += s1;
s1 += ptr[6];
s2 += s1;
s1 += ptr[7];
s2 += s1;
s1 += ptr[8];
s2 += s1;
s1 += ptr[9];
s2 += s1;
s1 += ptr[10];
s2 += s1;
s1 += ptr[11];
s2 += s1;
s1 += ptr[12];
s2 += s1;
s1 += ptr[13];
s2 += s1;
s1 += ptr[14];
s2 += s1;
s1 += ptr[15];
s2 += s1;

ptr += 16;
k -= 16;
}

if (k != 0)
{
int i = -1;
do
{
s1 += ptr[++i];
s2 += s1;
} while (--k > 0);
}
s1 %= BASE;
s2 %= BASE;
}
}
}
return (s2 << 16) | s1;
}

Calling this function 4000 times on my box, using a byte array of 27500
random values, takes ~ 63 msecs.
The C++/CLI code (optimized build /O2) posted by Ben takes 95 milliseconds
to run the same test.

The same algorithm used in C# using array indexing (no unsafe constructs)
takes 105 msecs, that is only 10 % slower than the C++/CLI code.

private static uint AdlerSafe(byte[] ba)
{
uint adler = 1;
uint s1 = adler & 0xffff;
uint s2 = (adler >> 16) & 0xffff;
uint k;

if (ba == null) return 1;
uint len = (uint)ba.Length;
int offset = 0;
ArraySegment<byte> ptrSegm = new ArraySegment<byte>();
while (len > 0)
{
k = len < NMAX ? len : NMAX;
len -= k;

while (k >= 16)
{
ptrSegm = new ArraySegment<byte>( ba, offset, 16);
int i = ptrSegm.Offset - 1;
s1 += ptrSegm.Array[++i];
s2 += s1;
s1 += ptrSegm.Array[++i];
s2 += s1;
s1 += ptrSegm.Array[++i];
s2 += s1;
s1 += ptrSegm.Array[++i];
s2 += s1;
s1 += ptrSegm.Array[++i];
s2 += s1;
s1 += ptrSegm.Array[++i];
s2 += s1;
s1 += ptrSegm.Array[++i];
s2 += s1;
s1 += ptrSegm.Array[++i];
s2 += s1;
s1 += ptrSegm.Array[++i];
s2 += s1;
s1 += ptrSegm.Array[++i];
s2 += s1;
s1 += ptrSegm.Array[++i];
s2 += s1;
s1 += ptrSegm.Array[++i];
s2 += s1;
s1 += ptrSegm.Array[++i];
s2 += s1;
s1 += ptrSegm.Array[++i];
s2 += s1;
s1 += ptrSegm.Array[++i];
s2 += s1;
s1 += ptrSegm.Array[++i];
s2 += s1;

offset += 16;
k -= 16;
}
if (k != 0)
{
ptrSegm = new ArraySegment<byte>( ba, offset, (int)k);
do {

s1 += ptrSegm.Array[offset++];
s2 += s1;
} while (--k > 0);
}
s1 %= BASE;
s2 %= BASE;
}

return (s2 << 16) | s1;
}


See, unsafe code is the fastest, but verifiable C# may be fast enough.
Either way, I don't like this kind of hand optimizing ( nor do I like
"unsafe" C# code), it's up to the compilers to take care about this,
unfortunately at present they don't.

Willy.
 
Willy Denoyette said:
ThunderMusic said:
Ben Voigt said:
On Thu, 23 Aug 2007 15:36:18 -0400, "ThunderMusic"

Hi,
The subject says it all... I want to use a byte[] and use it as byte*
so I
can increment the pointer to iterate through it.

What is the fastest way of doing so in C#?


I have recently started with C#, converting C++ applications that must
communicate with external equipment using protocols with rigidly fixed
byte sequences. Initially I was terribly frustrated abandoning my
precious pointer manipulation and access to byte arrays. But I
developed a bunch of utilities to create managed code byte[] arrays
from pieces and extract appropriate values from them and can now avoid
all unmanaged code. Look into Encoding.Convert for converting between
ASCII byte sequences and Unicode, MemoryStream.Write and ToArray along
with BinaryWriter.Write for creating byte[] data and the BitConverter
set of methods to extract values from byte[].

I keep telling myself that staying entirely within managed code is
worth all that effort, but I am still early in the process. Some old
timers here can probably fill you in on much more detail, including
the virtues of doing so.

Staying in managed code may be worth the effort, converting to C# is
not. The C++ compiler still makes better code than the JIT can.

hi,
That's an idea... I'll look into it... probably that some methods
should be more suited to C++ because I'm seeking a bit of performance for
key features in my app.. I never think about managed C++, but it's sure
to be a good idea... I don't want to make a C++ library for only one
method tought...

Thanks a lot

ThunderMusic


Managed C# is just as fast as C++/CLI in this case, the C++/CLI compiler
(C++/CLI to IL) doesn't "unroll" the inner loop (as illustrated by Ben),
at present there is no single managed compiler that performs "loop
unrolling". The JIT compiler (common for all managed languages) doesn't
care to unroll either.

Choosing the right algorithm is key for better performance not the
(managed) language choice.
Just to give you an idea, I optimized the algorithm in the following
function using C# with pointers (fixed/unsafe = non verifiable code).

private const int BASE = 65521; /* largest prime smaller than 65536 */
private const int NMAX = 5552;
//using an unroll factor of 16,
// compile with /o+
private static uint Adler(byte[] databytes)
{
uint adler = 1;
uint s1 = adler & 0xffff;
uint s2 = (adler >> 16) & 0xffff;
uint k;

if (databytes == null) return 1;
uint len = (uint)databytes.Length;
unsafe
{
fixed (byte* fixeddataPtr = &databytes[0])
{
byte* ptr = fixeddataPtr;
while (len > 0)
{
k = len < NMAX ? len : NMAX;
len -= k;
while (k >= 16)
{
s1 += ptr[0];
s2 += s1;
s1 += ptr[1];
s2 += s1;
s1 += ptr[2];
s2 += s1;
s1 += ptr[3];
s2 += s1;
s1 += ptr[4];
s2 += s1;
s1 += ptr[5];
s2 += s1;
s1 += ptr[6];
s2 += s1;
s1 += ptr[7];
s2 += s1;
s1 += ptr[8];
s2 += s1;
s1 += ptr[9];
s2 += s1;
s1 += ptr[10];
s2 += s1;
s1 += ptr[11];
s2 += s1;
s1 += ptr[12];
s2 += s1;
s1 += ptr[13];
s2 += s1;
s1 += ptr[14];
s2 += s1;
s1 += ptr[15];
s2 += s1;

ptr += 16;
k -= 16;
}

if (k != 0)
{
int i = -1;
do
{
s1 += ptr[++i];
s2 += s1;
} while (--k > 0);
}
s1 %= BASE;
s2 %= BASE;
}
}
}
return (s2 << 16) | s1;
}

Calling this function 4000 times on my box, using a byte array of 27500
random values, takes ~ 63 msecs.
The C++/CLI code (optimized build /O2) posted by Ben takes 95 milliseconds
to run the same test.

The same algorithm used in C# using array indexing (no unsafe constructs)
takes 105 msecs, that is only 10 % slower than the C++/CLI code.

private static uint AdlerSafe(byte[] ba)
{
uint adler = 1;
uint s1 = adler & 0xffff;
uint s2 = (adler >> 16) & 0xffff;
uint k;

if (ba == null) return 1;
uint len = (uint)ba.Length;
int offset = 0;
ArraySegment<byte> ptrSegm = new ArraySegment<byte>();
while (len > 0)
{
k = len < NMAX ? len : NMAX;
len -= k;

while (k >= 16)
{
ptrSegm = new ArraySegment<byte>( ba, offset, 16);
int i = ptrSegm.Offset - 1;
s1 += ptrSegm.Array[++i];
s2 += s1;
s1 += ptrSegm.Array[++i];
s2 += s1;
s1 += ptrSegm.Array[++i];
s2 += s1;
s1 += ptrSegm.Array[++i];
s2 += s1;
s1 += ptrSegm.Array[++i];
s2 += s1;
s1 += ptrSegm.Array[++i];
s2 += s1;
s1 += ptrSegm.Array[++i];
s2 += s1;
s1 += ptrSegm.Array[++i];
s2 += s1;
s1 += ptrSegm.Array[++i];
s2 += s1;
s1 += ptrSegm.Array[++i];
s2 += s1;
s1 += ptrSegm.Array[++i];
s2 += s1;
s1 += ptrSegm.Array[++i];
s2 += s1;
s1 += ptrSegm.Array[++i];
s2 += s1;
s1 += ptrSegm.Array[++i];
s2 += s1;
s1 += ptrSegm.Array[++i];
s2 += s1;
s1 += ptrSegm.Array[++i];
s2 += s1;

offset += 16;
k -= 16;
}
if (k != 0)
{
ptrSegm = new ArraySegment<byte>( ba, offset, (int)k);
do {

s1 += ptrSegm.Array[offset++];
s2 += s1;
} while (--k > 0);
}
s1 %= BASE;
s2 %= BASE;
}

return (s2 << 16) | s1;
}


See, unsafe code is the fastest, but verifiable C# may be fast enough.
Either way, I don't like this kind of hand optimizing ( nor do I like
"unsafe" C# code), it's up to the compilers to take care about this,
unfortunately at present they don't.

Willy.


Sorry, above does not contain the correct verifiable C# function, here she
is........

private const int MOD_ADLER = 65521;
private static uint AdlerSafe(byte[] databytes)
{
const int unrollFactor = 16;
uint a = 1, b = 0;
int n = -1;
int len = databytes.Length;
while(len > 0)
{
int tlen = len > 5552 ? 5552 : len;
len -= tlen;

for(int c = 0; c < tlen / unrollFactor; c++)
{
a += databytes[++n];
b += a;
a += databytes[++n];
b += a;
a += databytes[++n];
b += a;
a += databytes[++n];
b += a;
a += databytes[++n];
b += a;
a += databytes[++n];
b += a;
a += databytes[++n];
b += a;
a += databytes[++n];
b += a;
a += databytes[++n];
b += a;
a += databytes[++n];
b += a;
a += databytes[++n];
b += a;
a += databytes[++n];
b += a;
a += databytes[++n];
b += a;
a += databytes[++n];
b += a;
a += databytes[++n];
b += a;
a += databytes[++n];
b += a;
}
a = (a & 0xffff) + (a >> 16) * (65536 - MOD_ADLER);
b = (b & 0xffff) + (b >> 16) * (65536 - MOD_ADLER);

for(int c = 0; c < tlen % unrollFactor; c++)
{
a += databytes[++n];
b += a;
}
a = (a & 0xffff) + (a >> 16) * (65536 - MOD_ADLER);
b = (b & 0xffff) + (b >> 16) * (65536 - MOD_ADLER);
}
/* It can be shown that a <= 0x1013a here, so a single subtract will
do. */
if (a >= MOD_ADLER)
a -= MOD_ADLER;
/* It can be shown that b can reach 0xffef1 here. */
b = (b & 0xffff) + (b >> 16) * (65536 - MOD_ADLER);
if (b >= MOD_ADLER)
b -= MOD_ADLER;
return ((b << 16) | a);
}
}
 
Sorry, above does not contain the correct verifiable C# function, here she
is........

private const int MOD_ADLER = 65521;
private static uint AdlerSafe(byte[] databytes)
{
const int unrollFactor = 16;
uint a = 1, b = 0;
int n = -1;
int len = databytes.Length;
while(len > 0)
{
int tlen = len > 5552 ? 5552 : len;

Do you get the same result after changing this number? I would guess
probably so, in which case that could make a significant difference as well.
len -= tlen;

for(int c = 0; c < tlen / unrollFactor; c++)
{
a += databytes[++n];
b += a;
a += databytes[++n];
b += a;
a += databytes[++n];
b += a;
a += databytes[++n];
b += a;
a += databytes[++n];
b += a;
a += databytes[++n];
b += a;
a += databytes[++n];
b += a;
a += databytes[++n];
b += a;
a += databytes[++n];
b += a;
a += databytes[++n];
b += a;
a += databytes[++n];
b += a;
a += databytes[++n];
b += a;
a += databytes[++n];
b += a;
a += databytes[++n];
b += a;
a += databytes[++n];
b += a;
a += databytes[++n];
b += a;
}
a = (a & 0xffff) + (a >> 16) * (65536 - MOD_ADLER);
b = (b & 0xffff) + (b >> 16) * (65536 - MOD_ADLER);

for(int c = 0; c < tlen % unrollFactor; c++)
{
a += databytes[++n];
b += a;
}
a = (a & 0xffff) + (a >> 16) * (65536 - MOD_ADLER);
b = (b & 0xffff) + (b >> 16) * (65536 - MOD_ADLER);
}
/* It can be shown that a <= 0x1013a here, so a single subtract will
do. */
if (a >= MOD_ADLER)
a -= MOD_ADLER;
/* It can be shown that b can reach 0xffef1 here. */
b = (b & 0xffff) + (b >> 16) * (65536 - MOD_ADLER);
if (b >= MOD_ADLER)
b -= MOD_ADLER;
return ((b << 16) | a);
}
}
 
Ben Voigt said:
Sorry, above does not contain the correct verifiable C# function, here
she
is........

private const int MOD_ADLER = 65521;
private static uint AdlerSafe(byte[] databytes)
{
const int unrollFactor = 16;
uint a = 1, b = 0;
int n = -1;
int len = databytes.Length;
while(len > 0)
{
int tlen = len > 5552 ? 5552 : len;

Do you get the same result after changing this number? I would guess
probably so, in which case that could make a significant difference as
well.

Changing into what?
If you mean into 5550, it makes no real (measurable) difference in
performance.
Take care, this is not the true for the "optimized C# (unsafe)" sample
(based on marc adler's adler32 algorithm), here the NMAX length must be a
multiple of the "unroll factor" while NMAX is the largest n such that
255n(n+1)/2 + (n+1)(BASE-1) <= (2^32)-1 (as per the adler32 algorithm)

5552 is the largest possible value for n, while this value is perfectly
suitable for an "unroll factor" of 16.
Changing NMAX into 5550 (which is a valid number), requires you to change
the "unroll factor" into 15, else the result of the checksum will be
incorrect!

Willy.
 
Willy Denoyette said:
Ben Voigt said:
Sorry, above does not contain the correct verifiable C# function, here
she
is........

private const int MOD_ADLER = 65521;
private static uint AdlerSafe(byte[] databytes)
{
const int unrollFactor = 16;
uint a = 1, b = 0;
int n = -1;
int len = databytes.Length;
while(len > 0)
{
int tlen = len > 5552 ? 5552 : len;

Do you get the same result after changing this number? I would guess
probably so, in which case that could make a significant difference as
well.

Changing into what?
If you mean into 5550, it makes no real (measurable) difference in
performance.
Take care, this is not the true for the "optimized C# (unsafe)" sample
(based on marc adler's adler32 algorithm), here the NMAX length must be a
multiple of the "unroll factor" while NMAX is the largest n such that
255n(n+1)/2 + (n+1)(BASE-1) <= (2^32)-1 (as per the adler32 algorithm)

5552 is the largest possible value for n, while this value is perfectly
suitable for an "unroll factor" of 16.
Changing NMAX into 5550 (which is a valid number), requires you to change
the "unroll factor" into 15, else the result of the checksum will be
incorrect!

No fair, you're actually familiar with the algorithm in question!
 
Ben Voigt said:
Willy Denoyette said:
Ben Voigt said:
Sorry, above does not contain the correct verifiable C# function, here
she
is........

private const int MOD_ADLER = 65521;
private static uint AdlerSafe(byte[] databytes)
{
const int unrollFactor = 16;
uint a = 1, b = 0;
int n = -1;
int len = databytes.Length;
while(len > 0)
{
int tlen = len > 5552 ? 5552 : len;

Do you get the same result after changing this number? I would guess
probably so, in which case that could make a significant difference as
well.

Changing into what?
If you mean into 5550, it makes no real (measurable) difference in
performance.
Take care, this is not the true for the "optimized C# (unsafe)" sample
(based on marc adler's adler32 algorithm), here the NMAX length must be a
multiple of the "unroll factor" while NMAX is the largest n such that
255n(n+1)/2 + (n+1)(BASE-1) <= (2^32)-1 (as per the adler32 algorithm)

5552 is the largest possible value for n, while this value is perfectly
suitable for an "unroll factor" of 16.
Changing NMAX into 5550 (which is a valid number), requires you to change
the "unroll factor" into 15, else the result of the checksum will be
incorrect!

No fair, you're actually familiar with the algorithm in question!

Not really, the algorithm is well documented, all I did was derive the
unsafe C# code from a fast C implementation and optimized the code a bit
such to reduce register pressure while executing the inner loop.
The result is that executing this :
s1 += ptr[0];
s2 += s1;
s1 += ptr[1]
...
results in .....

movzx eax,byte ptr [edi]
add esi,eax
add ebx,esi
movzx eax,byte ptr [edi+1]
....

That means you need a 3 instructions sequence to handle one byte in the
array which is the absolute minimum on X86, even an optimized C build can't
do better.
The loop unrolling reduces the number of writes to memory of the
intermediate result, but has less effect than reducing the register
pressure.
But again, my point is that you should not have to do this, the language
compilers and the JIT should do a better job here. It looks like MSFT is
focusing on features and less on performance.

Willy.
 
Back
Top