Comparing byte arrays

Hi Peter,

The one Joe sent you has samples from David, Herfried, Spam and Corrado.
Although Joe said the one from Corrado was cool, I thought the one from
Herfried could be faster, because it can stop early. That matters for a
very large byte array, or when the very first byte already differs (which
is likely when the byte arrays are not the same). I also made one myself.
It serializes the byte array using a MemoryStream. (A problem with mine is
that it uses twice the memory, so it is not really usable with arrays of
20 MB and more.)

I am curious which is the fastest. (When the byte arrays have unequal
lengths, the ones from Herfried and me are of course both the fastest, but
that check could be added to Corrado's procedure too.)

Cor.

\\\
Dim abyt1() As Byte = {12, 55, 88, 32}
Dim abyt2() As Byte = {12, 55, 87, 32}
If abyt1.Length = abyt2.Length Then
    Dim mem1 As New IO.MemoryStream
    Dim mem2 As New IO.MemoryStream
    Dim binWriter1 As New IO.BinaryWriter(mem1)
    Dim binWriter2 As New IO.BinaryWriter(mem2)
    binWriter1.Write(abyt1)
    binWriter2.Write(abyt2)
    Dim binReader1 As New IO.BinaryReader(binWriter1.BaseStream)
    Dim binReader2 As New IO.BinaryReader(binWriter2.BaseStream)
    binReader1.BaseStream.Position = 0
    binReader2.BaseStream.Position = 0
    Dim a, b As String
    a = binReader1.ReadChars(abyt1.Length)
    b = binReader2.ReadChars(abyt2.Length)
    If a <> b Then
        MessageBox.Show("not equal char")
    End If
Else
    MessageBox.Show("not equal length")
End If
///
 
Cor said:
\\\
Dim abyt1() As Byte = {12, 55, 88, 32}
Dim abyt2() As Byte = {12, 55, 87, 32}
If abyt1.Length = abyt2.Length Then
    Dim mem1 As New IO.MemoryStream
    Dim mem2 As New IO.MemoryStream
    Dim binWriter1 As New IO.BinaryWriter(mem1)
    Dim binWriter2 As New IO.BinaryWriter(mem2)
    binWriter1.Write(abyt1)
    binWriter2.Write(abyt2)
    Dim binReader1 As New IO.BinaryReader(binWriter1.BaseStream)
    Dim binReader2 As New IO.BinaryReader(binWriter2.BaseStream)
    binReader1.BaseStream.Position = 0
    binReader2.BaseStream.Position = 0
    Dim a, b As String
    a = binReader1.ReadChars(abyt1.Length)
    b = binReader2.ReadChars(abyt2.Length)
    If a <> b Then
        MessageBox.Show("not equal char")
    End If
Else
    MessageBox.Show("not equal length")
End If
///

This code confuses bytes with characters, which is never a good idea.
In particular, not every byte array is going to be a valid stream of
UTF-8 encoded characters, at which point ReadChars will throw an
exception.

It also ends up using *4* times as much memory: it first copies all the
data into a stream, and then reads the data again into a character
array, which is going to take twice as much memory as the byte array.
 
Peter said:
how can I compare two byte arrays in VB.NET?

Other posters have given you ways using hashes or streams. Personally,
I think it's much easier just to compare each value directly.

This is the code I'd use in C#. I don't know VB.NET well enough to give
you the best, most idiomatic code for that environment, but I suspect
you should be able to understand the C# version:

public static bool CompareByteArrays (byte[] data1, byte[] data2)
{
    // If both are null, they're equal
    if (data1==null && data2==null)
    {
        return true;
    }
    // If either but not both are null, they're not equal
    if (data1==null || data2==null)
    {
        return false;
    }
    // Arrays of different lengths can't be equal
    if (data1.Length != data2.Length)
    {
        return false;
    }
    // Compare element by element, stopping at the first difference
    for (int i=0; i < data1.Length; i++)
    {
        if (data1[i] != data2[i])
        {
            return false;
        }
    }
    return true;
}

That's going to be as efficient as any other algorithm *unless* you
want to compare one byte array to several others, in which case hashing
*might* help you. The above is still likely to be the simplest solution
though.
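
For the "one array against several others" case, here is a rough sketch of how hashing might be used as a pre-filter. It is only an illustration under my own assumptions: the class name ByteArrayLookup, the helper names and the choice of SHA-256 are not from anyone's post in this thread, and the arrays are assumed to be non-null. A hash match is still confirmed with a direct byte-by-byte comparison, so a hash collision cannot produce a false positive.

```csharp
using System;
using System.Collections.Generic;
using System.Security.Cryptography;

// Hypothetical helper, not from the thread: index several byte arrays by
// hash so that one target array can be looked up without comparing it
// against every candidate.
public static class ByteArrayLookup
{
    // Hex string of the SHA-256 digest, used as the dictionary key.
    static string Key(byte[] data)
    {
        using (SHA256 sha = SHA256.Create())
        {
            return BitConverter.ToString(sha.ComputeHash(data));
        }
    }

    // Groups the candidates by their hash key.
    public static Dictionary<string, List<byte[]>> BuildIndex(IEnumerable<byte[]> candidates)
    {
        var index = new Dictionary<string, List<byte[]>>();
        foreach (byte[] candidate in candidates)
        {
            string key = Key(candidate);
            List<byte[]> bucket;
            if (!index.TryGetValue(key, out bucket))
            {
                bucket = new List<byte[]>();
                index[key] = bucket;
            }
            bucket.Add(candidate);
        }
        return index;
    }

    // True if some indexed array has exactly the same bytes as target.
    public static bool Contains(Dictionary<string, List<byte[]>> index, byte[] target)
    {
        List<byte[]> bucket;
        if (!index.TryGetValue(Key(target), out bucket))
        {
            return false;
        }
        foreach (byte[] candidate in bucket)
        {
            // Confirm with a direct comparison; equal hashes alone are not proof.
            if (SameBytes(candidate, target))
            {
                return true;
            }
        }
        return false;
    }

    // Plain length check plus element-by-element loop.
    static bool SameBytes(byte[] a, byte[] b)
    {
        if (a.Length != b.Length)
        {
            return false;
        }
        for (int i = 0; i < a.Length; i++)
        {
            if (a[i] != b[i])
            {
                return false;
            }
        }
        return true;
    }
}
```

Building the index touches every candidate once, so this only pays off when the same set of arrays is searched repeatedly. For a single pair, the direct loop is still the simpler choice.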
 
Hi Jon,

Jon said:
This code confuses bytes with characters, which is never a good idea.
In particular, not every byte array is going to be a valid stream of
UTF-8 encoded characters, at which point ReadChars will throw an
exception.

In the other direction I would agree with you, but not in this direction.
An 8-bit byte becomes a 16-bit Unicode character, but the value stays the
same.

Jon said:
It also ends up using *4* times as much memory: it first copies all the
data into a stream, and then reads the data again into a character
array, which is going to take twice as much memory as the byte array.

Seeing your message above, I realize it is even 6 times, because the byte
is converted to Unicode as you said. However, that is exactly what I named
as the weak point of this method (although I first thought it was 4 and
said 2 because I thought I had miscalculated).

I am staying with what Herfried showed, as I said in my message, which is
by the way the same as yours. But because it was said that other approaches
were better, I showed this as another method, one that I will probably
never use myself.

:-)

Cor
 
Cor said:
In the other direction I would agree with you, but not in this direction.
An 8-bit byte becomes a 16-bit Unicode character, but the value stays the
same.

No, it really doesn't. If you use BinaryWriter with no encoding
parameter, it will use UTF-8 by default. You're trying to decode a byte
array assuming that it's a valid UTF-8 sequence, which it may not be.
For instance, take:
Dim abyt1() As Byte = {47, &Hc0, &Haf}
Dim abyt2() As Byte = {&Hc0, &Haf, 47}

These are both *actually* invalid UTF-8 sequences, but the .NET decoder
doesn't notice that. However, it *does* decode both into the same
string - so you get a false positive.

I can't actually provoke ReadChars into throwing an exception at the
moment, but it *should*.
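
To make the false-positive risk concrete, here is a small sketch of my own. It deliberately uses different bytes from the ones above, because how a decoder treats that particular sequence may depend on the runtime version. The class and method names are hypothetical. It relies on one assumption: that the default Encoding.UTF8 replaces each byte it cannot decode with the replacement character U+FFFD instead of throwing.

```csharp
using System;
using System.Text;

// Hypothetical demo class, not from the thread.
public static class Utf8FalsePositiveDemo
{
    // Decodes both arrays as UTF-8 and compares the resulting strings,
    // which is in effect what the ReadChars-based comparison does.
    public static bool DecodeToSameString(byte[] a, byte[] b)
    {
        string sa = Encoding.UTF8.GetString(a);
        string sb = Encoding.UTF8.GetString(b);
        return string.Equals(sa, sb, StringComparison.Ordinal);
    }

    // Two arrays that differ in their second byte. Neither 0xFF nor 0xFE
    // can appear in valid UTF-8, so (assuming a replacing decoder) both
    // become U+FFFD and the decoded strings no longer differ.
    public static bool DifferentBytesLookEqual()
    {
        byte[] first = { 0x41, 0xFF };
        byte[] second = { 0x41, 0xFE };
        return DecodeToSameString(first, second);
    }
}
```

If the assumption holds, DifferentBytesLookEqual() reports the two arrays as equal although their bytes differ, which is the kind of false positive described above.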

As I said, confusing bytes and characters is *always* a bad idea.

Cor said:
Seeing your message above, I realize it is even 6 times, because the byte
is converted to Unicode as you said. However, that is exactly what I named
as the weak point of this method (although I first thought it was 4 and
said 2 because I thought I had miscalculated).

Um, it's still 4 times:

1 for the original
1 for the memory stream
2 for the string

Where else do you think memory is being used?

Consider a simple test case with 1K of bytes in each array, all being
<0x80:

Memory used by original byte arrays: 2K (2*1K)
Memory used by memory streams: 2K (2*1K)
Memory used by strings: 4K (2*2K)

Total memory: 8K = 4*original 2K

Cor said:
I am staying with what Herfried showed, as I said in my message, which is
by the way the same as yours. But because it was said that other approaches
were better, I showed this as another method, one that I will probably
never use myself.

Herfried's method is indeed correct. The use of a hash *looks* clever,
but for just comparing two byte arrays it will be less efficient than
comparing them directly, particularly if there is a difference early
on, or the lengths are different. It also has the tiny possibility of
giving a false positive, if the byte arrays are different but produce
the same hash. (Highly unlikely, but possible.)
 
Hi Jon,

I did not investigate whether a byte is converted to a 2-byte character or
not.

If not, it is 4 times; if it is converted to 2 bytes, it is 6 times:
2 streams in memory
2 arrays of 2-byte characters

Actually it is not important; I only made this to show that if you really
want to do streaming, then this would be "a" method.

As I said, I would never think about using this. However, to me it did show
that the normal comparison of byte arrays that you, Herfried and I are used
to is probably the most efficient.

All the more so because, when two byte arrays differ, you can expect that
probably:
- the length is unequal, or
- the difference already shows in the first bytes, because that is the
nature of a byte array.

But there seem to be a lot of people who think that looping in your program
is slow. (Although I think looping is probably what all the other methods
do behind the scenes to get the same result.)

Cor
 
Cor said:
I did not investigate whether a byte is converted to a 2-byte character or
not.
If not, it is 4 times; if it is converted to 2 bytes, it is 6 times:
2 streams in memory
2 arrays of 2-byte characters

But there were 2 arrays to start with. You've shown that for each 1K
*per byte array* you end up with an extra 6K *in total*. That means
that for each 2K of original byte array *in total* you end up with 8K
*in total* - 4 times as much.

Work through the example - you only end up with 4 times as much memory
used.

Cor said:
Actually it is not important; I only made this to show that if you really
want to do streaming, then this would be "a" method.

A fatally flawed one, however, due to the char/byte confusion.

If streaming, I'd suggest reading a block at a time, and comparing with
simple byte-by-byte operations. The only tricky bit would be taking
into account that a Read from a stream might not return as much data as
you want it to. You'd either have to loop on each stream to get a full
buffer, then compare the buffers, or manage two partial buffers,
refilling them when necessary.
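
A minimal sketch of the first option (loop on each stream until its buffer is full, then compare the buffers) might look like the following. The names StreamComparer, ReadFully and StreamsEqual, and the blockSize parameter, are made up for illustration, and I haven't tuned anything for performance.

```csharp
using System;
using System.IO;

// Hypothetical helper, not from the thread.
public static class StreamComparer
{
    // Keeps calling Read until the buffer is full or the stream ends,
    // because a single Read may return fewer bytes than requested.
    // Returns the number of bytes actually placed in the buffer.
    static int ReadFully(Stream stream, byte[] buffer)
    {
        int total = 0;
        while (total < buffer.Length)
        {
            int read = stream.Read(buffer, total, buffer.Length - total);
            if (read == 0)
            {
                break; // end of stream
            }
            total += read;
        }
        return total;
    }

    // Compares two streams block by block from their current positions.
    public static bool StreamsEqual(Stream a, Stream b, int blockSize)
    {
        if (blockSize <= 0)
        {
            throw new ArgumentOutOfRangeException("blockSize");
        }
        byte[] bufferA = new byte[blockSize];
        byte[] bufferB = new byte[blockSize];
        while (true)
        {
            int readA = ReadFully(a, bufferA);
            int readB = ReadFully(b, bufferB);
            if (readA != readB)
            {
                return false; // one stream ended before the other
            }
            if (readA == 0)
            {
                return true; // both ended without a difference
            }
            for (int i = 0; i < readA; i++)
            {
                if (bufferA[i] != bufferB[i])
                {
                    return false; // stop at the first differing byte
                }
            }
        }
    }
}
```

Only two buffers of blockSize bytes are held at a time, so memory use does not grow with the size of the data, and the data is only ever treated as bytes.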

Cor said:
As I said, I would never think about using this. However, to me it did show
that the normal comparison of byte arrays that you, Herfried and I are used
to is probably the most efficient.

All the more so because, when two byte arrays differ, you can expect that
probably:
- the length is unequal, or
- the difference already shows in the first bytes, because that is the
nature of a byte array.

But there seem to be a lot of people who think that looping in your program
is slow. (Although I think looping is probably what all the other methods
do behind the scenes to get the same result.)

Of course. Sooner or later, all the algorithms have to "touch" all the
memory, otherwise they can't possibly catch all differences.
 
Hi Jon,

Jon said:
The only tricky bit would be taking
into account that a Read from a stream might not return as much data as
you want it to.

I thought about that one later on, but a 00 byte in a string stays there, so
the string will have that length; the byte just cannot be displayed. (I have
thought about checking whether what I say above is true, but I think we have
said enough about this.) I never presented it as an ideal method; only, if
you really want to do it without a For loop, this was also a possibility.

Cor
 
Cor said:
I thought about that one later on, but a 00 byte in a string stays there, so
the string will have that length; the byte just cannot be displayed.

I'm not sure how that's relevant, to be honest...

Cor said:
(I have thought about checking whether what I say above is true, but I think
we have said enough about this.) I never presented it as an ideal method;
only, if you really want to do it without a For loop, this was also a
possibility.

If you don't mind it being fundamentally broken :)
 
Jon said:
If you don't mind it being fundamentally broken :)

No.

Because I found the For loop the best, and I think this also shows that the
other methods are more or less broken; that was the major reason I made it.

For me this is EOT.

:-)

Cor
 