Scanning byte arrays

V

Vince Panuccio

Hello,

If im looking for a pattern of byes in a byte array, what would be the
best approach?

I could convert the array into a string and use IndexOf recursivly by
remembering the last position I encountered a match, but it seems to
be the wrong way to go about it, especially if I have an array a
couple of MB in size.

My initial thoughts are to just use a for loop and check for a byte
matching the first instance of the search pattern.


For example

asodaiudoqiiqmngm,3n23n4234kj23rasdrwebHelloWorldkajsdiouahsd

My way of doing this would be to look for 'H' and then check every
character after that using a bunch of If's.

Is there a built in method or an algorithm I could use? Google has not
helped.

Cheers
 
N

Nicholas Paldino [.NET/C# MVP]

Vince,

If you have the contents in a string form, then you can use the IndexOf
method. However, if you need to search by bytes (and you are using
characters here, indicating to me that these are ascii values), then you
could create a generic method, like so:

public static int ByteIndexOf(byte[] searched, byte[] find, int start)
{
// Do standard error checking here.

// Did the values match?
bool matched = false;

// Cycle through each byte of the searched. Do not search past
// searched.Length - find.Length bytes, since it's impossible
// for the value to be found at that point.
for (int index = start; index <= searched.Length - find.Length; ++index)
{
// Assume the values matched.
matched = true;

// Search in the values to be found.
for (int subIndex = 0; subIndex < find.Length; ++subIndex)
{
// Check the value in the searched array vs the value
// in the find array.
if (find[subIndex] != searched[index + subIndex])
{
// The values did not match.
matched = false;

// Break out of the loop.
break;
}
}

// If the values matched, return the index.
if (matched)
{
// Return the index.
return index;
}
}

// None of the values matched, return -1.
return -1;
}

Of course, this could probably be optimized, but this is the general
idea.

Hope this helps.
 
I

Ignacio Machin \( .NET/ C# MVP \)

Hi,



Vince Panuccio said:
Hello,

If im looking for a pattern of byes in a byte array, what would be the
best approach?

I could convert the array into a string and use IndexOf recursivly by
remembering the last position I encountered a match, but it seems to
be the wrong way to go about it, especially if I have an array a
couple of MB in size.

I don't like that idea neither, at the end you do have a byte array, not a
string.
My initial thoughts are to just use a for loop and check for a byte
matching the first instance of the search pattern.

That will work and not only that but whatever method you use either you or
the framework will do a loop in the array.
The code should not be too complex anyway. Just look the first byte and then
look if the others are a match.
 
C

Christian Stapfer

Vince Panuccio said:
Hello,

If im looking for a pattern of byes in a byte array,
what would be the best approach?

By "pattern" you seem to mean a substring? Right?
(Because you could, conceivably, mean something like
a substring that matches a regular expression or something
like that).
Well, efficient solutions to the substring matching
problem are well known: Karp-Rabin or Knuth-Morris-Pratt.
Just search the net for an implementation
(e.g. http://www-igm.univ-mlv.fr/~lecroq/string/node7.html#SECTION0070)

If, however, you mean searching for a regular string
pattern, you should, ideally, compile a finite-automaton
that does the matching. The important property, as regards
efficiency for large master strings, is that you never
have to check the *same* character (byte in your case)
more than once.

Regards,
Christian
 
C

Christian Stapfer

Christian Stapfer said:
By "pattern" you seem to mean a substring? Right?
(Because you could, conceivably, mean something like
a substring that matches a regular expression or something
like that).
Well, efficient solutions to the substring matching
problem are well known: Karp-Rabin or Knuth-Morris-Pratt.
Just search the net for an implementation
(e.g. http://www-igm.univ-mlv.fr/~lecroq/string/node7.html#SECTION0070)

Sorry, I meant to give you this link:
http://www-igm.univ-mlv.fr/~lecroq/string/node14.html#SECTION00140
and this link:
http://www-igm.univ-mlv.fr/~lecroq/string/node5.html#SECTION0050
as well.

Translating these C-language implementations to
C# shouldn't be that difficult.

Regards,
Christian
 
V

Vince Panuccio

Sorry, I meant to give you this link:http://www-igm.univ-mlv.fr/~lecroq/string/node14.html#SECTION00140
and this link:http://www-igm.univ-mlv.fr/~lecroq/string/node5.html#SECTION0050
as well.

Translating these C-language implementations to
C# shouldn't be that difficult.

Regards,
Christian


I am actually talking about searching for a string made up of ASCII
characters.

Ignacio, I understand what your saying but to me a byte array and a
string are the same thing. Is a string not just an array of characters
but only refered to differently? Is a character array not just an
array of byes represented differently?


I think I'll go with Christian's approach here. The Boyer-Moore
algorithm seems like the way to go.

Thanks for all your insights.
 
J

Jon Skeet [C# MVP]

I am actually talking about searching for a string made up of ASCII
characters.

But is the byte array *also* just an encoded string?
Ignacio, I understand what your saying but to me a byte array and a
string are the same thing.

They're definitely not.
Is a string not just an array of characters but only refered to differently?

Yes - although unicode "funnies" mean that the same characters can be
represented in different ways, considering normalised forms etc.
Is a character array not just an array of byes represented differently?

The "represented differently" is the key here. You shouldn't take
arbitrary binary data and treat it as an ASCII string, for instance.
If the byte array has been generated by taking a string and applying a
particular encoding, that's fine - but otherwise, you're likely to
lose data.

Jon
 
M

Marc Gravell

CLR strings are always unicode; for strings of ASCII characters then
yes: you can *convert* it uniquely to a byte-array - but this requires
an additional step to copying the data using your chosen encoding.
The boyer-moore algorithm sounds promising, but if you can work with
the string as a char[] rather than a byte[] you will save the
unnecessary work. The algorithm seems to talk about characters, not
bytes, so I can't see this being a problem. But not my area ;-p

Marc
 

Ask a Question

Want to reply to this thread or ask your own question?

You'll need to choose a username for the site, which only take a couple of moments. After that, you can post your question and our members will help you out.

Ask a Question

Top