Binary Read Method?

  • Thread starter: ShaneO

ShaneO

Hello,

I wish to extract embedded string data from a file using a Binary Read
method.

The following code sample is used in VB.NET and similar code is used in
VB6 -

(Assume variable declarations etc.)
FileOpen(iFileIn, sInputFile, OpenMode.Binary, OpenAccess.Read)
iRecordEndAddress = iRecordCount * iRecordSize
For iRecordStartAddress = 1 To iRecordEndAddress Step iRecordSize
    FileGet(iFileIn, sData, iRecordStartAddress)
    sA = Trim(Strings.Left(sData, 8))
    sB = Trim(Strings.Mid(sData, 10, 60))
    ..
    ..
    ..
    sOutPutText &= sA & "," & sB & vbCrLf
Next
FileClose(iFileIn)


On the same datafile the VB6 app does the job in <2 secs, however in
VB.NET it takes >15 secs. Now I'm not getting into the issues
surrounding performance between the two languages, but I would like to
know what others suggest as the best/quickest way to perform such a task
under VB.NET (2005).

I've tried the obvious My.Computer.FileSystem.ReadAllBytes and
FileStream methods however any possible speed advantages are lost in
converting the input-stream back into String Characters for my OutPut
Text - unless someone can give me a quick way to do that!

Any suggestions (apart from going back to VB6) would be appreciated.

ShaneO

There are 10 kinds of people - Those who understand Binary and those who
don't.
 
I would be inclined to use a StreamReader something like this:

' Needs Imports System.IO and Imports System.Text
Dim _sr As New StreamReader(sInputFile)

Dim _chars(iRecordSize - 1) As Char

Dim _text As New StringBuilder

While _sr.Peek() >= 0
    _sr.Read(_chars, 0, _chars.Length)
    _text.AppendFormat("{0},", (New String(_chars, 0, 8)).Trim)
    _text.AppendFormat("{0},", (New String(_chars, 9, 60)).Trim)
    ...
    ' For the last 'field', do not append a comma
    _text.AppendFormat("{0}", (New String(_chars, x, y)).Trim)
    _text.Append(Environment.NewLine)
End While

_sr.Close()

sOutPutText = _text.ToString

You will find that repeated operations on a StringBuilder object are far
more efficient than the equivalent operations on String objects.
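
A rough way to check this on your own machine, as a minimal sketch: the module name, the record count and the sample text are made up for illustration, and the timings will vary from machine to machine.

```vbnet
Imports System
Imports System.Diagnostics
Imports System.Text
Imports Microsoft.VisualBasic

Module ConcatVersusBuilder
    ' Builds the text with String &=, which allocates a new String on every pass.
    Function BuildWithString(ByVal count As Integer) As String
        Dim result As String = ""
        For i As Integer = 1 To count
            result &= "ABCDEFGH" & "," & "some text" & vbCrLf
        Next
        Return result
    End Function

    ' Builds the same text with a StringBuilder, which appends into its own buffer.
    Function BuildWithBuilder(ByVal count As Integer) As String
        Dim sb As New StringBuilder()
        For i As Integer = 1 To count
            sb.Append("ABCDEFGH").Append(","c).Append("some text").Append(vbCrLf)
        Next
        Return sb.ToString()
    End Function

    Sub Main()
        Dim count As Integer = 10000   ' hypothetical record count

        Dim watch As Stopwatch = Stopwatch.StartNew()
        Dim viaString As String = BuildWithString(count)
        Console.WriteLine("String &=:     {0} ms", watch.ElapsedMilliseconds)

        watch = Stopwatch.StartNew()
        Dim viaBuilder As String = BuildWithBuilder(count)
        Console.WriteLine("StringBuilder: {0} ms", watch.ElapsedMilliseconds)

        Console.WriteLine("Same output: {0}", viaString = viaBuilder)
    End Sub
End Module
```

Both functions produce identical text, so any difference in the two timings comes from how the text is accumulated.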

You could also create an array of 'field' lengths and an array of 'field'
start positions and use those in a loop like this:

' Make sure that _fieldstarts and _fieldlengths are the same length
Dim _fieldstarts As Integer() = New Integer() {0, 9, ... , n}
Dim _fieldlengths As Integer() = New Integer() {8, 60, ... , n}

While _sr.Peek() >= 0
    _sr.Read(_chars, 0, _chars.Length)
    Dim _i As Integer
    For _i = 0 To _fieldstarts.Length - 2
        _text.AppendFormat("{0},", (New String(_chars, _fieldstarts(_i), _fieldlengths(_i))).Trim)
    Next
    ' When the inner loop finishes, _i points to the final element of
    ' _fieldstarts and _fieldlengths
    ' For the last 'field', do not append a comma, but do append a cr/lf pair
    _text.AppendFormat("{0}{1}", (New String(_chars, _fieldstarts(_i), _fieldlengths(_i))).Trim, Environment.NewLine)
End While
 
Stephany said:
I would be inclined to use a StreamReader something like this: ...

Thank-you Stephany for your very thorough answer.

I did try similar but still found the constant looping (around 100K
records) was still clobbering performance. I will, however, follow your
example more precisely and test if it works quicker.

In the meantime, I've been experimenting again with using ReadAllBytes
and have found some tweaks to gain some speed improvements. One in
particular and as you mentioned, String objects are not too efficient on
repeated operations, so simply removing the following line -

sOutPutText &= sA & "," & sB & vbCrLf

and modifying it to (an already Open File) -

Print(iFileOut, sA & "," & sB & vbCrLf)

has had a staggering 50% reduction in the overall execution time! (Now
down to >5 secs with other tweaks).

Do you know of a quick method to transfer a consecutive block of bytes
(stored in a Byte Array) into a String? If I could find that I believe
I'd be able to deliver satisfactory performance, as this is currently my
bottleneck. I've looked at System.Text.Encoding.Unicode.GetString but
can't seem to make it work properly!

ShaneO

 
Glad I could help.

What would be interesting from your point of view is to find what 'bits' are
taking the time.

For example:

How long does it take just to read the input file?

Dim _start As DateTime = DateTime.Now
Dim _sr As New StreamReader(sInputFile)
Dim _chars(iRecordSize - 1) As Char
While _sr.Peek() >= 0
    _sr.Read(_chars, 0, _chars.Length)
End While
_sr.Close()
Console.WriteLine(DateTime.Now.Subtract(_start).TotalMilliseconds)

Then add in various 'bits' and take note of the elapsed time.

You will soon find where the bottlenecks are and can concentrate on
techniques to reduce those.
 
Stephany said:
Glad I could help.

I thought you'd be interested in what I finally came up with. I used
mostly what you provided, plus a bit of modifying, and ended up with -

(Assume some variable declarations)
Dim chInputChars(iRecordSize - 1) As Char
Dim sbOutPutText As New System.Text.StringBuilder

Using srInputFile As StreamReader = New StreamReader(sInputFileName, System.Text.Encoding.ASCII)
    Do While srInputFile.Peek() >= 0
        srInputFile.Read(chInputChars, 0, chInputChars.Length)
        sA = (New String(chInputChars, 0, 8)).Trim
        sB = (New String(chInputChars, 9, 60)).Trim
        sbOutPutText.Append(sA & "," & sB & vbCrLf)
    Loop
    srInputFile.Close()
End Using
My.Computer.FileSystem.WriteAllText(sOutPutFileName, sbOutPutText.ToString, False)


I'm wrapping everything inside a "Using" statement as I'm actually
reading from more than one file in this section of code, so it allows me
to use the same Variable name (srInputFile) a little later. Also, the
"System.Text.Encoding.ASCII" is critical otherwise the ".Read" statement
wouldn't work properly (??) I also found it quicker to use ".Append" in
the manner that I show, rather than ".AppendFormat".
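
On the encoding point, one likely explanation (an assumption, since the thread never shows the file's bytes): a StreamReader created without an encoding decodes the file as UTF-8, where one char can span several bytes, so reading iRecordSize chars no longer lines up with iRecordSize bytes. With a single-byte encoding such as ASCII, every byte becomes exactly one char. A small sketch with made-up sample bytes:

```vbnet
Imports System
Imports System.Text

Module EncodingSketch
    Sub Main()
        ' Hypothetical 4-byte sample: "A", the two-byte UTF-8 sequence for e-acute, "B"
        Dim sample As Byte() = New Byte() {&H41, &HC3, &HA9, &H42}

        ' UTF-8 folds the two middle bytes into one char: 3 chars from 4 bytes
        Console.WriteLine(Encoding.UTF8.GetCharCount(sample))

        ' ASCII maps every byte to exactly one char: 4 chars from 4 bytes
        ' (bytes above &H7F come out as "?", so ASCII suits 7-bit text only)
        Console.WriteLine(Encoding.ASCII.GetCharCount(sample))
    End Sub
End Module
```

If the text fields can contain bytes above &H7F that must be preserved, a different single-byte encoding would be needed; which one depends on how the file was written.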

I've never ventured much into the StreamReader but now you've whetted my
appetite I believe I'll use it wherever possible in future!

Finally, the GREAT news is that the timing for what I'm doing is now <1
sec, which is more than twice as fast as what I was achieving in VB6 and
around 25 times faster than where I was when I started this thread.

I've also included the following routine that I use for timing sections
of code, maybe someone will find it useful.

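' Assumes a module-level declaration: Private CodeTimer As Stopwatch
''' <summary>
''' First call starts the CodeTimer. Second call returns elapsed milliseconds.
''' Sample usage: (on 2nd call) Debug.Print(TimeSection)
''' </summary>
''' <returns>0 on the starting call; elapsed milliseconds on the stopping call.</returns>
Function TimeSection() As Double
    ' The IsNot Nothing check avoids a NullReferenceException on the very first call
    If CodeTimer IsNot Nothing AndAlso CodeTimer.IsRunning Then
        CodeTimer.Stop()
        Return CodeTimer.ElapsedMilliseconds
    Else
        CodeTimer = Stopwatch.StartNew
        Return 0
    End If
End Function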


Thanks again Stephany. It's input like yours that makes these NGs
worthwhile.

ShaneO

 
Cor said:
Shane,

I have not any idea about the time aspect, but I surely would use the binary
reader in your case.

http://msdn2.microsoft.com/en-us/library/system.io.binaryreader.aspx

I hope this helps,

Cor

Thank-you Cor, I was however able to resolve this one with help from
Stephany (see post above).

There is certainly some merit in using the Binary Reader, and I did try
this also, but for absolute simplicity (and speed) the StreamReader
solution has resulted in exactly what I wanted.

Regards,

ShaneO

 
Tres cool :)

Now it's time for Strings 101. (And no, I didn't mean 101 Strings which was
the name of a very good orchestra for those who didn't know that, or didn't
want to know that.)

Because a String object is 'immutable' every time we do an operation that
'changes' it or assigns its value to something else, we actually create a
new string. In a lot of cases this is hardly noticeable, however when we have
a lot of such operations happening in a fairly short space of time (a tight
loop for instance) the overhead inherent in handling strings soon makes its
presence felt.

Take for example:

Dim _s As String = (New String(chInputChars, 0, 8)).Trim

The 'New String(chInputChars, 0, 8)' creates one string, the Trim method
returns a second string and the assignment to _s creates yet a third string.

Now multiply that by however many 'fields' you have in your 'record' and
then multiply the result by the number of 'records' and the number of new
String objects created inside the loop is not insignificant. If you have 10
'fields' and 100,000 'records' then that is 3,000,000 new strings. Not only
are they created, they also have to be dealt to by the garbage collector.

I assume from your code that you may have extraneous trailing whitespace on
any given 'field' and that it, in fact, does need to be trimmed off. This
means that you do need the Trim operation which needs a String object so
there are 2 new strings per 'field' that you can't do away with.

If the Trim operation is not, in fact necessary, then doing away with it
will save 10 operations per 'record' which is 1,000,000 operations over the
process. Now we have only 2,000,000 new strings which is a significant
saving.

Now the question has to be, are you doing anything else with the variables
sA, sB, etc., or are you just using them as a convenience? If it is the
latter then modifying:

sbOutPutText.Append(sA & "," & sB & vbCrLf)

to:

sbOutPutText.Append(New String(chInputChars, 0, 8) & "," & New String(chInputChars, 9, 60) & vbCrLf)

means that, for our 100,000 'records' of 10 'fields' each, the number of new
strings is now reduced to 1,000,000 for the entire process, an even more
significant saving.

So the loop would now become:

Do While srInputFile.Peek() >= 0
    srInputFile.Read(chInputChars, 0, chInputChars.Length)
    sbOutPutText.Append(New String(chInputChars, 0, 8) & "," & New String(chInputChars, 9, 60) & "," & ... & vbCrLf)
Loop

Try it and see how you get on.
 
This is slow:
sbOutPutText.Append(New String(chInputChars, 0, 8) & "," & New String(chInputChars, 9, 60) & vbCrLf)

This is MUCH faster (about 300-400%):

sbOutPutText.Append(chInputChars, 0, 8)
sbOutPutText.Append(","c)
sbOutPutText.Append(chInputChars, 9, 60)
sbOutPutText.Append(vbCrLf)
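
Where the fields still need trimming, the same overload can be combined with a manual scan for the first and last non-whitespace chars. This is only a sketch: AppendTrimmed is a hypothetical helper that does not appear in the thread, and the sample record is made up.

```vbnet
Imports System
Imports System.Text
Imports Microsoft.VisualBasic

Module TrimmedAppendSketch
    ' Hypothetical helper: appends the slice of chars that begins at start and
    ' runs for length chars, minus leading and trailing whitespace, without
    ' building an intermediate String.
    Sub AppendTrimmed(ByVal sb As StringBuilder, ByVal chars As Char(), _
                      ByVal start As Integer, ByVal length As Integer)
        Dim first As Integer = start
        Dim last As Integer = start + length - 1
        While first <= last AndAlso Char.IsWhiteSpace(chars(first))
            first += 1
        End While
        While last >= first AndAlso Char.IsWhiteSpace(chars(last))
            last -= 1
        End While
        If last >= first Then
            sb.Append(chars, first, last - first + 1)
        End If
    End Sub

    Sub Main()
        ' Made-up record: an 8-char field, a NUL delimiter, then a 12-char field
        Dim record As Char() = ("AB".PadRight(8) & Chr(0) & " hello world").ToCharArray()
        Dim sb As New StringBuilder()
        AppendTrimmed(sb, record, 0, 8)
        sb.Append(","c)
        AppendTrimmed(sb, record, 9, 12)
        Console.WriteLine(sb.ToString())   ' AB,hello world
    End Sub
End Module
```

This only helps when the trimmed text goes straight to the output; if a field has to be inspected or altered first, a String is still needed for it.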
 
Aha ... You spotted the deliberate mistake :)

Just goes to show how easy it is to throw strings about willy-nilly and not
bother checking out all the overloads of methods that are available.
 
Stephany said:
Take for example:

Dim _s As String = (New String(chInputChars, 0, 8)).Trim

I assume from your code that you may have extraneous trailing whitespace...

Yes, the "strings" do have varying amounts of whitespace that needs to
be removed.

Stephany said:
Now the question has to be, are you doing anything else with the variables
sA, sB, etc., ......

You guessed it, there are other things being done with these variables.
They are being interrogated for certain values, possibly being
altered, and then added to a Structure so I do need to separate these
variables out for this purpose.

One other question you might be able to answer for me -

Using the same routine, how would you extract a numeric Double in
addition to Strings? Any ideas? I've started to look at
Buffer.BlockCopy to take the data from the Char Array and put it into a
Byte Array and then convert to a Double variable, however, I wonder if
you (or anyone else) knows of a simpler method? I don't believe there's
a simple "ConvertCharArrayToDouble" method, if there is, I haven't found
it!!

Thank-you for all your assistance so far, it is greatly appreciated.

ShaneO

 
Typical Aussie - Bowl an underarm when you're not looking :)

Up to now you have given the impression that the file contained 'records' of
fixed length 'fields' of purely textual data.

Now you are implying that the file contains the binary representation of
various data types.

Is this the case?

If so, then you need to be reading the data from the file as bytes rather
than chars or strings.

You have also implied that the input files are not so big that you can't
fit an entire file into memory. Realistically, what is the biggest file (in
bytes) that you need to deal with?

Assuming that you can fit it into memory then the
System.IO.File.ReadAllBytes() method will read the entire file into an array
of bytes in a single chunk:

Dim _bytes as Byte() = File.ReadAllBytes(_filename)

You will also need a 'pointer' that always indicates the next byte to be
dealt with:

Dim _pointer as Integer = 0

Your processing loop now becomes:

While _pointer < _bytes.Length
End While

Inside the loop, for each 'field', you need to deal with the appropriate
number of bytes as the expected type and advance the pointer accordingly:

While _pointer < _bytes.Length
sA = Encoding.ASCII.GetString(_bytes, _pointer, 8).Trim
_pointer += 9

sB = Encoding.ASCII.GetString(_bytes, _pointer, 60).Trim
_pointer += 60
...
End While

If you need to deal with a double then BitConverter is your friend:

Dim _d As Double = BitConverter.ToDouble(_bytes, _pointer)
_pointer += 8
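
Where loading a whole file with ReadAllBytes is too heavy, the same idea can be applied one record at a time. The following is a minimal sketch under assumptions that are not from the thread: a hypothetical 18-byte record made of an 8-byte ASCII text field, a NUL delimiter, an 8-byte Double and a trailing NUL. The names and the layout are illustrative, and BitConverter reads the Double in the machine's own byte order.

```vbnet
Imports System
Imports System.IO
Imports System.Text

Module RecordReaderSketch
    ' Hypothetical layout: 8 text bytes + NUL + 8-byte Double + NUL
    Const RecordSize As Integer = 18

    ' Decodes one record into "name,value".
    Function ParseRecord(ByVal rec As Byte()) As String
        Dim name As String = Encoding.ASCII.GetString(rec, 0, 8).Trim()
        ' Offset 9 skips the NUL delimiter at offset 8
        Dim value As Double = BitConverter.ToDouble(rec, 9)
        Return name & "," & value.ToString(System.Globalization.CultureInfo.InvariantCulture)
    End Function

    ' Reads the stream one fixed-length record at a time, so only a single
    ' record needs to be held in memory however large the file is.
    Sub ConvertStream(ByVal source As Stream, ByVal sink As TextWriter)
        Dim rec(RecordSize - 1) As Byte
        Do
            Dim got As Integer = 0
            ' Stream.Read may return fewer bytes than requested, so keep filling
            While got < RecordSize
                Dim n As Integer = source.Read(rec, got, RecordSize - got)
                If n = 0 Then Exit While
                got += n
            End While
            If got < RecordSize Then Exit Do   ' end of data; a partial record is ignored
            sink.WriteLine(ParseRecord(rec))
        Loop
    End Sub

    Sub Main()
        ' Build one made-up record in memory instead of touching a real file
        Dim sample(RecordSize - 1) As Byte
        Encoding.ASCII.GetBytes("AB".PadRight(8), 0, 8, sample, 0)
        Buffer.BlockCopy(BitConverter.GetBytes(1.5R), 0, sample, 9, 8)
        ConvertStream(New MemoryStream(sample), Console.Out)   ' AB,1.5
    End Sub
End Module
```

For a real file the MemoryStream would be replaced by a FileStream opened for reading, and the output writer by a StreamWriter on the output file.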

You are not going to be able to expect lightning speed because of the
processing that needs to be done.

Referring back to your original post, I note that you also imply that the
'fields' are not contiguous. It appears that the first 'field' is from
position 1 thru position 8 but the second 'field' is from position 10 to
position 69. If this is correct, what is the value of the byte at position 9
and what is its purpose?
 
Stephany said:
Up to now you have given the impression that the file contained 'records' of
fixed length 'fields' of purely textual data.

Now you are implying that the file contains the binary representation of
various data types.

Is this the case?

Yes, for the original file it's all Text, but I'm now trying to apply my
new-found StreamReader/StringBuilder methods to another file which
contains both Text and Numeric fields. (I did mention I'd do this ;-) )

Stephany said:
If so, then you need to be reading the data from the file as bytes rather
than chars or strings.

Yes, I agree.

Stephany said:
You have also implied that the input files are not so big that you can't
fit an entire file into memory. Realistically, what is the biggest file (in
bytes) that you need to deal with?

In this case, the largest file should not exceed 200MB.

Stephany said:
Assuming that you can fit it into memory then the
System.IO.File.ReadAllBytes() method will read the entire file into an array
of bytes in a single chunk:

Dim _bytes as Byte() = File.ReadAllBytes(_filename)

Hmmmm..... Not all machines have sufficient "available" memory.

Stephany said:
You are not going to be able to expect lightning speed because of the
processing that needs to be done.

As also mentioned in an earlier post, an existing VB6 app (which I
developed some time ago) does this VERY fast, however the almost
identical code in VB.NET is a dog. I have faith that, with the right
method, VB.NET will equal or exceed the performance I've experienced
under VB6. It's already been proven with the earlier code you so
graciously provided. And before anyone asks, I'm converting this app to
VB.NET because the clients require some considerable enhancements to the
original app and .NET is clearly the best option for what they need.
It's just letting me down in this single area.

Stephany said:
Referring back to your original post, I note that you also imply that the
'fields' are not contiguous. It appears that the first 'field' is from
position 1 thru position 8 but the second 'field' is from position 10 to
position 69. If this is correct, what is the value of the byte at position 9
and what is its purpose?

The datafile is a proprietary format (hence the need for proprietary
Read methods) with fixed field lengths and uses chr$(0) as a delimiter
between fields.

I certainly don't expect you to write the code for me, your help so far
has been invaluable as until now I've been reluctant to venture into the
StreamReader and StringBuilder methods. I can clearly see the path I
need to take from here. I guess I just needed a push in the right
direction.

ShaneO

 
Stephany said:
Typical Aussie - Bowl an underarm when you're not looking :)

Oh, and as you're a Kiwi (?), I might sometimes refer to VBsux! :-)

ShaneO

 
Back to the original post again.

So the 'fields' in a 'record' are separated by NUL (&H0, 0x0, Chr(0),
Chr$(0) or whatever you want to call it).

Are the 'records' separated by anything special, perhaps 2 consecutive
NULs?

If so then I would be inclined to take a completely different approach that
would be far more efficient.
 
Stephany said:
Back to the original post again.

So the 'fields' in a 'record' are separated by NUL (&H0, 0x0, Chr(0),
Chr$(0) or whatever you want to call it).

Are the 'records' separated by anything special, perhaps 2 consecutive
NULs?

No, the last field of one record is only separated by &H0 from the first
field of the next record. Each record is of a fixed length however.
You're considering Binary Block-Grabs at the data?

ShaneO

 
Stephany said:
And you can have fush n chups but not VBsex !!!!!!!!!!!!!!

That's very funny - So funny I nearly fell off my Chilly-Bin!!!

ShaneO

 
And the last 'record' in the file obeys the fixed-length rule, i.e., the
last byte in the file is &H0?

And the next question is, what does the varying amount of whitespace consist
of? Is it a sequence of spaces (&H20)? If so, does a sequence of 2 spaces
occur anywhere else other than within the 'whitespace' areas?
 