VB.NET Really slow filestream

Frank:

Hello all, I'm new here and I've searched a lot before posting my question,
but I can't seem to find a solution to my problem.

I'm working on a database program that I created myself, with my own file
format that stores information about clients. So far, so good, but searching
the content of the file is where it becomes slow. I check whether a byte is
greater than 0 in every client's profile; there are 100,000 clients and the
file is about 200 MB.

It goes something like this....

Dim objFile As New System.IO.FileStream("Clients.pcf", IO.FileMode.Open)

For i = 0 To 99999
    If Results(i) Then
        objFile.Position = (Jump * i) + position
        If objFile.ReadByte() > 0 Then
            Results(i) = False
        End If
    End If
Next

Results() is an array of Boolean values that indicates whether a client
matches what I'm looking for. It is used by several different Subs and
passed as an argument.
Jump is the length of each client's profile in bytes.
Position is the offset of the information within the client's profile.

My computer is old (1 GHz), but still, a task like that takes almost 30
seconds. I found out that this line slows everything down: objFile.Position
= (Jump * i) + position. But a multiplication is not supposed to take that
long (I've timed it separately), and if I click the button that handles
that event a second time, it takes less than a second to get my results.

I've even looked at the CPU usage, and it only goes up on the second
run. What's happening? Is VB copying my file somewhere?

I must say that I'm more than confused right now. I'm sorry if this is a
long post, but it's a pretty complex situation and I would appreciate every
suggestion.
 
Frank said:
[snip]
My computer is old (1 GHz), but still, a task like that takes almost 30
seconds... I found out that this line slows everything down:
objFile.Position = (Jump * i) + position, but a multiplication is not
supposed to take that long.
[snip]

It is not the multiply that takes the time. It is the seek caused by
setting objFile.Position (and the read that follows it) that is taking the
time.

LS
 
Frank said:
[snip]
My computer is old (1 GHz), but still, a task like that takes almost
30 seconds...
[snip]

Usually a whole cluster (typically 4,096 bytes) is read, not just 1 byte.
At the application level you read only 100,000 bytes, but at the file
system level it is roughly 390 MB (100,000 x 4,096), and physically the
whole file gets read (some read operations hit the same cluster). (Perhaps
not technically exact, but you get the point.) Including some overhead,
that can take a while. Though, 30 seconds does seem a bit long; I don't
know why. The second time, everything is already in the OS's cache, so it's
extremely fast.

(I experienced the same thing trying to read the chunks of an AVI file.
Many reads of very few bytes couldn't possibly be that slow, I thought.)


Armin
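
Armin's cache explanation matches what Frank observed: the first pass is
disk-bound, the second is served from memory. A rough way to see this is a
standalone timing sketch (not part of Frank's program; it assumes
Clients.pcf exists and that Stopwatch from .NET 2.0 is available). Run it
twice in a row and compare the times:

```vbnet
Imports System.Diagnostics
Imports System.IO

Module CacheTiming
    Sub Main()
        Dim sw As Stopwatch = Stopwatch.StartNew()
        Using fs As New FileStream("Clients.pcf", FileMode.Open)
            Dim buffer(65535) As Byte
            ' Read the whole file sequentially. The first run pays the
            ' disk cost; a second run is served from the OS file cache.
            While fs.Read(buffer, 0, buffer.Length) > 0
            End While
        End Using
        sw.Stop()
        Console.WriteLine("Full pass took {0} ms", sw.ElapsedMilliseconds)
    End Sub
End Module
```

If the second run is dramatically faster, the bottleneck is the disk, not
the VB code.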
 
Well, then how do I search faster? What can I do to solve the problem?

Like others have said, use a real database :-)

Or use XML, which you can read in via an XML reader. However, searching
large XML files is still quite slow, due to the lack of keys.

Or, as Armin pointed out, read the data in large blocks: rather than
reading a single byte at a time, read a large chunk.

Or you can maintain an index of the data, like a real DBMS does.

But in general, you're reinventing the wheel. Database systems take years
to perfect and are generally something people avoid writing themselves.
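
The "read in large blocks" suggestion could look something like this, a
sketch using the names from Frank's snippet (Jump, position, Results());
the chunk size is arbitrary and error handling is omitted:

```vbnet
' Reads many records per disk access instead of one byte per seek.
' Jump, position and Results() have the meanings Frank described.
Const RecordCount As Integer = 100000
Const ChunkRecords As Integer = 1024
Dim buffer((ChunkRecords * Jump) - 1) As Byte

Using objFile As New System.IO.FileStream("Clients.pcf", IO.FileMode.Open)
    Dim i As Integer = 0
    While i < RecordCount
        Dim count As Integer = Math.Min(ChunkRecords, RecordCount - i)
        objFile.Position = CLng(Jump) * i
        ' One big read per chunk. (A production version would check
        ' the number of bytes Read actually returned.)
        objFile.Read(buffer, 0, count * Jump)
        For j As Integer = 0 To count - 1
            If Results(i + j) AndAlso buffer((j * Jump) + position) > 0 Then
                Results(i + j) = False
            End If
        Next
        i += count
    End While
End Using
```

Reading 1,024 records at a time turns 100,000 tiny reads into about a
hundred large sequential ones, which is the access pattern the file system
is good at.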
 
I think the advice to turn to a real DB is sound; this is the kind of thing
they do best. But if you want to roll your own, there are a few things you
might want to look at.

You say the file stream is really slow; compared to what? Have you actually
implemented this file structure in a different language so that you have
something to compare against? Based on your description, the obvious
problem (to me, anyway) is that you are I/O bound. The fact that a second
pass (after the OS has cached the file) takes only a second clearly says
where the problem lies. Still, 30 seconds to read 200 MB seems a bit long;
could the file be badly fragmented? You might want to defragment the drive
to see if it improves the search time. If that does not help, maybe a
faster hard drive will!

If none of the above helps (enough), and you still want to do your own file
structure, consider keeping a separate, parallel record structure for each
client that maintains the state of your Boolean flags, or any other search
criteria that are time-critical. The idea here is to have a much smaller
file to read that will still let you determine which records match your
criteria. Of course, you then have the problem of keeping the two files in
sync.
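
The parallel-structure idea above could look like this. "Clients.flags" and
its one-byte-per-client layout are hypothetical, made up for illustration,
and would have to be rewritten whenever the corresponding byte in the main
file changes:

```vbnet
' Searching now reads 100,000 bytes in one go instead of seeking
' through a 200 MB file. (File.ReadAllBytes requires .NET 2.0; on
' earlier versions a single FileStream.Read does the same job.)
Dim flags As Byte() = System.IO.File.ReadAllBytes("Clients.flags")
For i As Integer = 0 To flags.Length - 1
    If Results(i) AndAlso flags(i) > 0 Then
        Results(i) = False
    End If
Next
```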
 