File searching is really slow, how do I speed it up?

janderson

Hi gang,

I have a Win32 C# app that needs to recursively search for a particular
file type really fast. It always searches the same place for these files.

I'm currently using Directory.GetDirectories and Directory.GetFiles to do
it. Because I know the location is the same every time, I build a cache
of all the files and directories. If a directory changes, I update its
contents on startup of the app.

Unfortunately, when I do a last-modification-date check on a directory,
it only reports changes to files that are directly under it. So I have to
check every single directory rather than simply checking the parent
directory.

I do get a significant improvement when none of the directories change
(it's much faster to load a file with all the entries in it than to do a
directory search). The problem is the first time the app loads and has to
search every file (around 100,000 and growing), which is painfully slow
(2 or 3 minutes to start up the app).
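
Roughly, the scan-and-cache part looks like this (a simplified sketch,
not my actual code; the cache path and format are just placeholders):

// Simplified sketch of the recursive scan plus cache described above.
// The cache file location and format are placeholders.
using System.Collections.Generic;
using System.IO;

class FileCache
{
    const string CachePath = @"C:\temp\filecache.txt"; // placeholder

    // Full rescan: walk every directory under 'root' and record matching
    // files. Note: Directory.GetLastWriteTime(dir) only changes when an
    // entry directly under 'dir' is added, removed or renamed, which is
    // why every directory has to be checked individually.
    public static List<string> Scan(string root, string pattern)
    {
        List<string> results = new List<string>();
        results.AddRange(Directory.GetFiles(root, pattern));
        foreach (string dir in Directory.GetDirectories(root))
            results.AddRange(Scan(dir, pattern));
        return results;
    }

    public static void Save(List<string> files)
    {
        File.WriteAllLines(CachePath, files.ToArray());
    }

    public static List<string> Load()
    {
        return new List<string>(File.ReadAllLines(CachePath));
    }
}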

What techniques/APIs can I use to speed up file searching?

Is there a way to more directly look at the raw data that make up the
file-tables?

Cheers
 
[...]
What techniques/APIs can I use to speed up file searching?

Well, one thing that comes to mind is that perhaps there is an API, .NET
or native Windows, that has access to the Windows index-based searching.
I don't know whether there is or not, but it's worth considering.
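
For example, if the Windows Search (or Windows Desktop Search) service is
installed and the location is within its indexed scope, the index can be
queried over OLE DB. A rough sketch (the scope and extension here are
made up):

// Sketch: querying the Windows Search index via OLE DB. Assumes the
// Windows Search service is installed and the directory is indexed.
using System;
using System.Data.OleDb;

class IndexQuery
{
    static void Main()
    {
        string connString =
            "Provider=Search.CollatorDSO;Extended Properties='Application=Windows'";
        string query =
            "SELECT System.ItemPathDisplay FROM SystemIndex " +
            "WHERE SCOPE='file:C:/MyProject' AND System.FileExtension='.xyz'";

        using (OleDbConnection conn = new OleDbConnection(connString))
        using (OleDbCommand cmd = new OleDbCommand(query, conn))
        {
            conn.Open();
            using (OleDbDataReader reader = cmd.ExecuteReader())
            {
                while (reader.Read())
                    Console.WriteLine(reader.GetString(0));
            }
        }
    }
}
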
Is there a way to more directly look at the raw data that make up the
file-tables?

Surely there is, but I'm sure you can't do it via .NET directly, nor am I
at all convinced it would significantly speed up your search, since I
suspect that Windows is already pulling the "raw data" off the disk about
as fast as is possible, and feeding it to you with relatively little
overhead.

Beyond that, I have to question a design that requires the search of
100,000 files or more during startup. At the very least, if you feel
that's necessary, I think you'll simply have to take the performance hit.
That's a *lot* of files, and a lot of i/o. I/o is slow, and highly
dependent on the device. The best way to speed up your design is to
change it so that you don't have to look at every file in every directory
in the first place.

How you might do that, I don't know. I don't know why you have to search
for this particular file on startup, so I can't suggest ways you could fix
the design so that's not required.

On that point, keep in mind that with such a lengthy search, you really
have no guarantee when the search is over that you have accurate results.
For example, once you've checked a directory, it could be that the file
gets created after that, while you're checking the rest of the
directories. Or maybe you detect a file that gets removed before your
search is done.

Again, since I don't know exactly how the file gets created or deleted, or
why it is you want to look for it at startup, I can't really comment on
the exact nuances of searching for it. But the above issues would apply
in the general case, at least.

If in the end you decide you don't want to change the design, you may want
to consider including some sort of auxiliary background process that
essentially indexes the interesting directories, and can tell your main
process immediately upon startup whether the file you're interested in
exists. That would be sort of like using the built-in Windows indexing,
but wouldn't require that you figure out how to get at it (assuming it's
possible at all).

Pete
 
Peter said:
[...]
What techniques/APIs can I use to speed up file searching?

Well, one thing that comes to mind is that perhaps there is an API, .NET
or native Windows, that has access to the Windows index-based
searching. I don't know whether there is or not, but it's worth
considering.
Is there a way to more directly look at the raw data that make up the
file-tables?

Surely there is, but I'm sure you can't do it via .NET directly [...]

If it's indirect, I'm fine with that.

[...] nor am I at all convinced it would significantly speed up your
search, since I suspect that Windows is already pulling the "raw data"
off the disk about as fast as is possible, and feeding it to you with
relatively little overhead.

Maybe, however I think .NET may be pulling in things I simply don't need.
Beyond that, I have to question a design that requires the search of
100,000 files or more during startup. At the very least, if you feel
that's necessary, I think you'll simply have to take the performance
hit. That's a *lot* of files, and a lot of i/o. I/o is slow, and
highly dependent on the device. The best way to speed up your design is
to change it so that you don't have to look at every file in every
directory in the first place.

It shouldn't be that slow. I know the file system is all over the place,
but the amount of data for 100,000 files is only about 2 MB worth of
filenames. But maybe Windows is doing a good job of the search and it's
just how the table is laid out.
How you might do that, I don't know. I don't know why you have to
search for this particular file on startup, so I can't suggest ways you
could fix the design so that's not required.

On that point, keep in mind that with such a lengthy search, you really
have no guarantee when the search is over that you have accurate
results. For example, once you've checked a directory, it could be that
the file gets created after that, while you're checking the rest of the
directories. Or maybe you detect a file that gets removed before your
search is done.

I'm certain the user won't change things while they are using the tool.
Again, since I don't know exactly how the file gets created or deleted,
or why it is you want to look for it at startup, I can't really comment
on the exact nuances of searching for it. But the above issues would
apply in the general case, at least.

In this design it is necessary. Have you ever used Visual Assist? It
needs to work kind of like that. The whole idea is that the fastest way
to get to something is to simply type it and see the list of
possibilities shrink as you get closer to what you want. It's an
optimal-design thing.
If in the end you decide you don't want to change the design, you may
want to consider including some sort of auxiliary background process
that essentially indexes the interesting directories, and can tell your
main process immediately upon startup whether the file you're interested
in exists. That would be sort of like using the built-in Windows
indexing, but wouldn't require that you figure out how to get at it
(assuming it's possible at all).

This is a good idea. I would probably need to run the service all the
time. There is a way to monitor file-system changes; I may give that a
shot. I hope it doesn't slow the computer down much, like Visual Studio
does. Actually, I noticed Google Desktop Search does this.

Thanks for your thoughts.
 
If it's indirect, I'm fine with that.

Well, there's always p/invoke. If you can find a way in the native Win32
API, you can use p/invoke to call outside of managed code.
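
For instance, the Win32 FindFirstFile/FindNextFile functions return the
name, attributes, and timestamps together in a single call per entry. A
rough sketch of calling them from C# (an illustration only, not drop-in
code for your app):

// Sketch: enumerating files via p/invoke to FindFirstFile/FindNextFile
// instead of Directory.GetFiles + FileInfo.
using System;
using System.Collections.Generic;
using System.IO;
using System.Runtime.InteropServices;

static class NativeScan
{
    const int FILE_ATTRIBUTE_DIRECTORY = 0x10;
    static readonly IntPtr INVALID_HANDLE_VALUE = new IntPtr(-1);

    [StructLayout(LayoutKind.Sequential, CharSet = CharSet.Unicode)]
    struct WIN32_FIND_DATA
    {
        public uint dwFileAttributes;
        public System.Runtime.InteropServices.ComTypes.FILETIME ftCreationTime;
        public System.Runtime.InteropServices.ComTypes.FILETIME ftLastAccessTime;
        public System.Runtime.InteropServices.ComTypes.FILETIME ftLastWriteTime;
        public uint nFileSizeHigh;
        public uint nFileSizeLow;
        public uint dwReserved0;
        public uint dwReserved1;
        [MarshalAs(UnmanagedType.ByValTStr, SizeConst = 260)]
        public string cFileName;
        [MarshalAs(UnmanagedType.ByValTStr, SizeConst = 14)]
        public string cAlternateFileName;
    }

    [DllImport("kernel32.dll", CharSet = CharSet.Unicode, SetLastError = true)]
    static extern IntPtr FindFirstFile(string fileName, out WIN32_FIND_DATA data);

    [DllImport("kernel32.dll", CharSet = CharSet.Unicode, SetLastError = true)]
    static extern bool FindNextFile(IntPtr hFind, out WIN32_FIND_DATA data);

    [DllImport("kernel32.dll", SetLastError = true)]
    static extern bool FindClose(IntPtr hFind);

    // Recursively collects files whose names end with 'extension' (e.g. ".xyz").
    public static void Scan(string dir, string extension, List<string> results)
    {
        WIN32_FIND_DATA data;
        IntPtr handle = FindFirstFile(Path.Combine(dir, "*"), out data);
        if (handle == INVALID_HANDLE_VALUE)
            return;
        try
        {
            do
            {
                if (data.cFileName == "." || data.cFileName == "..")
                    continue;
                string full = Path.Combine(dir, data.cFileName);
                if ((data.dwFileAttributes & FILE_ATTRIBUTE_DIRECTORY) != 0)
                    Scan(full, extension, results);
                else if (full.EndsWith(extension, StringComparison.OrdinalIgnoreCase))
                    results.Add(full);
            } while (FindNextFile(handle, out data));
        }
        finally
        {
            FindClose(handle);
        }
    }
}
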
Maybe, however I think .NET may be pulling in things I simply don't need.

I would be surprised if it was accounting for even 10% overhead.
It shouldn't be that slow. I know the file system is all over the place,
but the amount of data for 100,000 files is only about 2 MB worth of
filenames. But maybe Windows is doing a good job of the search and it's
just how the table is laid out.

You can't just consider the total amount of data. In fact, for the kind
of search you're doing, the total amount of data is probably the
least-relevant piece of information. Most important is how the data is
distributed on the disk, and how many seeks it takes to get at it all. It
all depends on how the directory data is laid out. If it's not all in one
place, seeks and cache misses are going to really slow things down.
I'm certain the user won't change things while they are using the tool.

Then why not just run your search in the background, using the most recent
previous results until you have updated the data? The user will get
nearly 100% functionality during that window after startup before the
search has completed, without causing a delay in the application starting
up.
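
Something along these lines, as a sketch only (LoadCachedFileList and
RescanAllDirectories stand in for whatever your app already does):

// Sketch: show last session's results immediately, refresh in the background.
using System.Collections.Generic;
using System.Threading;

class StartupSearch
{
    List<string> currentFiles;            // the list the UI filters against
    readonly object sync = new object();

    public void Start()
    {
        // 1. Load the previous results right away -- no startup delay.
        currentFiles = LoadCachedFileList();

        // 2. Kick off the full rescan on a worker thread.
        Thread worker = new Thread(RefreshInBackground);
        worker.IsBackground = true;
        worker.Start();
    }

    void RefreshInBackground()
    {
        List<string> fresh = RescanAllDirectories();
        lock (sync)
        {
            currentFiles = fresh;         // swap in the up-to-date list
        }
    }

    // Placeholders for the application's existing cache/scan code.
    List<string> LoadCachedFileList() { return new List<string>(); }
    List<string> RescanAllDirectories() { return new List<string>(); }
}
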
In this design it is necessary. Have you ever used Visual Assist? It
needs to work kind of like that. The whole idea is that the fastest way
to get to something is to simply type it and see the list of
possibilities shrink as you get closer to what you want. It's an
optimal-design thing.

I have never heard of "virtual assist". I found this link using Google:
http://virtualassistoncall.com/ I don't see what searching 100,000 files
has to do with that though. Your explanation also doesn't actually give
me any insight into why it is that searching 100,000 files when you start
your application is necessary. "Optimal design" and "search 100,000 files
on startup" don't strike me as correlated, at least not at first glance.
This is a good idea. I would probably need to run the service all the
time. There is a way to monitor file-system changes; I may give that a
shot. I hope it doesn't slow the computer down much, like Visual Studio
does. Actually, I noticed Google Desktop Search does this.

Yes, all of the various desktop search utilities have to do something
similar. Of course, it begs the question as to whether it really makes
sense to have a half-dozen different applications installed, all of which
are trying to do some sort of indexing on the disk. :) But it's
technically possible.

Pete
 
Peter said:
Well, there's always p/invoke. If you can find a way in the native
Win32 API, you can use p/invoke to call outside of managed code.


I would be surprised if it was accounting for even 10% overhead.


You can't just consider the total amount of data. In fact, for the kind
of search you're doing, the total amount of data is probably the
least-relevant piece of information. Most important is how the data is
distributed on the disk, and how many seeks it takes to get at it all.
It all depends on how the directory data is laid out. If it's not all
in one place, seeks and cache misses are going to really slow things down.


Then why not just run your search in the background, using the most
recent previous results until you have updated the data? The user will
get nearly 100% functionality during that window after startup before
the search has completed, without causing a delay in the application
starting up.


I have never heard of "virtual assist". I found this link using Google:
http://virtualassistoncall.com/ I don't see what searching 100,000
files has to do with that though. Your explanation also doesn't
actually give me any insight into why it is that searching 100,000 files
when you start your application is necessary. "Optimal design" and
"search 100,000 files on startup" don't strike me as correlated, at
least not at first glance.

Sorry, I meant Visual Assist.

http://www.wholetomato.com/
 
Sorry, I meant Visual Assist.

http://www.wholetomato.com/

Actually, that's what you wrote. I must be getting dyslexic in my old
age. :)

Sorry for the confusion. Still, having seen that, it doesn't seem a lot
different from Visual Studio's built-in "Intellisense" (though it does add
to it). VS doesn't have to search through 100,000 files to accomplish
what it does, nor does it prevent you from using VS before Intellisense
has caught up (in the rare cases when it hasn't...I usually only see the
"still building database" message on my really slow PIII 550 MHz PC anyway).

I do remain unconvinced that you're going to make the search of 100,000
files much faster (unless the code you're using now is *really* broken
somehow, and it doesn't sound like it is). It's more likely that you'll
have success with an approach that simply moves the search out of the
user's way so it's not a bottleneck (whether that's doing it in a
different process, or doing it in the background the way Intellisense does).

And of course, I still wonder at the need to do the search in the first
place, but without more details I can only take as granted that you're
correct about that need.

Pete
 
Can you split the app up into two parts?



1) The part that uses the files



2) A service that is in charge of keeping track of new files? This service
can always run in the background and use the file system watcher to be
triggered on new/changed files even if the other app is not running.



The 1st app can query the second one when it starts for a list of any
changed files and will only need to go directly to those files on startup
instead of having to search all of them.
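
As a sketch, the watcher half of that service might look something like
this (the path is a placeholder, and a real service would hand the change
list to the main app via remoting, a named pipe, or a shared file):

// Sketch of the watcher side of such a background service.
using System;
using System.Collections.Generic;
using System.IO;

class WatcherService
{
    static readonly List<string> changedFiles = new List<string>();

    static void Main()
    {
        FileSystemWatcher watcher = new FileSystemWatcher(@"C:\MyProject");
        watcher.IncludeSubdirectories = true;
        watcher.NotifyFilter = NotifyFilters.FileName | NotifyFilters.LastWrite;

        watcher.Created += OnChanged;
        watcher.Changed += OnChanged;
        watcher.Deleted += OnChanged;
        watcher.Renamed += OnRenamed;

        watcher.EnableRaisingEvents = true;
        Console.ReadLine();   // stand-in for the service's run loop
    }

    static void OnChanged(object sender, FileSystemEventArgs e)
    {
        lock (changedFiles) { changedFiles.Add(e.FullPath); }
    }

    static void OnRenamed(object sender, RenamedEventArgs e)
    {
        lock (changedFiles) { changedFiles.Add(e.FullPath); }
    }
}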
 
Ray said:
Can you split the app up into two parts?



1) The part that uses the files



2) A service that is in charge of keeping track of new files? This service
can always run in the background and use the file system watcher to be
triggered on new/changed files even if the other app is not running.



The 1st app can query the second one when it starts for a list of any
changed files and will only need to go directly to those files on startup
instead of having to search all of them.

This is probably what I'm going to end up doing. Thanks.
 
Peter said:
Actually, that's what you wrote. I must be getting dyslexic in my old
age. :)

Sorry for the confusion. Still, having seen that, it doesn't seem a lot
different from Visual Studio's built-in "Intellisense" (though it does add
to it). VS doesn't have to search through 100,000 files to accomplish
what it does, nor does it prevent you from using VS before Intellisense
has caught up (in the rare cases when it hasn't...I usually only see the
"still building database" message on my really slow PIII 550 MHz PC anyway).

Intellisense in Visual Assist is nice, but that's not what I'm talking
about. You type Alt+O, a partial name of the file you want, then Enter,
and bam, you're at that file, rather than navigating a slow tree. A
mouse movement takes 1-4 seconds to navigate each branch, and that's if
you don't have to scroll (which is extremely inefficient) and read 20
different names 3 levels deep. Maybe I'm slow at navigating trees, but
it takes me about 20-30 seconds to find a file in VS.

A keystroke takes (at 60 words a minute) 1 second or less, and normally
you only need to type 3 or 4 letters to get what you want. So 7
keystrokes/seconds in total is a lot more efficient. (Visual Assist
takes me on average 3 seconds to get to a file.)

I guess it's a case of you had to be there.
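
The filtering itself is cheap once everything is cached in memory;
something like this (a sketch):

// Sketch: narrowing the cached list as the user types is fast even with
// 100,000 entries.
using System;
using System.Collections.Generic;
using System.IO;

class TypeToFind
{
    // Returns every cached file whose name contains the typed fragment.
    public static List<string> Filter(List<string> allFiles, string typed)
    {
        List<string> matches = new List<string>();
        foreach (string path in allFiles)
        {
            string name = Path.GetFileName(path);
            if (name.IndexOf(typed, StringComparison.OrdinalIgnoreCase) >= 0)
                matches.Add(path);
        }
        return matches;
    }
}
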
I do remain unconvinced that you're going to make the search of 100,000
files much faster (unless the code you're using now is *really* broken
somehow, and it doesn't sound like it is). It's more likely that you'll
have success with an approach that simply moves the search out of the
user's way so it's not a bottleneck (whether that's doing it in a
different process, or doing it in the background the way Intellisense does).

And of course, I still wonder at the need to do the search in the first
place, but without more details I can only take as granted that you're
correct about that need.

Pete

I came across this today, thought it may be useful.

http://video.google.com/videoplay?docid=-6856727143023456694&hl=en
 

At 90 minutes, no. Not useful at all.

From your previous description, you seem to be talking about an indexed
file search. Which is fine, and a common enough practice. But you don't
get that for free. If you intend to index even just the filenames of
100,000 files, you have to read that data from the disk, and that will
take a non-trivial amount of time.

Note that all along, I have assumed that in spite of my failure to
understand the exact application you're trying to write, you really do
need to do what you're asking about. I haven't been questioning that. I
have only questioned whether your expectation that you can do it in some
unnoticeable amount of time is a reasonable one.

I could be wrong about that, but the last post in this thread was four
weeks ago. That seems like plenty of time to have explored your options
and found a faster way to get the file information, if one exists. So
that begs the question: have you in fact found any significant improvement
on the file i/o aspect of your problem?

Absent any progress on the performance aspect, I stand by my previous
suggestion: the way to do this is to do it as other indexing
implementations handle it. Index in the background, and offer the user
whatever functionality you can, however minimal at any given moment, with
whatever has been indexed so far. Once an initial index has been built,
this will provide nearly perfect results immediately, and should only take
a few minutes or so (maybe 10 or 15 if you throttle your indexing so that
the user's other work isn't impeded noticeably) for the index to be
brought up to date.
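
As a rough illustration only (every name here is a placeholder), the
throttling could be as simple as sleeping between directories while
building up a list the UI can query at any moment:

// Sketch: a throttled background indexer whose partial results are
// usable while indexing is still in progress.
using System.Collections.Generic;
using System.IO;
using System.Threading;

class BackgroundIndexer
{
    readonly List<string> index = new List<string>();   // partial results
    readonly object sync = new object();

    public void IndexTree(string root, string pattern)
    {
        Queue<string> pending = new Queue<string>();
        pending.Enqueue(root);

        while (pending.Count > 0)
        {
            string dir = pending.Dequeue();

            foreach (string file in Directory.GetFiles(dir, pattern))
                lock (sync) { index.Add(file); }

            foreach (string sub in Directory.GetDirectories(dir))
                pending.Enqueue(sub);

            Thread.Sleep(20);   // throttle so other work isn't impeded
        }
    }

    // The UI can query whatever has been indexed so far.
    public List<string> Snapshot()
    {
        lock (sync) { return new List<string>(index); }
    }
}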

Pete
 