Best way to read from directory with many files


nickdu

If I have a multi-threaded application which processes files from a
directory, what might be the best way to divvy those files up among multiple
threads? I don't want the threads to be colliding on the same files. Once a
thread is processing a file I need to make sure another thread doesn't start
processing it. Also, if there are a million or so files in the directory I'm
thinking that Directory.GetFiles() might not be the best way to access that
list. Can I open the directory as a file and read the entries that way?
--
Thanks,
Nick

(e-mail address removed)
remove "nospam" change community. to msn.com
 
Hi Nick,

Thank you for using Microsoft Managed Newsgroup Service, I'm Zhi-Xin Ye,
it's my pleasure to work with you on this issue.

Peter has provided a good suggestion on this issue.

If the directory contains a large number of files, Directory.GetFiles() can
take a long time because it returns only after the whole file name list has
been built. In that case, you can call the
FindFirstFile/FindNextFile/FindClose APIs to enumerate the files one at a
time and, inside the loop, call ThreadPool.QueueUserWorkItem() to queue each
file to a thread-pool thread for processing.
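
The enumerate-and-queue loop described above can be sketched as follows. This is only an illustration, written in Python because the thread itself contains no code. Here os.scandir stands in for FindFirstFile/FindNextFile (on Windows, CPython implements it with those calls), and ThreadPoolExecutor.submit stands in for ThreadPool.QueueUserWorkItem(). The names process_file and process_directory are hypothetical.

```python
import os
from concurrent.futures import ThreadPoolExecutor


def process_file(path):
    """Hypothetical per-file work; here it just reports the file size."""
    return path, os.path.getsize(path)


def process_directory(directory, workers=4):
    """Stream directory entries and hand each file to exactly one pool task."""
    futures = []
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # os.scandir yields entries lazily instead of building the full list.
        with os.scandir(directory) as entries:
            for entry in entries:
                if entry.is_file():
                    # Only this loop hands out paths, so no two tasks
                    # ever receive the same file.
                    futures.append(pool.submit(process_file, entry.path))
    # Leaving the pool's with-block waits for all queued work to finish.
    return dict(f.result() for f in futures)
```

Because a single enumerating thread is the only place where paths are handed out, the workers cannot collide on a file. With a very large directory the list of pending tasks grows as well, so a real implementation would probably cap how many items are queued at once.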

Documents for your references:

FindFirstFile
http://www.pinvoke.net/default.aspx/kernel32/FindFirstFile.html

ThreadPool.QueueUserWorkItem Method
http://msdn.microsoft.com/en-us/library/system.threading.threadpool.queueuserworkitem.aspx

If anything is unclear or you have any concerns, please feel free to let me
know.

Have a great day!

Best Regards,
Zhi-Xin Ye
Microsoft Managed Newsgroup Support Team

Delighting our customers is our #1 priority. We welcome your comments and
suggestions about how we can improve the support we provide to you. Please
feel free to let my manager know what you think of the level of service
provided. You can send feedback directly to my manager at:
(e-mail address removed).

==================================================
Get notification to my posts through email? Please refer to
http://msdn.microsoft.com/en-us/subscriptions/aa948868.aspx#notifications.

Note: MSDN Managed Newsgroup support offering is for non-urgent issues
where an initial response from the community or a Microsoft Support
Engineer within 2 business days is acceptable. Please note that each
follow-up response may take approximately 2 business days as the support
professional working with you may need further investigation to reach the
most efficient resolution. The offering is not appropriate for situations
that require urgent, real-time or phone-based interactions. Issues of this
nature are best handled working with a dedicated Microsoft Support Engineer
by contacting Microsoft Customer Support Services (CSS) at
http://msdn.microsoft.com/en-us/subscriptions/aa948874.aspx
==================================================
This posting is provided "AS IS" with no warranties, and confers no rights.
 
Thanks Peter.

Nothing has to be the way it currently is but based on some loose
requirements the current design seems appropriate. Let me explain the
application a bit more which hopefully will shed some more light on the
subject.

We've got files generated from several (maybe a hundred or so) servers.
These files are ETL files which contain trace information. We've got a
processing server which processes these files. Processing involves using the
Win32 APIs OpenTrace()/ProcessTrace()/CloseTrace() (via PInvoke) to gain
access to the data in these trace files and then inserting appropriate rows
into a trace database. The database is on a different server.

So as you can see we've got a bunch of machines on one end generating
traces, our processing server in the middle, and a DB at the other end. We
could have our processing engine support socket connections allowing the
servers generating the trace to send the data directly to the processing
process. However, this would mean our processing engine would always have to
be online and that's not desirable. So asynchronous behavior is one
requirement. We could use queuing, which is kind of what we have anyway, but
I would rather go with the file system queuing as opposed to MSMQ or MQSeries.

Our application is mostly I/O bound. The large number of files I mentioned
should only occur when our processing engine has been down for some period
of time. However, when that happens I don't want to pay a huge cost if the
file system APIs I'm using don't behave well under those conditions. For
instance, we often see Windows Explorer hang for minutes trying to display a
folder with a huge number of files. A more efficient pull-model interface
like IEnumFile() (or something like that) might make more sense in this
case. So I was just wondering if opening the directory myself and
enumerating the entries might be more efficient. Of course, as files are
processed and removed this might become unmanageable.
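
The pull-model idea can be sketched like this. It is a rough illustration in Python, not a Windows API. The function name next_batch and the batch size are made up for the example. It asks the directory for only as many entries as one batch needs and then stops enumerating.

```python
import itertools
import os


def next_batch(directory, batch_size):
    """Return up to batch_size file paths without listing the whole directory."""
    with os.scandir(directory) as entries:
        files = (e.path for e in entries if e.is_file())
        # islice stops pulling from the iterator once the batch is full.
        return list(itertools.islice(files, batch_size))
```

Calling it again after the previous batch has been processed and removed starts a fresh enumeration, so nothing depends on one long-lived enumeration staying valid while the directory changes underneath it.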

I've also run into issues when opening the files that show up in the
directory we're processing. Sometimes the process creating the file is not
done with it yet, so the processing engine encounters an error trying to open
it exclusively. I'm not sure if there is a prescribed way to handle this type
of workload. I believe some Unix processes (sendmail, maybe) work this way,
processing files that show up in a directory.
--
Thanks,
Nick

(e-mail address removed)
remove "nospam" change community. to msn.com
 
Hi Nick,

If the number of files in the directory is very large, it's more efficient to
enumerate them using the FindFirstFile/FindNextFile/FindClose APIs. They
give you access to each file path as it is discovered, rather than
forcing you to wait until all files have been found.

> Sometimes the process creating the file is not
> done with it yet so the processing engine encounters an error trying
> to open it exclusively.

You can open the file in a try-catch block, so that when a file is in use
the exception is caught without interrupting your program.
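
A minimal sketch of that try-catch approach, again in Python for illustration only. The helper names try_process and process_with_retry are hypothetical. The sketch assumes a file still held by its writer surfaces as PermissionError (which is how CPython reports a Windows sharing violation) and also treats a file that has vanished as "skip for now".

```python
def try_process(path, handler):
    """Return True if the file was opened and handled, False if it was busy."""
    try:
        fh = open(path, "rb")
    except (PermissionError, FileNotFoundError):
        # Still held by the writer, or already gone: skip it for now.
        return False
    with fh:
        handler(fh)
    return True


def process_with_retry(paths, handler):
    """Process what can be opened now; return the paths to retry later."""
    return [p for p in paths if not try_process(p, handler)]
```

The returned list can simply be fed back in on the next pass, so a file that was still being written gets picked up once its producer has closed it.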

If you have any questions or concerns, please feel free to let me know.

Best Regards,
Zhi-Xin Ye
Microsoft Managed Newsgroup Support Team

 