Parsing Challenge: WORD & eml files

  • Thread starter: Jerry

Here's a challenge where I am almost at the point of cutting and pasting - if
it wasn't for the fact that I have about 5000 files to process.
Someone set up a directory structure with disease names like Cancer, Stroke,
etc., and each directory holds a number of files in either Word or .eml
format.
I need to open each file and go through it, parsing out the reference,
volume, text, title, etc. The parsing is going to be difficult, but the
rest of the job (going through the tree, grabbing the directory name,
determining the file type, opening the file, then grabbing a line and
parsing it into data) is somewhat beyond me with the Word format and
especially the .eml format (Outlook Express).

Anyone have any hints, tips or software? (grin)

Jerry
 
Hi Jerry,

This could be fun. There's VBA code that walks through a directory tree
finding files, but for a one-off job I can never bother to look it up.
Instead I just open a command prompt and use a command like this

DIR "D:\Top level folder\*.eml" /B /S > "D:\Myfolder\EML_List.txt"

to create a text file in which each line contains the path and name of
one eml file. Then I just have my code read this file line by line and
deal with each file it finds. (And the same for Word, using *.doc
instead of *.eml). Something like this air code:

Dim lngFN As Long
Dim strFN As String
Dim strFileSpec As String

lngFN = FreeFile()
strFN = "D:\Myfolder\EML_List.txt"
Open strFN For Input As #lngFN 'sequential text read; "For Read" is not a valid mode

Do Until EOF(lngFN)
    Line Input #lngFN, strFileSpec 'full path of one file
    'parse the file
    ...
    ...
Loop 'next file

Close #lngFN
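
Inside that loop you'll want the disease name and the file type from each line. Here is a minimal sketch, assuming every line of the list is a full path as produced by DIR /B /S and that the disease is the name of the folder directly above the file. ParentFolderName and FileExtension are names I made up for illustration, not built-in functions.

```vba
'Hypothetical helpers: pull the folder name and file type out of a full path.
Function ParentFolderName(ByVal strFileSpec As String) As String
    Dim lngLast As Long
    Dim lngPrev As Long

    lngLast = InStrRev(strFileSpec, "\")              'slash before the file name
    If lngLast < 2 Then Exit Function                 'no folder part: return ""
    lngPrev = InStrRev(strFileSpec, "\", lngLast - 1) 'slash before the folder name
    ParentFolderName = Mid$(strFileSpec, lngPrev + 1, lngLast - lngPrev - 1)
End Function

Function FileExtension(ByVal strFileSpec As String) As String
    Dim lngDot As Long

    lngDot = InStrRev(strFileSpec, ".")
    'Only count a dot that comes after the last backslash
    If lngDot > InStrRev(strFileSpec, "\") Then
        FileExtension = LCase$(Mid$(strFileSpec, lngDot + 1))
    End If
End Function
```

With a line such as "D:\Top level folder\Cancer\Paper1.EML", ParentFolderName returns "Cancer" and FileExtension returns "eml", so a Select Case on the extension can decide which parser to call.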


The actual parsing can be anywhere from quite simple (if the information
in the files is laid out in an absolutely consistent structure) to
practically impossible (if you're trying to extract precise information
from discursive text). If the information you need is in attachments to
the email messages in the EML files, that's an extra layer of
complication.

In general, EML files are best treated as text files. Open some with a
text editor such as Notepad and you'll see the simple structure of the
email headers, followed (hopefully) by plain text contents.
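
To show what "treat it as text" might look like, here is a rough sketch that reads a whole file into a string and splits it at the first blank line, which is where the headers end in a standard email message. ReadTextFile and SplitEml are hypothetical names of mine. It assumes the body is plain text; encoded bodies (base64, quoted-printable) or multipart messages would need more work.

```vba
'Hypothetical helper: read an entire file into one string.
Function ReadTextFile(ByVal strPath As String) As String
    Dim lngFN As Long
    Dim strBuf As String

    lngFN = FreeFile()
    Open strPath For Binary Access Read As #lngFN
    strBuf = Space$(LOF(lngFN))      'buffer the size of the file
    Get #lngFN, , strBuf             'fill the buffer in one read
    Close #lngFN
    ReadTextFile = strBuf
End Function

'Hypothetical helper: separate the header block from the body.
Sub SplitEml(ByVal strRaw As String, _
             ByRef strHeaders As String, ByRef strBody As String)
    Dim lngPos As Long

    strRaw = Replace(strRaw, vbCrLf, vbLf)   'normalise line endings
    lngPos = InStr(strRaw, vbLf & vbLf)      'first blank line ends the headers
    If lngPos = 0 Then
        strHeaders = strRaw                  'no blank line: headers only
        strBody = ""
    Else
        strHeaders = Left$(strRaw, lngPos - 1)
        strBody = Mid$(strRaw, lngPos + 2)
    End If
End Sub
```

In the main loop this would be called as SplitEml ReadTextFile(strFileSpec), strHeaders, strBody, with the two string variables declared beforehand.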

For anything but the very simplest text parsing, the best tool in my
experience is a regular expression engine (like a wildcard search on
steroids). There's quite a good one included in VBScript (the RegExp
object). If you search the web and the Windows Scripting help files
you'll find information on using it.
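
As a taste of the RegExp object, here is a small sketch that pulls one header value (Subject, From and so on) out of a block of header text. HeaderValue is a name I invented, and the pattern is only an assumption about what you need. It does not handle headers folded over several lines, and strName must not contain regular expression special characters.

```vba
'Hypothetical helper: return the value of one "Name: value" header line.
Function HeaderValue(ByVal strHeaders As String, _
                     ByVal strName As String) As String
    Dim objRE As Object
    Dim colMatches As Object

    Set objRE = CreateObject("VBScript.RegExp") 'late bound, no reference needed
    objRE.IgnoreCase = True      'header names are not case sensitive
    objRE.MultiLine = True       '^ matches at the start of every line
    objRE.Global = False         'stop at the first match
    'Name, colon, optional blanks, then capture the rest of the line
    objRE.Pattern = "^" & strName & ":[ \t]*([^\r\n]*)"

    Set colMatches = objRE.Execute(strHeaders)
    If colMatches.Count > 0 Then
        HeaderValue = Trim$(colMatches(0).SubMatches(0))
    End If
End Function
```

For example, HeaderValue("From: a@b.com" & vbCrLf & "Subject: Stroke study", "subject") returns "Stroke study". If your documents carry labelled lines of their own, the same idea should work on the body text with a different pattern.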

As for the Word documents, you'll need to learn to automate Word to open
each document, parse it and return your data. A good place to start
looking is http://word.mvps.org.
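
To give an idea of the automation side, here is a bare-bones sketch that opens a document read-only and returns its text as one string. WordDocText is a hypothetical name and the example path is made up. It has no error handling, so a damaged or password-protected file would stop the run.

```vba
'Hypothetical helper: return the full text of one Word document.
Function WordDocText(ByVal objWord As Object, _
                     ByVal strPath As String) As String
    Dim objDoc As Object

    Set objDoc = objWord.Documents.Open(FileName:=strPath, _
                                        ReadOnly:=True, _
                                        AddToRecentFiles:=False)
    WordDocText = objDoc.Content.Text   'paragraphs are separated by vbCr
    objDoc.Close SaveChanges:=False
End Function

Sub DemoWordText()
    Dim objWord As Object

    'Start Word once and reuse it for every file
    Set objWord = CreateObject("Word.Application")
    'Made-up path, for illustration only
    Debug.Print Left$(WordDocText(objWord, _
        "D:\Top level folder\Cancer\Example.doc"), 200)
    objWord.Quit
End Sub
```

With about 5000 files, creating the Word.Application object once outside the loop and quitting it at the end should be much faster than starting Word for each document.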
 