PDF, DOC, RTF File parser

Harry

I need to read resumes from PDF, DOC, RTF, and text files and fill in
the relevant fields in a database.
My application is based on DotNetNuke (ASP.NET).
Can anyone point me to something that is already available?
 
What exactly are you trying to do?

If you just want to convert these different file formats into plain-text
files that can be manipulated, that's possible (but complicated). The DOC
format is proprietary, so you'd have to programmatically open the document in
Word and either copy the document's text to the clipboard or
programmatically do a "Save As" to a plain-text file. You can do this using
automation (VBA), but you'll need to have Word running on the server. You
can convert an RTF file by loading it into an RTF control and then
retrieving the plain text from that control. I'm not sure about PDF, but I
believe there are third-party components available for extracting text from
PDF files.

If you want to have the program automagically interpret the relevant
information and fill it into the correct database field without human
intervention, good luck -- computers just aren't very good at parsing
natural languages. Resumes will be particularly hard to parse because the
information may be structured in any number of ways and they tend to be
written in short sentence fragments. If you really want to try, do some
research on context-free (CF) parsers. Two good, recent textbooks on the
subject are Jurafsky & Martin, "Speech and Language Processing," and Allen,
"Natural Language Understanding." (Both are available from Amazon.com.)

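To give a feel for how little comes out easily, here is a minimal Python sketch. It is not a context-free parser, just regular-expression matching on text that has already been converted to plain text. The function name, the patterns, and the sample resume are all made up for illustration, and the phone pattern assumes a North American number.

```python
import re

# Deliberately simple patterns (assumptions for this sketch); real resumes
# will break them in many ways.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
PHONE_RE = re.compile(r"\(?\d{3}\)?[\s.-]?\d{3}[\s.-]?\d{4}")


def extract_contact_fields(text):
    """Return the first email and phone-like string found, or None for each."""
    email = EMAIL_RE.search(text)
    phone = PHONE_RE.search(text)
    return {
        "email": email.group(0) if email else None,
        "phone": phone.group(0) if phone else None,
    }


# Hypothetical resume text, already converted to plain text.
sample = """Jane Doe
jane.doe@example.com | (555) 010-2030
Sr. developer, Acme Corp, 2001-2004. Built intranet apps.
"""
fields = extract_contact_fields(sample)
print(fields)
```

The email address and phone number come out, but the job title, employer, and dates on the last line of `sample` do not. That free-form part is where the real parsing problem starts.
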
A much, much better alternative would be to ask people to submit their
resume information through a structured format -- such as by filling in
fields on a Web form -- or to hire clerical help to take regular resumes and
copy/paste the information into the database.
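
As a rough sketch of the structured route, assuming the form arrives as a simple dictionary of strings: each field maps straight onto a database column, so no parsing is needed. The table layout, the field names, and the use of an in-memory SQLite database are all assumptions for illustration, written in Python rather than ASP.NET.

```python
import sqlite3

# Hypothetical field names; a real form would carry whatever the database needs.
FIELDS = ("full_name", "email", "phone", "most_recent_title")


def save_resume(conn, form):
    """Insert one submitted form (a dict of strings) as a row; return the row id."""
    missing = [f for f in FIELDS if not form.get(f)]
    if missing:
        raise ValueError("missing fields: " + ", ".join(missing))
    cur = conn.execute(
        "INSERT INTO resumes (full_name, email, phone, most_recent_title) "
        "VALUES (?, ?, ?, ?)",
        [form[f] for f in FIELDS],
    )
    conn.commit()
    return cur.lastrowid


conn = sqlite3.connect(":memory:")  # in-memory database, for illustration only
conn.execute(
    "CREATE TABLE resumes (id INTEGER PRIMARY KEY, full_name TEXT, "
    "email TEXT, phone TEXT, most_recent_title TEXT)"
)
row_id = save_resume(conn, {
    "full_name": "Jane Doe",
    "email": "jane.doe@example.com",
    "phone": "(555) 010-2030",
    "most_recent_title": "Senior Developer",
})
```

Because the applicant labels each value, the only work left on the server is validation, such as rejecting a submission with an empty field.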

--Robert Jaccobson
 
Robert,
Thanks for the detailed reply.
I am looking for the second option -- context-free (CF) parsers -- and if
that does not work out, we can go for the structured format.
Is there any third-party resume parser available that I can use in an
ASP.NET application?
 
I'm not aware of any such parsers, so you'll have to roll your own. Let me
reemphasize, though, that I think doing so would be a waste of effort --
parsers do not cope well with English documents, especially
specialized documents like resumes.
 