Code to Extract Text from PDF

SteveB

I have also posted this question in the Visual Basic 2005 and Visual
Basic .Net 2005 discussion groups.

Hi. I am developing an application/web page with VB.Net that will
populate a SQL database from text extracted from PDF documents.
However, I am having a difficult time finding or developing the
appropriate code to convert the PDF streams into text strings. Has
anyone developed code to convert PDFs to text?

I was able to write a Perl script that calls a PDF-to-text
conversion application, but I am having difficulty writing a
similar shell command in VB. Any ideas?

Once I have the text strings, I can parse the data easily into the
SQL database tables.
 
1. Get this to convert PDF to text:
ftp://ftp.foolabs.com/pub/xpdf/xpdf-3.02pl2-win32.zip
2. Use this sub:
Sub Pdf2Txt(ByVal options As String, ByVal pdfFile As String, ByVal txtFile As String)
    Dim arguments As String = options & " " & pdfFile & " " & txtFile
    'Make sure to provide the path with the pdfFile and the txtFile
    System.Diagnostics.Process.Start("pdftotext.exe", arguments)
End Sub
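
If the goal is a text string rather than a .txt file, pdftotext can also write to standard output when the output file name is given as "-". Here is a minimal sketch of that variant, assuming pdftotext.exe from the Xpdf package is installed. The names PdfToString and exePath are hypothetical and not part of Xpdf. Paths are wrapped in double quotes so that folders with spaces survive.

```vbnet
Imports System
Imports System.Diagnostics

Module PdfText
    ' Hypothetical helper: runs pdftotext and returns the extracted text.
    ' exePath is assumed to be the full path to pdftotext.exe.
    Function PdfToString(ByVal exePath As String, ByVal options As String, ByVal pdfFile As String) As String
        Dim psi As New ProcessStartInfo()
        psi.FileName = exePath
        ' Double quotes keep a path with spaces together; "-" sends the text to stdout.
        psi.Arguments = options & " """ & pdfFile & """ -"
        psi.UseShellExecute = False          ' required for stream redirection
        psi.RedirectStandardOutput = True
        psi.CreateNoWindow = True

        Using proc As Process = Process.Start(psi)
            ' Read before waiting so a full output buffer cannot block the child process.
            Dim text As String = proc.StandardOutput.ReadToEnd()
            proc.WaitForExit()
            Return text
        End Using
    End Function
End Module
```

A call might look like PdfToString("D:\xpdf-win32\pdftotext.exe", "-layout", pdfPath), where pdfPath is whatever file you want to convert.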
 
I have tried many free libraries and had mixed results.
The only reliable avenue was the Aspose library.

http://www.aspose.com/categories/fi...aspose.pdf.kit-for-.net-and-java/default.aspx

My steps using Aspose:
1. Initialize the library and open the file.
2. Loop through each page.
3. Collect the page data, massage it, and post it to the database.

A typical PDF file for me has 1,500 pages with no forms and 20+ elements
per page to extract.
The average time per document to extract the data, massage it, and pass it
to the database is 10-15 seconds.
Aspose can read each document in 3 seconds total.

The downside is that it costs money, but it has been a great investment:
it has served me well on multiple projects.
 

I tried your suggestion and this app works great from a command line.
However, when I try to call pdftotext as you suggested, I keep getting
this exception:

System.ComponentModel.Win32Exception was unhandled by user code
ErrorCode=-2147467259
Message="The system cannot find the file specified"
Source="System"
StackTrace:
at System.Diagnostics.Process.StartWithShellExecuteEx(ProcessStartInfo startInfo)
at System.Diagnostics.Process.Start()
at System.Diagnostics.Process.Start(ProcessStartInfo startInfo)
at System.Diagnostics.Process.Start(String fileName)
at _Default.Pdf2Txt(String options, String pdffile, String textfile) in D:\documents and settings\srbray\My Documents\Visual Studio 2005\Websites\RegCC\FRB.aspx.vb:line 48
at _Default.Submit1_Click(Object sender, EventArgs e) in D:\documents and settings\srbray\My Documents\Visual Studio 2005\Websites\RegCC\FRB.aspx.vb:line 27
at System.Web.UI.WebControls.Button.OnClick(EventArgs e)
at System.Web.UI.WebControls.Button.RaisePostBackEvent(String eventArgument)
at System.Web.UI.WebControls.Button.System.Web.UI.IPostBackEventHandler.RaisePostBackEvent(String eventArgument)
at System.Web.UI.Page.RaisePostBackEvent(IPostBackEventHandler sourceControl, String eventArgument)
at System.Web.UI.Page.RaisePostBackEvent(NameValueCollection postData)
at System.Web.UI.Page.ProcessRequestMain(Boolean includeStagesBeforeAsyncPoint, Boolean includeStagesAfterAsyncPoint)

This is my code:

Protected Sub Submit1_Click(ByVal sender As Object, ByVal e As System.EventArgs) Handles Submit1.Click

    Dim Path As String = System.IO.Path.GetDirectoryName(File1.PostedFile.FileName)
    Dim FileName As String
    Dim MyText() As String
    Dim NewFileName As String
    Dim DataPath As String = "D:\Documents and Settings\srbray\My Documents\Visual Studio 2005\WebSites\RegCC\Data\"
    Dim ArchivePath As String = "D:\Documents and Settings\srbray\My Documents\Visual Studio 2005\WebSites\RegCC\Archive\"
    Dim MMM As String = MonthName(Month(Now()), True)
    Dim YYYY As String = Year(Now())

    'Create new archive directory.
    My.Computer.FileSystem.CreateDirectory(ArchivePath & YYYY & "\" & MMM)
    ArchivePath = ArchivePath & YYYY & "\" & MMM & "\"

    System.IO.Directory.SetCurrentDirectory(DataPath)

    If Not File1.PostedFile Is Nothing And File1.PostedFile.ContentLength > 0 Then
        For Each oneFile As String In My.Computer.FileSystem.GetFiles(Path, FileIO.SearchOption.SearchTopLevelOnly, "*.pdf")
            FileName = System.IO.Path.GetFileName(oneFile)
            MyText = Split(FileName, ".")
            NewFileName = MyText(0) & ".txt"
            movepdffile(oneFile, DataPath & FileName)
            Pdf2Txt("-layout", DataPath & FileName, DataPath & NewFileName)
        Next oneFile
    Else
        MsgBox("Please select the file(s) to upload.")
    End If
    'Insert code here to:
    'Convert .pdf documents into .txt documents with additional code to
    'import data into the Float Reg CC database.

    'Move .pdf files from working directory to archive directory and delete .txt files.
    'My.Computer.FileSystem.MoveFile(DataPath & FileName, ArchivePath & FileName, True)

    'My.Computer.FileSystem.DeleteFile(DataPath & NewFileName)

End Sub

Sub Pdf2Txt(ByVal options As String, ByVal pdffile As String, ByVal textfile As String)
    Dim exe As String = "D:\xpdf-win32\pdftotext.exe"
    Dim cmd As String = ("'" & exe & "' " & options & " '" & pdffile & "' '" & textfile & "'")
    MsgBox(cmd)
    System.Diagnostics.Process.Start(cmd)
End Sub

Sub movepdffile(ByVal origin As String, ByVal destination As String)
    Try
        My.Computer.FileSystem.MoveFile(origin, destination, False)
    Catch Exc As Exception
        MsgBox("Error: " & Exc.Message)
    End Try
    MsgBox("Move is successful.")
End Sub

I believe I can make this work, but I am missing something minor....
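
One likely cause, going by the stack trace: the single-argument Process.Start(String fileName) overload treats the entire string as the name of the file to launch, so the program path, the options, and both file names are looked up together as one file name. Single quotes are also not a quoting character on the Windows command line; paths with spaces need double quotes. Below is a minimal sketch of Pdf2Txt with the program and its arguments passed separately, assuming pdftotext.exe really is at the path used in the post. The exit-code check is an added assumption (Xpdf tools normally return 0 on success), and the module name PdfConvert is hypothetical.

```vbnet
Imports System
Imports System.Diagnostics

Module PdfConvert
    Sub Pdf2Txt(ByVal options As String, ByVal pdffile As String, ByVal textfile As String)
        ' Path taken from the post; adjust to where pdftotext.exe actually lives.
        Dim exe As String = "D:\xpdf-win32\pdftotext.exe"
        ' Arguments only, with each path wrapped in double quotes.
        Dim args As String = options & " """ & pdffile & """ """ & textfile & """"

        Dim psi As New ProcessStartInfo(exe, args)
        psi.UseShellExecute = False
        psi.CreateNoWindow = True

        Using proc As Process = Process.Start(psi)
            ' Wait so the .txt file exists before any later step reads it.
            proc.WaitForExit()
            If proc.ExitCode <> 0 Then
                Throw New ApplicationException("pdftotext exited with code " & proc.ExitCode.ToString())
            End If
        End Using
    End Sub
End Module
```

With this version the call in Submit1_Click can stay as it is: Pdf2Txt("-layout", DataPath & FileName, DataPath & NewFileName).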
 