pdf to word convertor

  • Thread starter: jacky
For freeware, there's a pdftotext tool in the Ghostscript distribution. You
lose the formatting though, and it's .txt, not .doc.

If you've access to a machine running Mac OS X, TextLightning.app is
wonderful---shareware though, but it attempts to parse the location of text to
reconstruct paragraphs.

The only similar program for Windows is BCL Drake, a commercial plug-in for the
full-version of Adobe Acrobat---it does tables though, which TL doesn't.

William
(ob. discl., I was a beta tester for TextLightning.app)
 
Tanner,

Do you have any working download links for the program (Clickcat-P2H)?
The download links on the above page don't do anything except return to
the page I am already on, i.e., http://www.pdf-to-html.com/downloads.html

I'm looking for a PDF2HTML converter but what I've found with all the
ones tried, even the Adobe site, is that they convert _without_ the
graphics. That defeats the purpose. Does this one do so? I've tried
so many that I don't remember if I tried this one. Though the
extremely large dl size probably scared me away (13megs!).

It does say this: "no conversion of vector graphics to VML (lines and
vector graphics)". Are all graphics considered vector ones? And what
about this "VML"??

Thanks!
 
In message:[email protected],
fitwell said:
> I'm looking for a PDF2HTML converter but what I've found with all the
> ones tried, even the Adobe site, is that they convert _without_ the
> graphics. That defeats the purpose. Does this one do so? I've tried
> so many that I don't remember if I tried this one. Though the
> extremely large dl size probably scared me away (13megs!).

Try with pdftohtml at http://pdftohtml.sourceforge.net/. It will convert
images if you have Ghostscript installed in the system
(http://www.cs.wisc.edu/~ghost/); you can even choose the output format...
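
A minimal sketch of how the pdftohtml suggestion above could be run over a whole folder of PDFs. It assumes the `pdftohtml` executable is on the PATH and that its `-c` ("complex document") switch is the mode that keeps page graphics. The function names `build_command` and `convert_folder`, and the folder names in the usage note, are made up for illustration.

```python
import subprocess
from pathlib import Path


def build_command(pdf_path, out_dir, exe="pdftohtml"):
    """Return the argument list for converting one PDF to HTML."""
    pdf_path = Path(pdf_path)
    # Output base name: same stem as the PDF, placed in out_dir.
    target = Path(out_dir) / pdf_path.stem
    # Assumption: -c ("complex document") is the mode that keeps graphics.
    return [exe, "-c", str(pdf_path), str(target)]


def convert_folder(src_dir, out_dir, exe="pdftohtml"):
    """Convert every *.pdf in src_dir; return the PDFs that failed."""
    Path(out_dir).mkdir(parents=True, exist_ok=True)
    failed = []
    for pdf in sorted(Path(src_dir).glob("*.pdf")):
        result = subprocess.run(build_command(pdf, out_dir, exe))
        if result.returncode != 0:
            failed.append(pdf)
    return failed
```

Calling `convert_folder("tutorials", "tutorials_html")` would try each PDF in turn and hand back a list of the ones the tool rejected, so a single bad file does not stop the batch.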
 
In message:[email protected],
> Try with pdftohtml at http://pdftohtml.sourceforge.net/. It will convert
> images if you have Ghostscript installed in the system
> (http://www.cs.wisc.edu/~ghost/); you can even choose the output format...

Yes, this one I remember seeing, but to tell you the truth couldn't
make out which file to dl. Also, since it mentioned Unix on the page,
I couldn't tell if it would be a good app for me to have.

I mean, look at the language on the site:

"Pdftohtml is a tool based on the Xpdf package which translates pdf
documents into html format." I mean, couldn't they have written this
in plain English? Does console mean strictly that it's command line,
or not? That's something I've never been clear on.

And then look at this:

"Development Status: 4 - Beta
Environment: Console (Text Based)
Intended Audience: End Users/Desktop
License: GNU General Public License (GPL)
Natural Language: English
Programming Language: C++
Topic: Text Processing"

Hey, I may be a power user, but I'm a power user NEWBIE. I don't
understand either of the above bits of semi-technical language any
better than a non-power-user newbie might.

Then there are two versions to dl:

pdftohtml pdftohtml-0.36 June 23, 2003 - Download
windows binary pdftohtml-0.36 win32 June 23, 2003 - Download

Which one would be best, do you know?

I have Win98SE.

Thanks so much. Appreciate the info! I converted so many tutorials to
PDF that I need to convert back to html, but the sites have
disappeared. PDF is too slow when scanning data and it often crashes
on my system. Now that I have an easy way to index all my tutorials
and other documents, it's much easier to have them in original html
format.
 
> "Development Status: 4 - Beta
> Environment: Console (Text Based)

Means no gui (graphical user interface - no "pointing and clicking").
It's run from a dos box.

> Intended Audience: End Users/Desktop
> License: GNU General Public License (GPL)
> Natural Language: English
> Programming Language: C++
> Topic: Text Processing"
>
> Hey, I may be a power user, but I'm a power user NEWBIE. I don't
> understand either of the above bits of semi-technical language any
> better than a non-power-user newbie might.
>
> Then there are two versions to dl:
>
> pdftohtml pdftohtml-0.36 June 23, 2003 - Download
> windows binary pdftohtml-0.36 win32 June 23, 2003 - Download
>
> Which one would be best, do you know?

The second one...with win32 in the name.
 
> Means no gui (graphical user interface - no "pointing and clicking").
> It's run from a dos box.
>
> The second one...with win32 in the name.

Thanks, Tiger. So console _does_ mean command-line. And the second
one is it, eh? Okay. Will give it a whirl.

Did you read the message re ActualDrawing? I was quite impressed with
it, very much so. But I have yet to find v2.2 (that's because of the
practice of using the same file name for all versions, which makes the
task extremely difficult).

Cheers!
 
> Did you read the message re ActualDrawing? I was quite impressed with
> it, very much so. But I have yet to find v2.2 (that's because of the
> practice of using the same file name for all versions, which makes the
> task extremely difficult).

Missed that message. As I said, though it's installed, I've yet to try
it. Is the one you have the freeware version? Again, if not, I can
upload the freeware version to abf if you like.
 
No, sorry. I've bookmarked this but never dl'd because of the large size. Do
you need search & replace functionality? If not, there are a number of
"print to tiff" programs out there. Essentially, your doc is rendered as a
multi-page tiff. Might make it easier to store and work with.
 
> Missed that message. As I said, though it's installed, I've yet to try
> it. Is the one you have the freeware version? Again, if not, I can
> upload the freeware version to abf if you like.

No, the one I managed to find is not freeware. It's an older archived
version than the one currently available but it turned out to be
payware, too. :oP

I'd really, really appreciate it if you could post it there, to ABF.
I _might_ eventually find a copy of the free v2.2, but it would mean
hours of looking. I know, because when developers bring out all
versions under the same, identical name, one has to weed through many
to find a particular one. Not so if they put the version number in the
name itself. Look at the case below, as an example:

MP3-Info extension v3.3.17 had an EXE named: MP3ext33b17.exe.
A later version, v3.4b21, has this name: MP3ext34b21.exe

Let's say that v3.4b21 was actually shareware. No problem: as long as
one knows the EXE name of the old, freeware version, it's very easy to
find if someone stored it somewhere. (I know because I do this all the
time, hunting down freeware for apps that went shareware at some
point.) In this example, one would just do a simple search on the
earlier "MP3ext33b17.exe".

p.s., that's how I found GNMIDI, the midi medley maker where one
stitches midis together. The only freeware 32bit midi concatenator
out there with GUI, it seems.

However, I'm not having any luck with ActualDrawing. All versions
seem to have been named "ActualDrawing.exe", so after wading through
many hits, I have so far found only an older one, but it's not old enough.

So, I'd really appreciate it if you would post the one you have.
Hopefully it does what the new one does to any sort of degree. The
demo I'm trialing is pretty awesome, just perfect for modifying saved
html pages. I just now had a case where it took me 40 minutes to
modify a saved page - I needed to add crucial data. Then I did
something and overwrote it somehow, so it's gone! <argggh>. That
would not have happened if I'd been editing in AD because I would have
been done in less than 5 minutes rather than 40, and less chance to
get into stupid mode; I had a few minutes where I got lost in keeping
some of the coding straight! <tearing hair out> (Ah, the joys of
hard-coding!!!)

Thanks much!
 
Oops, sorry we're in the wrong thread, guys. I posted a passing
comment to Tiger which started this, my fault. We'll post any future
message back to the other thread ('kay, Tiger?). (Just one little
correction before finishing - filename incorrectly stated above,
correct one is: "ActualDrawingSetup.exe".)

Cheers everyone, and sorry 'bout that!
 
In message:[email protected],
fitwell said:
> Yes, this one I remember seeing, but to tell you the truth couldn't
> make out which file to dl. Also, since it mentioned Unix on the page,
> I couldn't tell if it would be a good app for me to have.
>
> I mean, look at the language on the site:
>
> "Pdftohtml is a tool based on the Xpdf package which translates pdf
> documents into html format." I mean, couldn't they have written this
> in plain English? Does console mean strictly that it's command line,
> or not? That's something I've never been clear on.

Historically, console is the correct name. Command line is the name given to
the applications that work by interacting through a command processor that
reads lines of text commands written at a command prompt. Hence the name
command line interface, or CLI.

> And then look at this:
>
> "Development Status: 4 - Beta
> Environment: Console (Text Based)
> Intended Audience: End Users/Desktop
> License: GNU General Public License (GPL)
> Natural Language: English
> Programming Language: C++
> Topic: Text Processing"
>
> Hey, I may be a power user, but I'm a power user NEWBIE. I don't
> understand either of the above bits of semi-technical language any
> better than a non-power-user newbie might.

Hey, that's standard metadata info. What's wrong with going to the
sourceforge help center (there is one) where you can read the different
classifications they use? ;-) BTW, a good place to learn about all these
classifications and find out about the software out there that can be used
freely under Unix/Linux[1] and/or win32 systems is http://freshmeat.net/.

> Then there are two versions to dl:
>
> pdftohtml pdftohtml-0.36 June 23, 2003 - Download
> windows binary pdftohtml-0.36 win32 June 23, 2003 - Download
>
> Which one would be best, do you know?
>
> I have Win98SE.

Win32. All the legacy Microsoft operating systems in the win9x line, and all
the presently sold Windows NT based systems, provide a programming API
(that's the abbreviation for application programming interface) called
win32, which uses 32-bit memory addressing. Win98SE is in the win9x line, so
the win32 binary is the one for you. (No patronizing intended, but you did
say you are learning the power user trade! :-)

[1] There is an ongoing confusion about calling Linux a Unix, while in
reality it is a good *imitation*. That is, it is a *Unix-like* operating
system. If it were Unix, SCO would have won the suit a long time ago. BSD
*is* Unix, but the proprietary parts in dispute were removed ten years ago.

> Thanks so much. Appreciate the info! I converted so many tutorials to
> PDF that I need to convert back to html, but the sites have
> disappeared. PDF is too slow when scanning data and it often crashes
> on my system. Now that I have an easy way to index all my tutorials
> and other documents, it's much easier to have them in original html
> format.

You are welcome. There is one caveat that can bite you back.

Because of the nature of PDF, all linebreaks will be converted to soft
linebreaks, <br> tags. Even long vertical space will be reduced to one
break! You need to go through the text and place real paragraph marks by
hand, and then replace the softbreaks by spaces or whatever. Do the first
stage with a good text editor such as the free Crimson Editor
(http://www.crimsoneditor.com/). Later, when you load a copy in Word,
select the text paragraph by paragraph and apply autoformat; that way you
can control the final result and the style applied to the text. Or replace
all <br> tags with spaces, but this would destroy any lists in the text.
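
The cleanup described above could be partly scripted. What follows is only a sketch of one possible heuristic, not something the converter provides. It treats a line as the end of a paragraph when it ends in sentence punctuation and is clearly shorter than the widest line, and it leaves anything that looks like a list item on its own line so lists survive. The function name `br_to_blocks`, the 0.8 ratio, and the list-marker pattern are all assumptions to tune against real output.

```python
import re

BR = re.compile(r"<br\s*/?>", re.IGNORECASE)
# Assumed list markers: "-", "*", a bullet character, or "1." / "1)".
LIST_ITEM = re.compile(r"^(?:[-*\u2022]|\d+[.)])\s+")
SENTENCE_END = re.compile(r"[.!?:][\"')\]]*$")


def br_to_blocks(fragment, short_ratio=0.8):
    """Split <br>-separated text into paragraph and list-item strings."""
    # Collapse internal whitespace so line lengths are comparable.
    lines = [" ".join(ln.split()) for ln in BR.split(fragment)]
    width = max((len(ln) for ln in lines), default=0)
    blocks, current = [], []

    def flush():
        if current:
            blocks.append(" ".join(current))
            current.clear()

    for line in lines:
        if not line:
            flush()  # an empty line (doubled break) ends the paragraph
        elif LIST_ITEM.match(line):
            flush()
            blocks.append(line)  # keep each list item as its own block
        else:
            current.append(line)
            # A short line ending in punctuation probably closes a paragraph.
            if SENTENCE_END.search(line) and len(line) < short_ratio * width:
                flush()
    flush()
    return blocks
```

Each returned string can then be wrapped in a paragraph tag or pasted into Word as one paragraph. A last line that happens to fill the full width will be merged with the next paragraph, so the result still needs a read-through by hand.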

Alejo
 
In message:[email protected],
[snip]

> Historically, console is the correct name. Command line is the name given to
> the applications that work by interacting through a command processor that
> reads lines of text commands written at a command prompt. Hence the name
> command line interface, or CLI.

Thanks for the info. Much appreciated.

[snip]
> Hey, that's standard metadata info. What's wrong with going to the
> sourceforge help center (there is one) where you can read the different
> classifications they use? ;-) BTW, a good place to learn about all these
> classifications and find out about the software out there that can be used
> freely under Unix/Linux[1] and/or win32 systems is http://freshmeat.net/.

If I had even just one more thing to learn, my brain would implode. My
plate is usu. on overflow so I do only what I can, and the above job
just isn't on the priority list. If we stopped to go and learn about
everything we don't know, we'd get nothing done as we'd get bogged
down in details. I have to choose what's important and what's not.
Thanks for the advice; I know it was very well-intentioned and
appreciate the thought.

> [1] There is an ongoing confusion about calling Linux a Unix, while in
> reality it is a good *imitation*. That is, it is a *Unix-like* operating
> system. If it were Unix, SCO would have won the suit a long time ago. BSD
> *is* Unix, but the proprietary parts in dispute were removed ten years ago.

You lost me, but I take your word for it. Thanks.

> You are welcome. There is one caveat that can bite you back.
>
> Because of the nature of PDF, all linebreaks will be converted to soft
> linebreaks, <br> tags. Even long vertical space will be reduced to one
> break! You need to go through the text and place real paragraph marks by
> hand, and then replace the softbreaks by spaces or whatever. Do the first
> stage with a good text editor such as the free Crimson Editor
> (http://www.crimsoneditor.com/). Later, when you load a copy in Word,
> select the text paragraph by paragraph and apply autoformat; that way you
> can control the final result and the style applied to the text. Or replace
> all <br> tags with spaces, but this would destroy any lists in the text.

Ahhhh, knew it was too good to be true. Nope. Then I'm trashing the
PDF files and doing without. Those that I can find the original
source HTML files for, great; those sites that are gone, are gone.
Too many to worry about. For sure, I'll no longer convert any in
future to PDF! PDF is too much trouble when you're trying to find
info quickly and when I'm opening/closing, everything keeps crashing.

I needed a more-or-less batch conversion utility to handle
graphics-rich PDFs and wouldn't have minded _some_ manual processing,
but there are just too many in number to deal with. I have too much
on my plate at any given time, more so now. And, of course, with the
number of graphics on each PDF, all the PDF to HTML apps didn't work
as they seemed to handle text only. Ah well, I know when to give up
graciously, too, and the moment has come re PDF>HTML.

Thanks just the same. I learned a lot.

Cheers!
 