Bert said:
In Joel
I use Calibre, but it doesn't do too well on most PDFs, since they
apparently don't contain much in the way of formatting hints for Calibre
to use.
PDF is actually a page description language, descended from
PostScript (a real programming language). So it's more than
"just a few hints". It contains all the commands necessary
to produce vector graphics or bitmap graphics or fonts
with glyphs or... you name it.
So to say it doesn't have "formatting hints" is misleading.
It works at an entirely different level (i.e. not a word
processor format).
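To make "commands" concrete, here's a rough Python sketch that picks
apart a tiny page content stream. The stream is made up by hand for
illustration. The operators (BT, ET, Tf, Td, Tj, m, l, S) are from the
PDF reference, but the tokenizer is deliberately naive: no nested
parentheses, no arrays, no dictionaries, and it only knows the eight
operators listed.

```python
import re

# A tiny hand-written content stream (illustrative, not from a real file).
CONTENT = """
BT
/F1 12 Tf
72 700 Td
(Hello) Tj
ET
72 690 m
200 690 l
S
"""

# The only operators this sketch recognizes.
OPERATORS = {
    "BT": "begin text object",
    "ET": "end text object",
    "Tf": "select font and size",
    "Td": "move text position",
    "Tj": "show text string",
    "m": "start a new path at a point",
    "l": "add a straight line to the path",
    "S": "stroke the path",
}


def tokenize(stream):
    # Naive split into (strings), /names, and everything else.
    return re.findall(r"\([^)]*\)|/[^\s]+|[^\s()]+", stream)


def operations(stream):
    """Pair each operator with the operands that precede it (postfix order)."""
    ops, operands = [], []
    for tok in tokenize(stream):
        if tok in OPERATORS:
            ops.append((tok, operands))
            operands = []
        else:
            operands.append(tok)
    return ops


if __name__ == "__main__":
    for op, args in operations(CONTENT):
        print(f"{op:>2} {args} -> {OPERATORS[op]}")
```

Notice that nothing in there says "paragraph" or "heading". It's just
"pick a font, move here, paint these glyphs, draw a line".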
If you've ever tried to write conversion programs to
go from one format to another, you discover rather quickly
that some of the things you've been asked to translate
have no exact equivalent in the other environment. And
then your "translation" looks pretty dopey.
It's like when someone asks for a "PDF to Microsoft Word"
translator. Well, PDF is mostly focused on graphics primitives.
The "letters" may not be associated with each other any more
(you can have trouble telling where words begin and end,
where spaces should go and so on). When I hear someone ask
for a "PDF to Microsoft Word" translator, it just makes me wince
thinking about it.
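Here's a minimal Python sketch of the kind of guessing a converter
ends up doing when all it has is individually placed glyphs. The tuple
layout (char, x, width, font_size) and the gap_factor threshold are
assumptions made up for this example, not how any particular tool
does it.

```python
def glyphs_to_text(glyphs, gap_factor=0.3):
    """Rebuild a line of text from glyphs on one baseline.

    glyphs: list of (char, x, width, font_size) tuples, in any order.
    A space is guessed wherever the horizontal gap between two glyphs
    exceeds gap_factor * font_size.
    """
    ordered = sorted(glyphs, key=lambda g: g[1])  # left to right
    out = []
    prev_end = None
    for ch, x, width, size in ordered:
        if prev_end is not None and x - prev_end > gap_factor * size:
            out.append(" ")
        out.append(ch)
        prev_end = x + width
    return "".join(out)


# Normally spaced glyphs, stored out of order as a PDF is free to do.
normal = [("b", 12.0, 5.0, 10.0), ("t", 0.0, 3.0, 10.0),
          ("e", 17.1, 5.0, 10.0), ("o", 3.2, 5.0, 10.0)]

# A single word set with wide letter spacing.
tracked = [("t", 0.0, 3.0, 10.0), ("o", 7.0, 5.0, 10.0),
           ("b", 16.0, 5.0, 10.0), ("e", 25.0, 5.0, 10.0)]

if __name__ == "__main__":
    print(repr(glyphs_to_text(normal)))
    print(repr(glyphs_to_text(tracked)))
```

The first case comes out as "to be". The second is one word with wide
letter spacing, and the same heuristic turns it into "t o b e". No
single threshold gets both right.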
Like, imagine converting this (easily and precisely expressed
in a PDF) into Microsoft Word. PDF has the ability to display
text characters along a mathematical path. Does Microsoft
Word have that capability? There might not be an
exact way to do this in Microsoft Word (without cheating,
and just inserting a picture).
http://candlvarsityjackets.com/images/patches/arc-reverse-arc-straight-text.jpg
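For the PDF side of that, a small Python sketch of the arithmetic:
each character gets its own position and rotation along a circular
arc. The function name and parameters are made up for illustration,
and it cheats by using a fixed angular step per character instead of
real glyph widths. In a PDF, each placement could then be written out
as its own text matrix (the Tm operator) before showing the glyph.

```python
import math


def place_on_arc(text, cx, cy, radius, start_deg, step_deg):
    """Lay characters out clockwise along a circle centered at (cx, cy).

    Returns a list of (char, x, y, rotation_deg). The rotation makes
    each glyph's baseline tangent to the circle at its position.
    """
    placed = []
    for i, ch in enumerate(text):
        angle = math.radians(start_deg - i * step_deg)
        x = cx + radius * math.cos(angle)
        y = cy + radius * math.sin(angle)
        rotation = math.degrees(angle) - 90.0  # tangent to the circle
        placed.append((ch, x, y, rotation))
    return placed


if __name__ == "__main__":
    for ch, x, y, rot in place_on_arc("VARSITY", 0, 0, 100, 120, 10):
        print(f"{ch}: x={x:7.2f} y={y:7.2f} rotate={rot:6.1f}")
```

A word processor that thinks in lines and paragraphs has nowhere to
put seven independent positions and rotations like these.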
This is one of my favorites. Originally written, by hand,
in the PostScript language. Then distilled to PDF (PostScript's
descendant). You can open this first link easily
in Acrobat, to see what it looks like. Engineers use this
for certain electrical design problems. Normally, the university
book store charges an "arm and leg" for a sheet of
this graph paper, which is why people were enamored
with printing their own copies on a laser printer.
(Thirty years ago, we would have bought this dude a beer!)
http://ecee.colorado.edu/~kuester/smith/smith.pdf
(The PostScript version is next - you can open this in Notepad,
and read the comments by the author... This is one of
my favorite hand-hewn diagrams, because it's so damn
clever. Try translating this into Microsoft Word. Because
this is hand-hewn, the code isn't obfuscated.)
http://ecee.colorado.edu/~kuester/smith/smith.ps
Doing translations is not easy - especially when
every user who uses your tool comments "Huh! It didn't
do a very good job". Well, of course not, the two formats are
not even conceptually close. They can both have "text strings"
in them, but in the case of PDF, that's not essential.
In fact, some PDFs store text as a bunch of tiny pixmaps,
which is most annoying as a technique. If you wanted to
reverse translate such a PDF to Microsoft Word, you'd have
to do OCR to get there.
Some of the things done in PDF are done on purpose to make
the documents less useful (i.e. so you can't steal the content).
For example, one such hack I undid causes the "text copy"
buffer to be filled with garbage if a user attempts to copy
a passage from the document. (That is different from the
"do not copy" security setting. It's an additional form of security.)
It's when the PDF is purposely "disrupted" with that objective
in mind that translation could be pretty damn difficult.
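One way a trick like that can work, sketched in Python. This is an
assumption for illustration, not necessarily the hack described above.
The page stores character codes, the embedded font decides which shape
each code draws, and a separate code-to-Unicode table is what
copy/paste relies on. The tables below are toy data. If the second
table is scrambled, the page still looks right but copying gives junk.

```python
# Hypothetical embedded font: character code -> shape drawn on the page.
GLYPHS = {1: "H", 2: "e", 3: "l", 4: "o"}

# Honest code -> Unicode table, as a well-behaved PDF would carry.
HONEST_TO_UNICODE = {1: "H", 2: "e", 3: "l", 4: "o"}

# Deliberately wrong table: rendering is unaffected, copying is not.
SCRAMBLED_TO_UNICODE = {1: "q", 2: "#", 3: "Z", 4: "7"}

# What the page actually stores: a run of character codes.
PAGE_CODES = [1, 2, 3, 3, 4]


def render(codes):
    """What the reader sees on screen or paper."""
    return "".join(GLYPHS[c] for c in codes)


def copy_text(codes, to_unicode):
    """What lands in the clipboard."""
    return "".join(to_unicode[c] for c in codes)


if __name__ == "__main__":
    print(render(PAGE_CODES))                           # Hello
    print(copy_text(PAGE_CODES, HONEST_TO_UNICODE))     # Hello
    print(copy_text(PAGE_CODES, SCRAMBLED_TO_UNICODE))  # q#ZZ7
```

Undoing it means rebuilding the correct table, or falling back to OCR
on the rendered page.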
So when you look at the actual programming language used in
a PDF, most of the code in there is to "obfuscate" what
is going on - documents would be much smaller, byte-wise,
if there weren't so many creative efforts to stop translation
or copying.
(All about PDF - 1310 pages)
http://wwwimages.adobe.com/www.adobe.com/content/dam/Adobe/en/devnet/pdf/pdfs/pdf_reference_1-7.pdf
(The predecessor, PostScript printer language - years ago,
the only way to get a copy of this was to purchase a printed copy.
The irony... I still have mine, but the back is cracked.
Now you can download the damn thing, for free.)
http://www.adobe.com/products/postscript/pdfs/PLRM.pdf
As a result of my own pitiful efforts to write translators,
I'm most impressed when someone else does one, and they
even get half-close to a successful translation. Some
things are easy to translate, and some... not.
Paul