Wanted - Batch extraction of links from HTML files

  • Thread starter: Simon

Simon

Does anyone know of a tool that will produce a 'report' of the links in a series
of HTML documents?

I am converting a heap of HTML to a keynote file and want a ready reference
to which links are in each source file.

Ideally, the output would list the links by file, e.g.:
file 1 - link 1, link 2, link 3
file 2 - link 1, link 2, link 3
file 3 - link 1, link 2, link 3

TIA

S
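
For anyone who would rather script this than install a tool, here is a minimal Python sketch of the per-file report described above. It assumes the files sit in one folder with .htm or .html extensions and that only <a href="..."> links matter. The names LinkCollector, links_in_html and report are made up for illustration, and it uses only the standard library.

```python
from html.parser import HTMLParser
from pathlib import Path


class LinkCollector(HTMLParser):
    """Collects the href value of every <a> tag it sees."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag and attribute names, so <A HREF=...> works too.
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def links_in_html(text):
    """Return the links of one HTML document, in document order."""
    collector = LinkCollector()
    collector.feed(text)
    collector.close()
    return collector.links


def report(folder, pattern="*.htm*"):
    """Build one 'file - link 1, link 2, ...' line per matching file."""
    lines = []
    for path in sorted(Path(folder).glob(pattern)):
        # Assumption: undecodable bytes are replaced rather than treated as errors.
        text = path.read_text(encoding="utf-8", errors="replace")
        lines.append(f"{path.name} - {', '.join(links_in_html(text))}")
    return "\n".join(lines)
```

Calling print(report("some_folder")) on a folder of your own should give one line per file in the layout asked for. Note that the pattern is matched case-sensitively on some systems, so files named .HTM may need a second pattern.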
 
Jetlinks does a nice job of importing URLs; it's the best I've found so
far. (IIRC it doesn't recognize #xxx at the end of a bookmark when
importing from text or HTML files.) You can import URLs into separate
categories/folders: file 1, file 2, etc., or whatever.

When all the URLs are imported, the Jetlinks file can be exported as a
Netscape bookmark file, which is an HTML file. That offers a number of
possibilities.

One possibility that might not be obvious is to use Netscape 4.x: in
edit bookmarks mode you can copy bookmark folders, or the entire file,
from the right-click menu. Pasting (into a text file, email, TreePad,
etc.) gives a text list of the URLs you copied and the folders/subfolders
they are in. Very quick, very handy. I've noted one glitch; more info
about that if needed.

Susan
 
Incidentally, I have written an HTML link extractor called
WebExtractor. It's freeware and available from my website:

http://mcky_boyz.tripod.com

It's only 200KB in size.

It lists the links exactly the way you want them.
 
Hi,
Visit http://www.erols.com/waynesof/bruce.htm and take a look at
HTMSTRIP. Among the many things this console utility can do, it can
produce a list of links from an HTML file. You'll need the /A option,
which accepts the following:

/A=spec says how to handle <A...> links. "spec" is one of:
  SITE    = give site name
  FSITE   = give site name (full url)
  SITEFN  = give site as footnote
  FSITEFN = full url site as footnote
  NONE    = don't show (default)
  SYMBOL  = use defined symbol instead

Here's an example use:

J:\>htmstrip /A=FSITEFN ctv.htm
HTMSTRIP (/? for help) (c)2002 Bruce Guthrie, Wayne Software Rev 08/10/2002
Reading J:\HTMSTRIP.INI...
02:04:03: Reading J:\HTMSTRIP.INI... 11 filters
Options: /WIDTH=80 /-FORCE /RULE=- /-RSPACE /A=FSITEFN /IMG=NONE /MAP=NONE
/EXT=.OUT /INDENT /INPUT /-SPACES /TABLE /-WARNINGS /-ALL
/BORDER=T /BUFF=1 /Cj:\ /ATTR=-H-S
02:04:03: Reading J:\HTMSTRIP.INI /CP1... 331 entity lookups
02:04:04: Processing HTML files... Press Esc to abort early
02:04:04: CTV.HTM --> CTV.OUT
Input file size: 46,788 bytes (100%)
02:04:05: Done
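
The run above handles a single file. To repeat it over a whole folder, one option is a small Python driver like the sketch below. It assumes htmstrip is on the PATH and simply repeats the same /A=FSITEFN call for every .htm file; whether HTMSTRIP accepts wildcards itself is not checked here. The function names htmstrip_commands and run_all are hypothetical.

```python
import subprocess
from pathlib import Path


def htmstrip_commands(folder, exe="htmstrip"):
    """Build one command per .htm file, mirroring the single-file call above."""
    return [
        [exe, "/A=FSITEFN", path.name]
        for path in sorted(Path(folder).glob("*.htm"))
    ]


def run_all(folder):
    """Run each command from inside the folder, stopping on the first failure."""
    for cmd in htmstrip_commands(folder):
        # cwd=folder so the bare file name resolves as it did in the example run.
        subprocess.run(cmd, cwd=folder, check=True)
```

Going by the log above (CTV.HTM --> CTV.OUT), each input should end up with its own .OUT file, which gives the per-file grouping that was asked for.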

Joe
http://groups.yahoo.com/group/JoeCaverlysProgrammingStuff
 
Incidentally, I have written an HTML link extractor called
WebExtractor. It's freeware and available from my website:

It's only 200KB in size.
It lists the links exactly the way you want them.

What I find is often a problem for this sort of software is line
wrapping. Suppose one was searching a text file of newsgroup posts
and has ....

Would it "extract"

http://mcky_boyz.tripod.com/mckysbreakout/

or

http://mcky_boyz.tripod.com/mckysbreakout/webex.htm ?

Regards, John.
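
The wrapping problem is easy to reproduce with a simple pattern-based scanner. The Python sketch below is only an illustration: the regular expression and the find_urls helper are assumptions, not how WebExtractor or any other tool mentioned here works, and the sample text assumes the line break fell right after the last slash.

```python
import re

# Assumption: a URL runs from http:// or https:// up to the next
# whitespace, quote, or angle bracket.
URL_RE = re.compile(r"https?://[^\s<>\"']+")


def find_urls(text):
    """Return every URL-like run of characters found in plain text."""
    return URL_RE.findall(text)


one_line = "See http://mcky_boyz.tripod.com/mckysbreakout/webex.htm for details"
wrapped = "See http://mcky_boyz.tripod.com/mckysbreakout/\nwebex.htm for details"

print(find_urls(one_line))  # ['http://mcky_boyz.tripod.com/mckysbreakout/webex.htm']
print(find_urls(wrapped))   # ['http://mcky_boyz.tripod.com/mckysbreakout/']
```

A scanner like this stops at the line break, so it would report the shorter URL. Rejoining wrapped lines first is possible, but then the scanner has to guess whether the next line continues the URL or starts ordinary text.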
 