website offline reader (aka site ripper)

  • Thread starter: Mike

Mike wrote:
Anyone know of a decent freeware solution?

WinHTTrack
MightyKitten

[(Maybe) OT]
Anyone know of one that can get around the 'robots.txt' that stops
robots/spiders from ripping a site?

I've been trying to download and archive the BlackViper Windows XP site
http://www.blackviper.com
but can't do it with HTTrack...

TIA,
Wiseguy
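
For context, robots.txt is only an advisory file: it publishes rules that
well-behaved spiders such as HTTrack choose to obey, and nothing on the
server enforces them. If you want to see exactly what a site's robots.txt
blocks, Python's standard robotparser module will tell you -- a minimal
sketch, with "MyMirrorBot" as a made-up user-agent string:

    # Minimal sketch: fetch and query a site's robots.txt rules.
    # Uses only the Python standard library.
    from urllib import robotparser

    rp = robotparser.RobotFileParser()
    rp.set_url("http://www.blackviper.com/robots.txt")
    rp.read()  # download and parse the robots.txt file

    # Ask whether a crawler identifying itself as "MyMirrorBot"
    # (a placeholder name) may fetch the front page.
    print(rp.can_fetch("MyMirrorBot", "http://www.blackviper.com/"))

If can_fetch() prints False for ordinary pages, that is the rule HTTrack
is respecting by default.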
 
Charles Cox said:
Anyone know of one that can get around the 'robots.txt' that stops
robots/spiders from ripping a site?

Although I've not used them myself, I think there are three ways to get
round "robots.txt".

1. Under "Set Options > Scan Rules", add "robots.txt" to the list of files
to be excluded from the scan.

2. Under "Set Options > Spider" there is a menu for dealing with
"robots.txt" (Try c)
a. follow robot.txt rules
b. robots.txt except wizards
c. no robots.txt rules

3. If the above do not work, you could use Xenu to make a complete list of
the links and enter the list in HTTrack to download them individually. In
HTTrack click "Set Options > Limits" and set the Maximum Mirroring Depth to
1 and the Maximum External Depth to Zero. I just did a check with Xenu and
there are 1429 links on the Viper site -- no doubt some of these can be
discarded.
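
A rough sketch of that manual approach in Python, in case it helps: read a
plain-text list of links (such as one exported from Xenu) and fetch each
page individually, which sidesteps the spider and its robots.txt handling
entirely. The file name links.txt and the mirror folder are placeholders,
and the naming scheme is deliberately crude:

    # Minimal sketch: download every URL listed in links.txt into ./mirror,
    # one flat file per page. Standard library only.
    import os
    from urllib.parse import urlparse
    from urllib.request import urlopen

    os.makedirs("mirror", exist_ok=True)      # output folder (placeholder)

    with open("links.txt") as f:              # one URL per line, e.g. a Xenu export
        urls = [line.strip() for line in f if line.strip()]

    for url in urls:
        # Build a flat, crude file name from the URL path.
        name = urlparse(url).path.strip("/").replace("/", "_") or "index.html"
        try:
            with urlopen(url) as resp, open(os.path.join("mirror", name), "wb") as out:
                out.write(resp.read())
            print("saved", url)
        except OSError as err:                # URLError is a subclass of OSError
            print("skipped", url, "-", err)

Each page is saved flat under mirror/, so internal links are not rewritten
the way HTTrack rewrites them; treat this only as a fallback for grabbing
the raw pages.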

In case you do not have it, the URL for Xenu is
http://home.snafu.de/tilman/xenulink.html

(I'm using HTTrack v3.05 -- other versions may vary.)
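
If you end up driving HTTrack from the command line rather than the
WinHTTrack GUI, the same spider setting is exposed as an option. Here is a
minimal sketch calling it from Python, assuming the httrack binary is on
your PATH; the -O and --robots=0 options are quoted from memory of the
command-line docs, so confirm them against httrack --help for your version:

    # Minimal sketch: run the httrack command-line client from Python.
    # --robots=0 should tell the spider never to follow robots.txt rules;
    # confirm the exact flags with `httrack --help` on your installation.
    import subprocess

    subprocess.run(
        ["httrack", "http://www.blackviper.com/",
         "-O", "blackviper_mirror",   # output directory (placeholder name)
         "--robots=0"],               # 0 = never obey robots.txt
        check=True,                   # raise if httrack exits with an error
    )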

(I tried downloading Viper with SiteSnagger, without success.)

===

Frank Bohan
¶ Never tell your computer that you're in a hurry.
 
Thank you very much, Frank.

I'll try your solution and Xenu also!

Wiseguy

Don't judge a man until you've walked a mile in his shoes... At least then
he'll be a mile away and barefoot.
 