Is it possible to programmatically proxy a web request through C#?

  • Thread starter: dtown22

dtown22

I am trying to make a copy of all the files involved in a web request
to a specific website (e.g. you enter groups.google.com, hit Enter
once, and a copy is made of every file involved in loading that
webpage).

At first I figured that I could do everything programmatically (e.g.
open groups.google.com in my C# app, and then loop through all the
hrefs and grab them one by one), but I would like to be able to write
an application that is browser independent. So in a sense, my
application would need to sniff out the files involved and copy them
to a local dir.

I have looked into packet sniffing, but that seems a little complex,
as I am not concerned with each individual packet, but rather each
individual file that is downloaded as a result of loading a web page.

Does anyone have any ideas? Also, I forgot to note that I am rather
new to C#, so if there is some obvious way of implementing this, I
apologize in advance. Thanks!
 
Your first approach is the right one. You need to download the HTML for a
page, then search the HTML for anchors, links, images, etc. and download
each of these independently. For each one you will also need to rewrite the
target URL in the saved HTML so that it points at your disk location
rather than the original site.
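
The steps above can be sketched roughly as follows. This is only a minimal
illustration, written in Python with the standard library rather than C#, and
the same steps would carry over to a C# program. The names ResourceCollector,
collect_resources, local_path and download, and the "mirror" directory layout,
are made up for this example. It only looks at href and src attributes,
ignores query strings when choosing a file name, and does not rewrite the
saved HTML.

```python
import os
import posixpath
import urllib.request
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse


class ResourceCollector(HTMLParser):
    """Collects absolute http(s) URLs found in href/src attributes."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.urls = []

    def handle_starttag(self, tag, attrs):
        for name, value in attrs:
            if name in ("href", "src") and value:
                # Resolve relative links against the page's own URL.
                absolute = urljoin(self.base_url, value)
                scheme = urlparse(absolute).scheme
                if scheme in ("http", "https") and absolute not in self.urls:
                    self.urls.append(absolute)


def collect_resources(html, base_url):
    """Return the list of resource URLs referenced by an HTML string."""
    collector = ResourceCollector(base_url)
    collector.feed(html)
    return collector.urls


def local_path(url, root="mirror"):
    """Map a URL to a file path under `root` (hypothetical layout)."""
    parts = urlparse(url)
    path = parts.path
    if path == "" or path.endswith("/"):
        path += "index.html"  # directory-style URLs need a file name
    return posixpath.join(root, parts.netloc, path.lstrip("/"))


def download(url, root="mirror"):
    """Fetch one URL and save it at its local path; returns that path."""
    target = local_path(url, root)
    os.makedirs(os.path.dirname(target), exist_ok=True)
    with urllib.request.urlopen(url) as response, open(target, "wb") as out:
        out.write(response.read())
    return target
```

To mirror a page you would download it, pass its HTML to collect_resources,
call download on each URL returned, and then replace each original URL in the
saved HTML with the matching local_path.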

Additionally, you should review the use of robots.txt and the generally
accepted guidelines for creating robots (which is pretty much what you are
doing: creating a crawler). You should adhere to these rules to avoid
impacting the web sites of others. An aggressive crawler can degrade a
site's performance or cause errors for other visitors, so it would be very
bad form to start crawling somebody's web site without following them.
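
As a rough illustration of honouring robots.txt, here is a small sketch, again
in Python using the standard urllib.robotparser module. The helper names
load_rules and polite_delay, the "MyMirrorBot" user agent and the example.com
URLs are placeholders invented for this example. A real crawler would fetch
the site's actual /robots.txt before requesting anything else.

```python
from urllib.robotparser import RobotFileParser


def load_rules(robots_txt):
    """Build a parser from the text of a robots.txt file."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


def polite_delay(parser, agent, default=1.0):
    """Seconds to wait between requests: the site's Crawl-delay if it
    declares one, otherwise a conservative default (an assumption here)."""
    delay = parser.crawl_delay(agent)
    return delay if delay is not None else default


# Sample rules for illustration only, not taken from any real site.
SAMPLE = """\
User-agent: *
Crawl-delay: 2
Disallow: /private/
"""

rules = load_rules(SAMPLE)
agent = "MyMirrorBot"  # hypothetical user-agent name

for url in ("http://example.com/public/page.html",
            "http://example.com/private/data.html"):
    if rules.can_fetch(agent, url):
        print("allowed:", url)
    else:
        print("skipped:", url)
```

Before each download, check can_fetch for the URL and sleep for polite_delay
seconds between requests, so that the crawler stays out of disallowed paths
and does not hammer the server.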
 