HttpWebRequest Screen Scraping


Guest

Please help - I'm having a problem similar to one in another post: making multiple requests to a site from a client app to perform screen scraping. Somehow I need to maintain and reuse the connection and/or session to perform a series of steps on the site.

The twist, compared to the other post, is that the site requires a client-side certificate upon first connecting. If the site does not think this has been done, any request will fail. I have been able to do that successfully with HttpWebRequest.ClientCertificates.Add(), but the rest of the code doesn't seem to work.
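For reference, the certificate part looks roughly like this (just a sketch - the path and file name are placeholders for wherever the cert actually lives):

using System.Net;
using System.Security.Cryptography.X509Certificates;

// Sketch only: load the client certificate from a file (placeholder path)
// and attach it to the request before calling GetResponse().
X509Certificate cert = X509Certificate.CreateFromCertFile(@"C:\certs\client.cer");

HttpWebRequest req = (HttpWebRequest)WebRequest.Create("https://www.mysite.com/");
req.ClientCertificates.Add(cert);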

The steps that need to be done go like this...

1 - Connect to a site which requires a client-side cert.
2 - Log into the site with a user name and password (simple form screen scrape).
3 - Submit some information to the site (another simple form screen scrape).

This works fine when I manually enter the full URLs + data in IE one step at a time, but it doesn't want to work programmatically. The code (with some declaration and implementation details assumed)...

CookieCollection cookies = null;

Uri uri1 = new Uri("https://www.mysite.com/");

HttpWebRequest req1 = (HttpWebRequest)WebRequest.Create(uri1);
req1.KeepAlive = true;
req1.ClientCertificates.Add(cert);
req1.CookieContainer = new CookieContainer();
HttpWebResponse resp1 = (HttpWebResponse)req1.GetResponse();

if ((req1.HaveResponse) && (resp1.StatusCode == HttpStatusCode.OK))
{
if (resp1.Cookies.Count > 0)
cookies = resp1.Cookies;

Uri uri2 = new Uri("https://www.mysite.com/login");

HttpWebRequest req2 = (HttpWebRequest)WebRequest.Create(uri2);
req2.Method = "POST";
req2.KeepAlive = true;
req2.ClientCertificates.Add(cert);
req2.CookieContainer = new CookieContainer();
req2.CookieContainer.Add(cookies);   // carry the cookies captured from request 1
...
// write login data to the request stream
...
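// (Sketch of what the elided part might look like - the field names
// "username" and "password" are assumptions, not the site's real form fields;
// needs System.Text and System.IO.)
byte[] loginData = Encoding.ASCII.GetBytes("username=myuser&password=mypass");
req2.ContentType = "application/x-www-form-urlencoded";
req2.ContentLength = loginData.Length;
using (Stream loginStream = req2.GetRequestStream())
{
loginStream.Write(loginData, 0, loginData.Length);
}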

HttpWebResponse resp2 = (HttpWebResponse)req2.GetResponse();
if ((req2.HaveResponse) && (resp2.StatusCode == HttpStatusCode.OK))
{
if (resp2.Cookies.Count > 0)
cookies = resp2.Cookies;

Uri uri3 = new Uri("https://www.mysite.com/submitData");
HttpWebRequest req3 = (HttpWebRequest)WebRequest.Create(uri3);
req3.Method = "POST";
req3.KeepAlive = true;
req3.ClientCertificates.Add(cert);
req3.CookieContainer = new CookieContainer();
req3.CookieContainer.Add(cookies);   // carry the cookies captured from request 2

...
// write data for submitting to the request stream
...
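// (Sketch - "field1=value1&field2=value2" stands in for whatever the submit
// form actually posts; same pattern as the login request above.)
byte[] submitData = Encoding.ASCII.GetBytes("field1=value1&field2=value2");
req3.ContentType = "application/x-www-form-urlencoded";
req3.ContentLength = submitData.Length;
using (Stream submitStream = req3.GetRequestStream())
{
submitStream.Write(submitData, 0, submitData.Length);
}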

HttpWebResponse resp3 = (HttpWebResponse)req3.GetResponse();
if ((req3.HaveResponse) && (resp3.StatusCode == HttpStatusCode.OK))
{
// ok, process the response
}
}
}

Note that if I do not add the certificate for requests 2 and 3, they fail miserably; however, adding it only gives me a successful connection, not the actual submit that I wanted to execute. Also note that the URLs are slightly different, though the base is the same.

The site does set a session ID cookie ("JSESSIONID" - probably a Java servlet session cookie), but even though I reuse it, the responses for both request 2 and request 3 come back with no cookies (a good indication the session has been lost). Any ideas?
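For what it's worth, this is the kind of debugging check I can add after request 1 to see what the container actually holds (just a sketch, reusing uri1 from the code above):

// Dump every cookie the container holds for the site, to confirm that
// JSESSIONID is really present before request 2 goes out.
foreach (Cookie c in req1.CookieContainer.GetCookies(uri1))
{
Console.WriteLine("{0} = {1}", c.Name, c.Value);
}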
 
1) Why are you creating a new cookie container, and adding the cookies to it, for each request? Why don't you just reuse the same container from the first request? (See the sketch after point 2.)

2) Maybe you are already doing this, but I didn't see in your code where you are disposing of your response (using Response.Close()).
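Roughly what I mean for (1), as a sketch (same placeholder host and cert variable as in your code):

// One CookieContainer, created once and handed to every request. The
// framework then sends back whatever cookies the server has set
// (JSESSIONID included) on each subsequent request.
CookieContainer jar = new CookieContainer();

HttpWebRequest req1 = (HttpWebRequest)WebRequest.Create("https://www.mysite.com/");
req1.ClientCertificates.Add(cert);
req1.CookieContainer = jar;
HttpWebResponse resp1 = (HttpWebResponse)req1.GetResponse();
// ... read what you need ...
resp1.Close();

HttpWebRequest req2 = (HttpWebRequest)WebRequest.Create("https://www.mysite.com/login");
req2.Method = "POST";
req2.ClientCertificates.Add(cert);
req2.CookieContainer = jar;   // same instance - no copying of cookies needed
// ... write the login data, GetResponse(), Close(), and likewise for request 3 ...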

thanks

feroze
===========================
This post is provided as-is and confers no rights.
 
1. I'm not sure what you meant, but I'm assuming

req2.CookieContainer = req1.CookieContainer;

rather than

req2.CookieContainer = new CookieContainer();
req2.CookieContainer.Add(cookies);

If so, I suppose I can try it, though I'm not sure I understand the difference beyond there not being two separate instances - does that really make a difference?

2. Yes, I did close the response objects, though I did not show that in the code here (sorry).
 