System.Net.WebClient screen scraping: how to gracefully handle 403 (and other) errors?

Guest · Jan 10, 2007

I've written a very small ASP.NET page to scrape thousands of pages of
content based on database IDs. It loops through a dataset to get the IDs. It
worked well in testing but now I am getting an annoying 403 error that
causes the script to abort halfway through my download.

I am wondering if there is a way in ASP.NET to have my code ignore 403
errors and other network errors, catch the error, and iterate to the next ID
in the dataset rather than aborting the whole job.

My code appears below. Thank you in advance.

-KF

string strConnection = ConfigurationSettings.AppSettings["connwhatever"];
SqlConnection conn = new SqlConnection(strConnection);
string query = // [my query];
SqlDataAdapter a = new SqlDataAdapter(query, conn);
DataSet s = new DataSet();
a.Fill(s);

int counter = 0;

foreach (DataRow dr in s.Tables[0].Rows)
{
    counter++;

    System.Net.WebClient wc = new WebClient();
    string strData = wc.DownloadString(
        "http://whatever.org/article.asp?articleid=" + dr[0].ToString());

    FileStream fstream = new FileStream(
        @"c:\whateverpath\" + dr[0].ToString() + ".htm",
        FileMode.Create, FileAccess.Write);
    StreamWriter stream = new StreamWriter(fstream);
    stream.Write(strData);
    stream.Close();
    fstream.Close();
}

bruce barker · Jan 10, 2007

please read chapter on try/catch

-- bruce (sqlwork.com)

Guest · Jan 10, 2007

Understand try/catch generally. What event(s) should I be trying to catch?

Thank you,
-KF

Steven Cheng[MSFT] · Jan 10, 2007

Hello KF,

Based on your description, you're using the WebClient class to request many
web pages programmatically from ASP.NET page code. When one of those
requests throws an exception, the loop in your page breaks. Is that correct?

A 403 error normally means that an authorization check failed on the server
side. I'm not sure whether there is some other particular scenario here.
However, if you simply want to capture and ignore such errors and continue
the loop, you can add a try/catch block around the call to the WebClient
class's DownloadXXX method. If an exception is caught, you can ignore it and
skip the current iteration, e.g.

=======================
foreach (DataRow dr in s.Tables[0].Rows)
{
    counter++;

    System.Net.WebClient wc = new WebClient();
    string strData;   // declared outside the try so later code can use it

    try
    {
        strData = wc.DownloadString(
            "http://whatever.org/article.asp?articleid=" + dr[0].ToString());
    }
    catch (Exception)
    {
        // ignore the error and skip to the next row
        continue;
    }

    ....................................
}
=========================
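
As a slightly fuller illustration of the same idea, here is a minimal sketch of a loop that skips failed downloads and remembers which IDs failed. The class name Scraper, the method SaveAll, and the injected fetch delegate are hypothetical names chosen for this example and do not appear in the original code. The sketch assumes that only WebException and IOException should be skipped.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Net;

static class Scraper
{
    // Downloads one page per ID and saves it as <outputDir>\<id>.htm.
    // Returns the IDs that failed so they can be logged or retried later.
    // "fetch" is passed in so the loop can be exercised without a network;
    // in the page it would be WebClient.DownloadString.
    public static List<string> SaveAll(IEnumerable<string> ids, string baseUrl,
                                       string outputDir, Func<string, string> fetch)
    {
        List<string> failed = new List<string>();
        foreach (string id in ids)
        {
            try
            {
                string html = fetch(baseUrl + id);
                File.WriteAllText(Path.Combine(outputDir, id + ".htm"), html);
            }
            catch (WebException)
            {
                failed.Add(id);   // network or HTTP error: note it and move on
            }
            catch (IOException)
            {
                failed.Add(id);   // file could not be written: note it and move on
            }
        }
        return failed;
    }
}
```

With a WebClient instance wc, the fetch argument can simply be wc.DownloadString. Returning the failed IDs rather than silently dropping them makes it possible to rerun the job for just those rows.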

Does this work for your scenario?

Sincerely,

Steven Cheng

Microsoft MSDN Online Support Lead

This posting is provided "AS IS" with no warranties, and confers no rights.

Steven Cheng[MSFT] · Jan 10, 2007

WebClient and HttpWebRequest normally throw a System.Net.WebException.
However, any exception can be handled through the base class "Exception".
So you can use either

try
{
}catch(Exception)
{

}

or

try
{
}catch(WebException)
{

}
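
Catching WebException specifically also makes it possible to tell a 403 apart from other failures, because the exception carries the server's response when one was received. The following is a minimal sketch of that approach. The class ScrapeErrors and its methods StatusOf, Describe, and TryDownload are hypothetical helpers written for this example; only WebClient.DownloadString, WebException.Response, WebException.Status, and HttpWebResponse.StatusCode are framework members.

```csharp
using System;
using System.Net;

static class ScrapeErrors
{
    // Returns the HTTP status code carried by a WebException, or null when
    // no HTTP response was received (DNS failure, timeout, refused connection).
    public static HttpStatusCode? StatusOf(WebException ex)
    {
        HttpWebResponse response = ex.Response as HttpWebResponse;
        if (response == null)
            return null;
        return response.StatusCode;
    }

    // Builds a short description for a log: "HTTP 403 Forbidden" when the
    // server answered, otherwise the WebExceptionStatus name.
    public static string Describe(WebException ex)
    {
        HttpStatusCode? status = StatusOf(ex);
        if (status.HasValue)
            return "HTTP " + (int)status.Value + " " + status.Value;
        return "no HTTP response (" + ex.Status + ")";
    }

    // Downloads one page. Returns false and fills "error" instead of throwing
    // when the request fails, so a calling loop can move on to the next ID.
    public static bool TryDownload(string url, out string content, out string error)
    {
        content = null;
        error = null;
        try
        {
            using (WebClient wc = new WebClient())
            {
                content = wc.DownloadString(url);
            }
            return true;
        }
        catch (WebException ex)
        {
            error = Describe(ex);
            return false;
        }
    }
}
```

Inside the foreach loop, a false return from TryDownload would be followed by continue. Exceptions other than WebException are deliberately left unhandled here so that programming errors still surface.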

Sincerely,

Steven Cheng

Microsoft MSDN Online Support Lead

This posting is provided "AS IS" with no warranties, and confers no rights.

Gaurav Vaish (www.EdujiniOnline.com) · Jan 10, 2007

You can specify error documents for specific error-codes.

In your web.config, add the following entries in <system.web> section:

<customErrors>
<error statusCode="403" redirect="403.aspx"/>
</customErrors>

Note that a 403 is generally issued by the web server and not by the
ASP.NET engine. At times, when authentication fails, a 403 may be returned
by an IHttpModule, such as one of the authentication modules (NTLM,
Kerberos, Digest, etc.).

Guest · Jan 10, 2007

This worked great for my scenario. Thanks very much to everyone for the
timely assistance.

-KF
