Screen Scraping Issue

  • Thread starter: Knoxy

Hi guys,
I've got this working, but I have issues whenever there's any kind of C#
code on the page that I'm trying to scrape (pages within my site -
it's for a print-page view, basically). I get this error:

The remote server returned an error: (500) Internal Server Error

Now, I've stepped through the page that it's calling, and it doesn't hit
an error. Any ideas? Code below:

private string ReadHtmlFromUrl(string url)
{
    try
    {
        System.Net.WebRequest httpReq = System.Net.WebRequest.Create(url);
        System.Net.WebResponse httpRes = httpReq.GetResponse();

        System.IO.Stream stream = httpRes.GetResponseStream();
        byte[] buffer = new byte[1024];
        System.Text.StringBuilder sb = new System.Text.StringBuilder();

        int bytesRead;
        while ((bytesRead = stream.Read(buffer, 0, buffer.Length)) != 0)
        {
            // Only decode the bytes actually read on this pass,
            // not the whole buffer.
            sb.Append(System.Text.Encoding.UTF8.GetString(buffer, 0, bytesRead));
        }

        httpRes.Close();

        return sb.ToString();
    }
    catch (Exception)
    {
        return "";
    }
}



Cheers,
Andrew
 
Cheers for the info Chris, appreciated.

At the minute I'm really looking for what I'm doing wrong in my code,
though :-)

Anyone any ideas?


Chris said:
Try using Simon Mourier's Html Agility Pack:
http://www.codeplex.com/Wiki/View.aspx?ProjectName=htmlagilitypack

It's excellent for screen scraping and very easy to use. It'll also
parse malformed HTML and pages containing code.
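
For reference, a minimal sketch of the Agility Pack in use (untested here;
API names are as documented by the library, and the URL is a placeholder):

// Requires a reference to HtmlAgilityPack.dll
HtmlAgilityPack.HtmlWeb web = new HtmlAgilityPack.HtmlWeb();
HtmlAgilityPack.HtmlDocument doc = web.Load("http://example.com/page.aspx");

// XPath queries work even against malformed HTML
HtmlAgilityPack.HtmlNode title = doc.DocumentNode.SelectSingleNode("//title");
if (title != null)
    Console.WriteLine(title.InnerText);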
 
In fact... it seems to break when I pass anything as a querystring
parameter...

e.g. url.aspx?param=val

Does that help at all?
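
One thing worth checking (just a guess from that symptom): if the value
contains spaces or characters like '&' or '=', it needs to be URL-encoded
before it goes into the querystring. A sketch, with hypothetical values:

// Unencoded values can corrupt the querystring the target page receives.
string val = "some value & more";
string safeUrl = "http://example.com/url.aspx?param="
    + System.Web.HttpUtility.UrlEncode(val);

// Or, without referencing System.Web (.NET 2.0 and later):
string safeUrl2 = "http://example.com/url.aspx?param="
    + Uri.EscapeDataString(val);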

 
Hi Andrew,

Not sure about your particular problem, but could you not write your
function as:

private string ReadHtmlFromUrl(string url)
{
    try
    {
        System.Net.WebRequest httpReq = System.Net.WebRequest.Create(url);
        System.Net.WebResponse httpRes = httpReq.GetResponse();

        System.IO.StreamReader result =
            new System.IO.StreamReader(httpRes.GetResponseStream());

        try
        {
            return result.ReadToEnd();
        }
        finally
        {
            httpRes.Close();
        }
    }
    catch (Exception)
    {
        return "";
    }
}

As in, use the built-in types and not worry about doing your own
buffering?

Damien
 
Cheers Damien - yeah, that does seem a little simpler :-)

I'm still fairly stumped on this one, mind - do I need to do anything
with the URL querystring data before I use it or something? It just
breaks.
 
Update:
I've just got back onto this and tried using the WebClient class
instead...

try
{
    System.Net.WebClient httpWeb = new System.Net.WebClient();

    return System.Text.Encoding.UTF8.GetString(httpWeb.DownloadData(url));
}
catch (Exception)
{
    return "";
}

I still get the same problem though :-( The ol' "The remote server
returned an error: (500) Internal Server Error" when the page works
fine when I browse to the actual page I'm trying to scrape.

I'm stumped - anyone out there? :-)

 

HTTP 500 Server Error means, um, a server error. Without any insight into
what happens on the server side, it's just guesswork ;-)
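
One way to get past the guesswork from the client side: when GetResponse()
throws on a 500, the WebException usually still carries the server's error
page. A sketch, reusing the url variable from the code above:

try
{
    System.Net.WebRequest req = System.Net.WebRequest.Create(url);
    using (System.Net.WebResponse res = req.GetResponse())
    using (System.IO.StreamReader reader =
               new System.IO.StreamReader(res.GetResponseStream()))
    {
        string html = reader.ReadToEnd();
    }
}
catch (System.Net.WebException ex)
{
    // On a 500, ex.Response typically holds the server's error page -
    // reading it shows the real error instead of just the status code.
    if (ex.Response != null)
    {
        using (System.IO.StreamReader err =
                   new System.IO.StreamReader(ex.Response.GetResponseStream()))
        {
            string serverError = err.ReadToEnd();   // log or inspect this
        }
    }
}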

Having said that, you should send at least the User-Agent, Accept-Encoding,
and Accept HTTP headers.
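
With HttpWebRequest the common headers have typed properties; a sketch
(assuming .NET 2.0 or later for AutomaticDecompression):

System.Net.HttpWebRequest req =
    (System.Net.HttpWebRequest)System.Net.WebRequest.Create(url);

// Typed properties for the restricted headers
req.UserAgent = "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.2)";
req.Accept = "*/*";

// Sends Accept-Encoding and transparently decompresses the response
req.AutomaticDecompression =
    System.Net.DecompressionMethods.GZip | System.Net.DecompressionMethods.Deflate;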

Cheers,
 
Thanks for the reply Joerg.

Just letting you know that it was a simple case of a user control used
by the page that was failing when it read info from the Request object.
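
That kind of failure typically happens because a WebRequest doesn't send
everything a browser does, so Request values the control assumed were
present come back null. A guarded read might look like this (the key
names here are hypothetical):

// Inside the user control: guard against values that are absent when the
// page is fetched by WebRequest/WebClient instead of a browser.
string param = Request.QueryString["param"];   // may be null
if (param != null && param.Length > 0)
{
    // ... use param ...
}

// Request.UrlReferrer is null when no Referer header was sent
string referrer = (Request.UrlReferrer != null)
    ? Request.UrlReferrer.ToString()
    : "";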

But I've got another problem now that I've uploaded it to the live server
(it worked on my machine) and loaded up a trace.axd page with my
errors written into it:

The underlying connection was closed: Unable to connect to the remote
server

Now, I'm not passing any HTTP headers - will this cause a problem on a
secured web server, and if so, why? :-)

Cheers,
Knoxy
 
PS: by adding those headers...

httpWeb.Headers.Add("user-agent",
    "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.2; .NET CLR 1.0.3705;)");
httpWeb.Headers.Add("accept", "*/*");
httpWeb.Headers.Add("accept-encoding", "gzip, deflate");

... it still came back with the error:

The underlying connection was closed: Unable to connect to the remote
server

Any ideas why this might be different on the live server as opposed to
my dev machine?

Regards,
Knoxy
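
One frequent difference between a dev box and a locked-down live server is
outbound access: the server may only reach the web through a proxy, or may
not be able to resolve its own public hostname. If a proxy is the issue,
pointing the WebClient at it is worth a try (a sketch; the address is a
placeholder):

System.Net.WebClient httpWeb = new System.Net.WebClient();

// Route the request through the server's outbound proxy
httpWeb.Proxy = new System.Net.WebProxy("http://proxy.example.com:8080");

// Or inherit whatever the machine/web.config is configured to use (.NET 2.0+):
// httpWeb.Proxy = System.Net.WebRequest.DefaultWebProxy;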
 