Scraped content via WebRequest: Fixing mis-rendered characters systemically?

Ken Fine

I have a portion of a web page that I am scraping via .NET's WebRequest
object. Code and page URL are below. Some characters are being mis-rendered
when the string representing the page portion is returned: these are various
entity characters that do not translate correctly into renderable HTML. Can
someone suggest a systemic way, built into the .NET framework's Text
classes, to fix this so it renders correctly on a web page?

Thanks,
-KF

using System;
using System.IO;
using System.Net;

public partial class UweekHome : System.Web.UI.Page
{
    protected void Page_Load(object sender, EventArgs e)
    {
        litHTMLfromScrapedPage.Text = GetHtmlPage("http://uweek.org");
    }

    public String GetHtmlPage(string strURL)
    {
        // the html retrieved from the page
        String strResult;
        WebResponse objResponse;
        WebRequest objRequest = System.Net.HttpWebRequest.Create(strURL);
        objResponse = objRequest.GetResponse();

        // No encoding is passed here, so StreamReader decodes the bytes as UTF-8.
        using (StreamReader sr =
            new StreamReader(objResponse.GetResponseStream()))
        {
            strResult = sr.ReadToEnd();
            int pos1 = strResult.IndexOf("<slstart>", 0);
            int pos2 = strResult.IndexOf("<storylist>", pos1);
            int pos3 = strResult.IndexOf("</storylist>", pos2);
            // Keep only the text between <storylist> (11 characters) and </storylist>.
            strResult = strResult.Substring(pos2 + 11, pos3 - pos2 - 11);
        }

        return strResult;
    }
}
 
Some characters are being mis-rendered
when the string representing the page portion is returned: these are various
entity characters that do not translate correctly into renderable HTML.

Can you give an example?
 
Em-dashes, En-dashes, curly quotes, and the like:

Didja hear the one about the economist who became a stand-up comic? Yoram
Bauman is an instructor for the Program on the Environment, but most Tuesday
nights find him at the Comedy Underground, cracking wise about ? no joke ?
economics.

When the Husky men?s basketball team heads to Greece Aug. 27 for a series of
exhibition games, they?ll be traveling with Socrates. That?s because in
their off-court time, they?ll take part in a classics class that focuses on
the man who is often called the father of western philosophy.
 
Hi Ken

Try setting the encoding explicitly:

Encoding enc = Encoding.GetEncoding(1252);
using (StreamReader sr =
    new StreamReader(objResponse.GetResponseStream(), enc)) {
.....

Hope it helps
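
For anyone wondering why the encoding matters: here is a minimal sketch, not taken from the original code, that decodes the same bytes both ways. The class and method names (EncodingDemo, DecodeAs) are made up for illustration. It assumes the classic .NET Framework, where code page 1252 is available out of the box; on .NET Core and later you would first need to register the code pages encoding provider.

```csharp
using System;
using System.Text;

public static class EncodingDemo
{
    // Hypothetical helper: decode raw bytes with the given encoding.
    public static string DecodeAs(byte[] bytes, Encoding encoding)
    {
        return encoding.GetString(bytes);
    }

    public static void Main()
    {
        // "men’s" as a Windows-1252 server would send it:
        // 0x92 is the curly apostrophe in that code page.
        byte[] raw = { 0x6D, 0x65, 0x6E, 0x92, 0x73 };

        // 0x92 on its own is not a valid UTF-8 sequence, so the decoder
        // substitutes U+FFFD, the Unicode replacement character.
        string asUtf8 = DecodeAs(raw, Encoding.UTF8);

        // Decoded as 1252, the same byte becomes U+2019 (right single quote).
        string as1252 = DecodeAs(raw, Encoding.GetEncoding(1252));

        Console.WriteLine(asUtf8[3] == '\uFFFD');
        Console.WriteLine(as1252[3] == '\u2019');
    }
}
```

The replacement character is what typically ends up displayed as "?" once the string is written back out to the page.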
 
Hi Ken,

I agree with Alexey here that you need to specify the encoding when reading
the response, since StreamReader uses UTF-8 by default.

However, hard-coding 1252 only works for pages served in Windows-1252 (the
Western European code page). The more reliable way is to get the correct
encoding from HttpWebResponse. Please refer to the following discussion
thread for more information:

http://forums.microsoft.com/MSDN/ShowPost.aspx?PostID=58840&SiteID=1

Regards,
Walter Wang
Microsoft Online Community Support
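
A rough sketch of what reading the encoding from HttpWebResponse could look like is below. This is not the code from the linked thread. ScrapeHelper and ResolveEncoding are hypothetical names, and falling back to 1252 when no usable charset is reported is an assumption that suits this particular page rather than a general rule.

```csharp
using System;
using System.IO;
using System.Net;
using System.Text;

public static class ScrapeHelper
{
    // Hypothetical helper: map a charset name (as reported in the
    // Content-Type header) to an Encoding, or use the fallback when the
    // name is missing or not recognized.
    public static Encoding ResolveEncoding(string charset, Encoding fallback)
    {
        if (string.IsNullOrEmpty(charset))
        {
            return fallback;
        }

        try
        {
            // Some servers wrap the charset value in quotes.
            return Encoding.GetEncoding(charset.Trim().Trim('"'));
        }
        catch (ArgumentException)
        {
            // Unknown charset name.
            return fallback;
        }
    }

    public static string GetHtmlPage(string url)
    {
        WebRequest request = WebRequest.Create(url);
        using (WebResponse response = request.GetResponse())
        {
            // CharacterSet is only available on HTTP responses.
            HttpWebResponse http = response as HttpWebResponse;
            string charset = http != null ? http.CharacterSet : null;

            // Assumption: fall back to Windows-1252 for this site.
            Encoding enc = ResolveEncoding(charset, Encoding.GetEncoding(1252));

            using (StreamReader sr =
                new StreamReader(response.GetResponseStream(), enc))
            {
                return sr.ReadToEnd();
            }
        }
    }
}
```

One caveat: as far as I know, CharacterSet reflects only the Content-Type header, so a page that declares its encoding solely in a meta tag will not be detected this way, and on the .NET Framework the property may report ISO-8859-1 when the header names no charset at all.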

 
But the page he is trying to get is encoded in 1252.
 