How to remove HTML special characters and tags from a Google RSS feed?

  • Thread starter: JimTheAverage

JimTheAverage

I want to write myself a little app that grabs rss feeds from google
and makes them human readable for me - sending me text messages of
words that I list as interesting in the app.

The problem is all of the special characters and HTML tags in the
data. Here is an example....

<table border="0" cellpadding="2" cellspacing="7" style="vertical-
align:top;"><tr><td width="80" align="center" valign="top"><font
style="font-size:85%;font-family:arial,sans-serif"><a href="http://
news.google.com/news/url?fd=R&amp;sa=T&amp;url=http%3A%2F
%2Fwww.google.com%2Fhostednews%2Fafp%2Farticle
%2FALeqM5gUDc1OjvV66ZWAJ0sVkcvxEAeC_g&amp;usg=AFQjCNHOZ9w9BRN3suaUKX5NtYyvXwu1Hg"><img
src="http://nt0.ggpht.com/news/tbn/JId-xOOMV8PujM/6.jpg" alt=""
border="1" width="80" height="80" /><br /><font size="-2">AFP</font></
a></font></td><td valign="top" class="j"><font style="font-size:
85%;font-family:arial,sans-serif"><br /><div style="padding-top:
0.8em;"><img alt="" height="1" width="1" /></div><div class="lh"><a
href="http://news.google.com/news/url?fd=R&amp;sa=T&amp;url=http:/
%2Fwww.bloomberg.com%2Fapps%2Fnews%3Fpid%3D20601087%26sid
%3DaTDY1sk77rm0%26pos%3D6&amp;usg=AFQjCNFpZEDNQOe3IGg-ywaHRs4QbhY0-
g"> said:
<font size="-1"><b><font color="#6f6f6f">Bloomberg</font></b></
font><br /><font size="-1">Nov. 16 (Bloomberg) -- Mitsubishi UFJ
Financial Group Inc. may announce Japan's biggest secondary share
sale this week as it prepares for stricter global capital rules,
according to a survey of analysts. The nation's largest bank by
<b>...</b></font><br /><font size="-1"><a href="http://news.google.com/
news/url?fd=R&amp;sa=T&amp;url=http%3A%2F%2Fwww.reuters.com%2Farticle
%2FrbssFinancialServicesAndRealEstateNews
%2FidUST33395120091116&amp;usg=AFQjCNEmza8jguldmVwYZHQbCgrNWeEr9g">MUFG
shares fall nearly 5 pct on share issue plan</a><font size="-1"
color="#6f6f6f"><nobr>Reuters</nobr></font></font><br /><font
size="-1"><a href="http://news.google.com/news/url?
fd=R&amp;sa=T&amp;url=http%3A%2F%2Fonline.wsj.com%2Farticle
%2FSB10001424052748704431804574537433259423484.html%3Fmod
%3Dgooglenews_wsj&amp;usg=AFQjCNHnBOh1rt7oohVtBHRwwRmTYEr4Qg">Mitsubishi
UFJ Weighs Raising $11 Billion said:
fd=R&amp;sa=T&amp;url=http%3A%2F%2Fwww.thestreet.com%2Fstory
%2F10626902%2F1%2Fmitsubishi-ufj-ponders-stock-sale.html%3Fcm_ven
%3DGOOGLEFI&amp;usg=AFQjCNGhb0g_PzOJYIRcpSDeA6Xv7hBu8w">Mitsubishi UFJ
Ponders Stock Sale</a><font size="-1"
color="#6f6f6f"><nobr>TheStreet.com</nobr></font></font><br /><font
size="-1" class="p"><a href="http://news.google.com/news/url?
fd=R&amp;sa=T&amp;url=http%3A%2F%2Fwww.marketwatch.com%2Fstory
%2Fmitsubishi-ufj-plans-11-bln-share-issue-
report-2009-11-14&amp;usg=AFQjCNHCJpoOUHe_gaVZIV1gk0j42MkDPg"><nobr>MarketWatch</
nobr></a>&nbsp;-<a href="http://news.google.com/news/url?
fd=R&amp;sa=T&amp;url=http%3A%2F%2Ftopnews.us%2Fcontent%2F28359-mufg-
looking-raise-11-billion-through-public-offering-common-
shares&amp;usg=AFQjCNESYgvq1nwyiLyos9-_bdhvmHn18w"><nobr>TopNews
United States</nobr></a>&nbsp;-<a href="http://news.google.com/news/
url?fd=R&amp;sa=T&amp;url=http%3A%2F%2Fwww.forexyard.com%2Fen
%2Freuters_inner.tpl%3Faction
%3D2009-11-16T013439Z_01_T122800_RTRIDST_0_MARKETS-JAPAN-STOCKS-
UPDATE-2&amp;usg=AFQjCNFXGg95rPlOBIA2bKc0jd2Q3yFk2g"><nobr>Forexyard</
nobr></a></font><br /><font class="p" size="-1"><a class="p"
href="http://news.google.com/news/more?
ned=us&amp;topic=b&amp;ncl=dPy1T0LK7URGP1MyRL2FQL4se2MkM"><nobr><b>all
69 news articles&nbsp;&raquo;</b></nobr></a></font></div></font></td></
tr></table>

I want to make this human readable. To do so, I first need to replace
all HTML special characters with their human-readable equivalents and
then remove all HTML tags.

A quick look at http://www.degraeve.com/reference/specialcharacters.php
shows that removing all of these special chars is no simple task.

Is there an easy way of doing this that I am overlooking?
 

Perhaps the easiest way to parse HTML is to use a (hidden) WebBrowser
control on a form, e.g.:

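// Requires: using System.Windows.Forms;
// Inside a method of the form (e.g. its constructor):
{
    WebBrowser webBrowser = new WebBrowser();
    webBrowser.DocumentCompleted += OnDocumentCompleted;
    // The literal is split with "+" because a C# string cannot span lines.
    webBrowser.DocumentText =
        "<html><body><table><tr><td><a href=\"test.html\">" +
        "<b>blabla &gt;&amp;&lt; bla</b></a></td></tr></table></body></html>";
}

private void OnDocumentCompleted(object sender, WebBrowserDocumentCompletedEventArgs e)
{
    // OuterText is the body's text content, without the markup.
    MessageBox.Show(((WebBrowser)sender).Document.Body.OuterText);
}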

Another (easy) possibility is to use the Html Agility Pack (a free
library available on the web):

using ( StringWriter sw = new StringWriter() )
{
    new HtmlToText().ConvertHtml(htmlString, sw);
    sw.Flush();
    return sw.ToString();
}
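
If neither a browser control nor an extra library is an option, a third route is to do exactly what the question describes with the base class library alone. The following is only a rough sketch under assumptions: the class and method names (PlainTextSketch, ToPlainText) are made up for illustration, and stripping tags with a regular expression is a heuristic that can fail on attribute values containing ">" or on script/style blocks. It removes the tags first and decodes the entities afterwards, so an escaped "&lt;b&gt;" in the text is not mistaken for markup.

```csharp
using System;
using System.Net;
using System.Text.RegularExpressions;

// Hypothetical helper for illustration; not part of any library mentioned in this thread.
public static class PlainTextSketch
{
    // Matches anything that looks like a tag, e.g. <td width="80"> or </a>.
    private static readonly Regex Tag = new Regex("<[^>]+>", RegexOptions.Compiled);

    // Matches runs of whitespace, including the non-breaking space produced by &nbsp;.
    private static readonly Regex Space = new Regex(@"[\s\u00A0]+", RegexOptions.Compiled);

    public static string ToPlainText(string html)
    {
        // 1. Replace each tag with a space so neighbouring words do not run together.
        string noTags = Tag.Replace(html, " ");

        // 2. Decode entities such as &amp;, &nbsp; and &raquo; once the tags are gone.
        string decoded = WebUtility.HtmlDecode(noTags);

        // 3. Collapse the leftover whitespace into single spaces.
        return Space.Replace(decoded, " ").Trim();
    }
}
```

Calling PlainTextSketch.ToPlainText on a fragment of the feed such as `<b>all 69 news articles&nbsp;&raquo;</b>` should give `all 69 news articles »`. For anything beyond simple feed snippets, a real HTML parser such as the one above is the safer choice.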

Hope it helps.
 
Oops ... sorry, I forgot to include the "HtmlToText" class for the Html
Agility Pack:

using System.IO;
using HtmlAgilityPack;

public class HtmlToText
{
    public HtmlToText()
    {
    }

    // Converts the HTML file at the given path to plain text.
    public void Convert(string path, StringWriter sw)
    {
        HtmlDocument doc = new HtmlDocument();
        doc.Load(path);
        ConvertTo(doc.DocumentNode, sw);
    }

    // Converts an HTML string to plain text.
    public void ConvertHtml(string html, StringWriter sw)
    {
        HtmlDocument doc = new HtmlDocument();
        doc.LoadHtml(html);
        ConvertTo(doc.DocumentNode, sw);
    }

    // Recurses into all child nodes of the given node.
    private void ConvertContentTo(HtmlNode node, TextWriter outText)
    {
        foreach (HtmlNode subnode in node.ChildNodes)
            ConvertTo(subnode, outText);
    }

    public void ConvertTo(HtmlNode node, TextWriter outText)
    {
        string html;
        switch (node.NodeType)
        {
            case HtmlNodeType.Comment:
                // don't output comments
                break;

            case HtmlNodeType.Document:
                ConvertContentTo(node, outText);
                break;

            case HtmlNodeType.Text:
                // script and style must not be output
                string parentName = node.ParentNode.Name;
                if ((parentName == "script") || (parentName == "style"))
                    break;

                // get text
                html = ((HtmlTextNode)node).Text;

                // is it in fact a special closing node output as text?
                if (HtmlNode.IsOverlappedClosingElement(html))
                    break;

                // check the text is meaningful and not a bunch of whitespaces
                if (html.Trim().Length > 0)
                    outText.Write(HtmlEntity.DeEntitize(html));
                break;

            case HtmlNodeType.Element:
                switch (node.Name)
                {
                    case "p":
                        // treat paragraphs as crlf
                        outText.Write("\r\n");
                        break;
                }

                if (node.HasChildNodes)
                    ConvertContentTo(node, outText);

                break;
        }
        // separate the output of consecutive nodes with a space
        outText.Write(" ");
    }
}

The code is based on an example included with the Html Agility Pack.
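
To connect this back to the original question, the HTML shown there presumably sits inside the description element of each feed item. The sketch below shows one way the descriptions might be pulled out with LINQ to XML before being handed to a converter. It assumes a plain RSS 2.0 layout (item and description elements without a namespace), and the names FeedSketch and GetDescriptions are hypothetical. Note that the XML parser already undoes one level of escaping, so what comes back is HTML that still contains its own entities such as &amp;amp;.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Xml.Linq;

// Hypothetical helper for illustration; assumes an RSS 2.0 document.
public static class FeedSketch
{
    // Returns the HTML found in the <description> of every <item>.
    // Items without a description yield an empty string.
    public static List<string> GetDescriptions(string rssXml)
    {
        XDocument doc = XDocument.Parse(rssXml);
        return doc.Descendants("item")
                  .Select(item => (string)item.Element("description") ?? string.Empty)
                  .ToList();
    }
}
```

Each returned string could then be passed to new HtmlToText().ConvertHtml(description, sw) from the class above to obtain the readable text.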
 