How to remove HTML special characters and tags from a Google RSS feed?

JimTheAverage · Nov 16, 2009

I want to write myself a little app that grabs RSS feeds from Google
and makes them human-readable for me, sending me text messages for
words that I list as interesting in the app.

The problem is all of the special characters and HTML tags in the
data. Here is an example:

<table border="0" cellpadding="2" cellspacing="7" style="vertical-
align:top;"><tr><td width="80" align="center" valign="top"><a href="http://
news.google.com/news/url?fd=R&sa=T&url=http%3A%2F
%2Fwww.google.com%2Fhostednews%2Fafp%2Farticle
%2FALeqM5gUDc1OjvV66ZWAJ0sVkcvxEAeC_g&usg=AFQjCNHOZ9w9BRN3suaUKX5NtYyvXwu1Hg"><img
src="http://nt0.ggpht.com/news/tbn/JId-xOOMV8PujM/6.jpg" alt=""
border="1" width="80" height="80" /> AFP</
a></td><td valign="top" class="j"> <div style="padding-top:
0.8em;"><img alt="" height="1" width="1" /></div><div class="lh"><a
href="http://news.google.com/news/url?fd=R&sa=T&url=http:/
%2Fwww.bloomberg.com%2Fapps%2Fnews%3Fpid%3D20601087%26sid
%3DaTDY1sk77rm0%26pos%3D6&usg=AFQjCNFpZEDNQOe3IGg-ywaHRs4QbhY0-
g">
Bloomberg</
font> Nov. 16 (Bloomberg) -- Mitsubishi UFJ
Financial Group Inc. may announce Japan's biggest secondary share
sale this week as it prepares for stricter global capital rules,
according to a survey of analysts. The nation's largest bank by
... <a href="http://news.google.com/
news/url?fd=R&sa=T&url=http%3A%2F%2Fwww.reuters.com%2Farticle
%2FrbssFinancialServicesAndRealEstateNews
%2FidUST33395120091116&usg=AFQjCNEmza8jguldmVwYZHQbCgrNWeEr9g">MUFG
shares fall nearly 5 pct on share issue plan</a>Reuters <a href="http://news.google.com/news/url?
fd=R&sa=T&url=http%3A%2F%2Fonline.wsj.com%2Farticle
%2FSB10001424052748704431804574537433259423484.html%3Fmod
%3Dgooglenews_wsj&usg=AFQjCNHnBOh1rt7oohVtBHRwwRmTYEr4Qg">Mitsubishi
UFJ Weighs Raising $11 Billion
<a href="http://news.google.com/news/url?
fd=R&sa=T&url=http%3A%2F%2Fwww.thestreet.com%2Fstory
%2F10626902%2F1%2Fmitsubishi-ufj-ponders-stock-sale.html%3Fcm_ven
%3DGOOGLEFI&usg=AFQjCNGhb0g_PzOJYIRcpSDeA6Xv7hBu8w">Mitsubishi UFJ
Ponders Stock Sale</a>TheStreet.com <a href="http://news.google.com/news/url?
fd=R&sa=T&url=http%3A%2F%2Fwww.marketwatch.com%2Fstory
%2Fmitsubishi-ufj-plans-11-bln-share-issue-
report-2009-11-14&usg=AFQjCNHCJpoOUHe_gaVZIV1gk0j42MkDPg">MarketWatch</a> -<a href="http://news.google.com/news/url?
fd=R&sa=T&url=http%3A%2F%2Ftopnews.us%2Fcontent%2F28359-mufg-
looking-raise-11-billion-through-public-offering-common-
shares&usg=AFQjCNESYgvq1nwyiLyos9-_bdhvmHn18w">TopNews
United States</a> -<a href="http://news.google.com/news/
url?fd=R&sa=T&url=http%3A%2F%2Fwww.forexyard.com%2Fen
%2Freuters_inner.tpl%3Faction
%3D2009-11-16T013439Z_01_T122800_RTRIDST_0_MARKETS-JAPAN-STOCKS-
UPDATE-2&usg=AFQjCNFXGg95rPlOBIA2bKc0jd2Q3yFk2g">Forexyard</a> <a class="p"
href="http://news.google.com/news/more?
ned=us&topic=b&ncl=dPy1T0LK7URGP1MyRL2FQL4se2MkM">all
69 news articles »</a></div></td></
tr></table>

I want to make this human-readable. To do so, I first need to replace
all HTML special characters with their human-readable equivalents and
then remove all HTML tags.

A quick look at http://www.degraeve.com/reference/specialcharacters.php
shows that replacing all of these special characters is no simple task.

Is there an easy way of doing this that I am overlooking?
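
The two-step plan described above can be sketched without any extra
library. This is only a rough illustration, not a real HTML parser: the
class name FeedText and the method ToPlainText are made up for this
example, and the regular expression is a crude tag matcher that can be
fooled by unusual markup. Note that it strips the tags first and decodes
the entities second, so that escaped text such as &lt;b&gt; is not
turned into a tag and then deleted.

```csharp
using System;
using System.Net;
using System.Text.RegularExpressions;

// Hypothetical helper (not from the thread): crude HTML-to-text conversion.
public static class FeedText
{
    public static string ToPlainText(string html)
    {
        // 1. Replace every tag with a space so neighbouring words do not fuse.
        string noTags = Regex.Replace(html, "<[^>]*>", " ");

        // 2. Decode entities (&amp;, &raquo;, &#39;, ...) once the tags are gone.
        string decoded = WebUtility.HtmlDecode(noTags);

        // 3. Collapse the whitespace left behind by the removed tags.
        return Regex.Replace(decoded, @"\s+", " ").Trim();
    }

    public static void Main()
    {
        // Sample fragment modelled on the feed data above.
        string sample = "<div class=\"lh\"><a class=\"p\" href=\"#\">" +
                        "all 69 news articles &raquo;</a></div>";
        Console.WriteLine(ToPlainText(sample));
    }
}
```

WebUtility.HtmlDecode knows the named and numeric entities, so no
hand-written replacement table is needed. On older framework versions
without System.Net.WebUtility, HttpUtility.HtmlDecode from System.Web
plays the same role.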

ib.dangelmeyr · Nov 18, 2009

> Is there an easy way of doing this that I am overlooking?

Perhaps the easiest way to parse HTML is to use a (hidden) WebBrowser
control in a form, e.g.:

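// Requires: using System.Windows.Forms;
// Place this block in the form, e.g. in its constructor or Load handler.
{
    WebBrowser webBrowser = new WebBrowser();
    webBrowser.DocumentCompleted += OnDocumentCompleted;
    // DocumentText loads asynchronously; DocumentCompleted fires when it is ready.
    webBrowser.DocumentText =
        "<html><body><table><tr><td><a href=\"test.html\">" +
        "blabla &gt;&amp;&lt; bla</a></td></tr></table></body></html>";
}

private void OnDocumentCompleted( object sender,
    WebBrowserDocumentCompletedEventArgs e )
{
    // OuterText is the rendered text: tags removed, entities decoded.
    MessageBox.Show( ((WebBrowser)sender).Document.Body.OuterText );
}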

Another easy option is the Html Agility Pack, a free open-source
library for parsing HTML:

using ( StringWriter sw = new StringWriter() )
{
    new HtmlToText().ConvertHtml(htmlString, sw);
    sw.Flush();
    return sw.ToString();
}

Hope it helps.

ib.dangelmeyr · Nov 18, 2009

Oops, sorry, I forgot to include the "HtmlToText" class that goes with
the Html Agility Pack:

// Requires a reference to the Html Agility Pack assembly.
using System.IO;
using HtmlAgilityPack;

public class HtmlToText
{
    public HtmlToText()
    {
    }

    // Converts the HTML file at the given path to plain text.
    public void Convert(string path, StringWriter sw)
    {
        HtmlDocument doc = new HtmlDocument();
        doc.Load(path);
        ConvertTo(doc.DocumentNode, sw);
    }

    // Converts an HTML string to plain text.
    public void ConvertHtml(string html, StringWriter sw)
    {
        HtmlDocument doc = new HtmlDocument();
        doc.LoadHtml(html);
        ConvertTo(doc.DocumentNode, sw);
    }

    // Recurses into every child of the given node.
    private void ConvertContentTo(HtmlNode node, TextWriter outText)
    {
        foreach (HtmlNode subnode in node.ChildNodes)
            ConvertTo(subnode, outText);
    }

    public void ConvertTo(HtmlNode node, TextWriter outText)
    {
        string html;
        switch (node.NodeType)
        {
            case HtmlNodeType.Comment:
                // don't output comments
                break;

            case HtmlNodeType.Document:
                ConvertContentTo(node, outText);
                break;

            case HtmlNodeType.Text:
                // script and style must not be output
                string parentName = node.ParentNode.Name;
                if ((parentName == "script") || (parentName == "style"))
                    break;

                // get text
                html = ((HtmlTextNode)node).Text;

                // is it in fact a special closing node output as text?
                if (HtmlNode.IsOverlappedClosingElement(html))
                    break;

                // check the text is meaningful and not a bunch of whitespaces
                if (html.Trim().Length > 0)
                    outText.Write(HtmlEntity.DeEntitize(html));
                break;

            case HtmlNodeType.Element:
                switch (node.Name)
                {
                    case "p":
                        // treat paragraphs as crlf
                        outText.Write("\r\n");
                        break;
                }

                if (node.HasChildNodes)
                    ConvertContentTo(node, outText);

                break;
        }
        // separate the output of consecutive nodes with a space
        outText.Write(" ");
    }
}

The code is based on an example included with the Html Agility Pack.
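
For small fragments like the feed snippet in the question, a shorter
route than the node-by-node converter above may be enough. The sketch
below assumes the Html Agility Pack assembly is referenced; the class
name HapExample and the method ToPlainText are made up for
illustration. Unlike HtmlToText, it does not filter out script or style
content, and depending on the library version the result may also
include comment text.

```csharp
using System;
using HtmlAgilityPack; // assumption: the Html Agility Pack is referenced

// Hypothetical helper (not from the thread).
public static class HapExample
{
    public static string ToPlainText(string html)
    {
        HtmlDocument doc = new HtmlDocument();
        doc.LoadHtml(html);

        // InnerText concatenates the text of all descendant nodes (tags dropped);
        // DeEntitize then turns entities such as &amp; into plain characters.
        return HtmlEntity.DeEntitize(doc.DocumentNode.InnerText);
    }

    public static void Main()
    {
        Console.WriteLine(ToPlainText("<p>Tom &amp; Jerry</p>"));
    }
}
```

The trade-off is control: the HtmlToText class decides per node what to
emit (for example, a line break before each paragraph), while InnerText
simply returns whatever text the document contains.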
