HTML Screen Scraping Q

  • Thread starter: George Durzi

George Durzi

I'd like to screen-scrape company news from cbsmarketwatch. Consider this
URL as an example:
http://cbs.marketwatch.com/tools/quotes/news.asp?symb=MSFT
When you browse there, there are two sections: 1. News Headlines for
Microsoft Corporation, and 2. Press Releases about Microsoft Corporation.

I've already written the code to post to the page and grab the HTML into a
string. If you browse the source of the above linked webpage, here's an
excerpt of how the news headlines would look:

<TABLE WIDTH="100%" CELLPADDING="0" CELLSPACING="0" border="0" ID="Table1">
<?xml version="1.0" encoding="UTF-16" ?>
<TR class="tb01">
<TD COLSPAN="4" height="20">
<A class="lk03"
href="/tools/quotes/news.asp?siteid=mktw&symb=MSFT&amp;property=sid&amp;value=3140&amp;doctype=2006">News Headlines for Microsoft Corporation (MSFT)</A>
</TD>
</TR>
<TR>
<TD NOWRAP="TRUE" width="110" valign="top">12:58pm 02/13/04</TD>
<TD valign="top">
<A class="lk01"
HREF="/news/story.asp?guid=%7B01470A47%2D936B%2D444D%2DB6FC%2DD111A9E61EE4%7D&amp;siteid=mktw&amp;">Market Snapshot</A>
</td>
</TR>
</TABLE>

What I'd like to do is create a dataset (or anything else I can bind to a
datagrid) containing the news items.
I noticed that the table enclosing the news items contains an XML
declaration: <?xml version="1.0" encoding="UTF-16" ?>
Would this give me an easy way to navigate this HTML?
What tools can I use to do this? Regular expressions?

Any tips are greatly appreciated.
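
For what it's worth, here is a rough sketch of one way the excerpt above could be turned into rows of (time, title, link). It is in Python purely as an illustration, not as the .NET answer, and it uses a tolerant HTML parser rather than an XML parser or regular expressions: the XML declaration sits in the middle of the table, the tag case is mixed (<TD> ... </td>), and one href has a bare &, so a strict XML parser would most likely reject the markup. The names HeadlineParser, parse_headlines, SAMPLE and BASE_URL are made up for this example, and the row layout (two cells, with an lk01 link in the second) is assumed from the excerpt only; the real page may differ.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin

# Assumption: relative links are resolved against the page URL from the post.
BASE_URL = "http://cbs.marketwatch.com/tools/quotes/news.asp?symb=MSFT"

# Trimmed copy of the excerpt, used only to exercise the sketch.
SAMPLE = """
<TABLE WIDTH="100%" ID="Table1">
<?xml version="1.0" encoding="UTF-16" ?>
<TR class="tb01"><TD COLSPAN="4"><A class="lk03" href="/tools/quotes/news.asp">News Headlines</A></TD></TR>
<TR>
<TD NOWRAP="TRUE" width="110" valign="top">12:58pm 02/13/04</TD>
<TD valign="top">
<A class="lk01"
HREF="/news/story.asp?guid=%7B01470A47%2D936B%2D444D%2DB6FC%2DD111A9E61EE4%7D&amp;siteid=mktw&amp;">Market Snapshot</A>
</td>
</TR>
</TABLE>
"""


class HeadlineParser(HTMLParser):
    """Hypothetical helper: collects headline rows shaped like the excerpt."""

    def __init__(self):
        super().__init__()
        self.items = []       # finished headline rows
        self._cells = None    # text of each <td> in the current <tr>
        self._href = None     # href of the lk01 link in the current row
        self._in_cell = False

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag and attribute names, so mixed case is fine.
        attrs = dict(attrs)
        if tag == "tr":
            self._cells, self._href = [], None
        elif tag == "td" and self._cells is not None:
            self._cells.append("")
            self._in_cell = True
        elif tag == "a" and self._in_cell and attrs.get("class") == "lk01":
            self._href = attrs.get("href")

    def handle_data(self, data):
        if self._in_cell:
            self._cells[-1] += data

    def handle_endtag(self, tag):
        if tag == "td":
            self._in_cell = False
        elif tag == "tr" and self._cells is not None:
            # Assumed layout: a headline row has two cells and an lk01 link;
            # the section header row has one cell and an lk03 link.
            if len(self._cells) == 2 and self._href:
                when, title = (cell.strip() for cell in self._cells)
                self.items.append({"time": when, "title": title, "href": self._href})
            self._cells = None
            self._in_cell = False


def parse_headlines(html, base_url):
    """Return a list of dicts with time, title and an absolute url."""
    parser = HeadlineParser()
    parser.feed(html)
    parser.close()
    return [
        {"time": i["time"], "title": i["title"], "url": urljoin(base_url, i["href"])}
        for i in parser.items
    ]


for row in parse_headlines(SAMPLE, BASE_URL):
    print(row["time"], row["title"], row["url"])
```

A list of plain records like this is the kind of thing that could then be loaded into a table and bound to a grid. Matching on the lk01 class is brittle, so it would probably break whenever the site changes its markup.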
 
I hear ya ... What I was trying to do at first was find an RSS feed which
took in a stock ticker as a parameter and gave me back some news headlines.
All I could find when I chased that one down was a beta RSS feed that a
developer at Yahoo had created. Unfortunately, it's no longer around, and
there's nothing like it.

This is gonna be used for an intranet application, and the news headlines
are gonna link to the actual cbsmarketwatch pages. I will also be crediting
the source of the newsfeed.

Hopefully that should cover it.
 