Web page screen scraping?

Ronald S. Cook

I've been asked to extract data from web pages. Given that they are
rendered in HTML and not any sort of XML I'm wondering how to go about
"scraping" such a web page of data.

Can anyone give me any starting place?

Thanks,
Ron
 
Guest

Hi Ronald,
What you basically have to do is use an object like the HttpWebRequest class
(in System.Net), which will pull back the web page's HTML for you to process.
When the results come back you will have to parse them. Since every website
is different, you will have to write a custom scraper for each site you want
to scrape. Scraping involves locating the pieces of information in the HTML
that you want to extract. If you get lucky and the web page is well-formed
XHTML, you can use the standard System.Xml objects to parse it and find the
information you want, which should be pretty simple. If the web page is not
XHTML compliant, you will have to perform some string manipulation, using
regular expressions or just plain old coding, to find the correct location
in the HTML string.
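
The two parsing routes described above could be sketched roughly as follows. This is only an illustration, written in Python rather than .NET, and everything named in it (SAMPLE_PAGE, fetch_page, extract_prices and the table layout) is made up for the example. It tries an XML parser first, which only works when the page is well-formed XHTML, and falls back to a regular expression when parsing fails.

```python
import re
import urllib.request
import xml.etree.ElementTree as ET

# Hypothetical, well-formed XHTML page used only for illustration.
SAMPLE_PAGE = """<html xmlns="http://www.w3.org/1999/xhtml">
  <body>
    <table id="prices">
      <tr><td class="name">Widget</td><td class="price">9.99</td></tr>
      <tr><td class="name">Gadget</td><td class="price">24.50</td></tr>
    </table>
  </body>
</html>"""

XHTML_NS = {"x": "http://www.w3.org/1999/xhtml"}

# Pattern for the fallback route; it is tied to this one page layout.
ROW_PATTERN = re.compile(
    r'<td class="name">(.*?)</td>\s*<td class="price">(.*?)</td>',
    re.IGNORECASE | re.DOTALL,
)


def fetch_page(url):
    """Download a page and return its HTML as text (not called in this sketch)."""
    with urllib.request.urlopen(url) as response:
        charset = response.headers.get_content_charset() or "utf-8"
        return response.read().decode(charset, errors="replace")


def extract_prices_xml(page):
    """Route 1: the page is well-formed XHTML, so an XML parser can read it."""
    root = ET.fromstring(page)
    result = []
    for row in root.findall(".//x:tr", XHTML_NS):
        name = row.find("x:td[@class='name']", XHTML_NS)
        price = row.find("x:td[@class='price']", XHTML_NS)
        if name is not None and price is not None:
            result.append((name.text, float(price.text)))
    return result


def extract_prices_regex(page):
    """Route 2: the page is not well-formed, so match the raw string."""
    return [(name.strip(), float(price))
            for name, price in ROW_PATTERN.findall(page)]


def extract_prices(page):
    """Try the XML route first and fall back to string matching."""
    try:
        return extract_prices_xml(page)
    except ET.ParseError:
        return extract_prices_regex(page)


if __name__ == "__main__":
    print(extract_prices(SAMPLE_PAGE))
```

Either route is tied to the layout of one particular page, which is why each site ends up needing its own scraper.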

Hope that helps
Mark Dawson
http://www.markdawson.org
 
Scott C

I used an open source HTML parser a while back, but can't find it now.
I did find this, however, though I can't say I have any experience with it.

http://www.codeproject.com/dotnet/apmilhtml.asp
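
To give an idea of what a tolerant HTML parser buys you, here is a small sketch using Python's standard html.parser module. It is not the library linked above, and the names in it (CellCollector, scrape_rows, MESSY_PAGE) are made up for the example. It collects table cell text even when the markup has unclosed tags and mixed-case tag names, which an XML parser would reject.

```python
from html.parser import HTMLParser

# Hypothetical page with unclosed <TD> tags and mixed-case tag names.
MESSY_PAGE = (
    "<TABLE><TR><TD>Widget<TD>9.99</TR>\n"
    "<tr><td>Gadget</td><td>24.50<br></td></tr></table>"
)


class CellCollector(HTMLParser):
    """Collects the text of every <td> cell, grouped by table row."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._in_cell = False

    def handle_starttag(self, tag, attrs):
        # HTMLParser passes tag names in lower case.
        if tag == "tr":
            self.rows.append([])
            self._in_cell = False
        elif tag == "td":
            if not self.rows:
                self.rows.append([])  # a cell that appears before any <tr>
            self.rows[-1].append("")
            self._in_cell = True

    def handle_endtag(self, tag):
        if tag in ("td", "tr"):
            self._in_cell = False

    def handle_data(self, data):
        # Only text inside a cell is kept.
        if self._in_cell:
            self.rows[-1][-1] += data


def scrape_rows(page):
    """Return the cell text of each table row as a list of lists."""
    parser = CellCollector()
    parser.feed(page)
    parser.close()
    return [[cell.strip() for cell in row] for row in parser.rows if row]


if __name__ == "__main__":
    print(scrape_rows(MESSY_PAGE))
```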


scott
 
