How to read contents of an HTML table with .NET?

I need to read the contents of an HTML table on a remote web page into
a variable. I guess this is called screen scraping, but I'm not sure. I'm not
sure where to start or what the best practices are. For instance, I have a
healthcare app that needs to check a government web page for a user's license
number periodically. There is no login, and I can put the user info in the
request URL without a problem, but I'm not sure how to read the response data
in the tables. Which namespace and class(es) should I be looking at?
Nothing jumped out at me under System.Web.
Thanks, Jim
 
Hello, Jim!

After receiving the table, you can parse it. You can use an XML parser for this (System.Xml).
--
Regards, Vadym Stetsyak
www: http://vadmyst.blogspot.com
 
Vadym Stetsyak said:
After receiving the table, you can parse it. You can use an XML parser for this
(System.Xml).

Beware that most web pages aren't well-formed, valid XML (HTML isn't as
strict as XML), so the XML parser might not work in that case.
Googling for "screen scraping .NET" should get you some alternatives.
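
To make the caveat concrete, here is a minimal sketch that assumes the table arrives as a well-formed XML fragment with no default namespace. The `TableReader` class and its `ReadRows` method are made-up names for illustration; only `XmlDocument`, `SelectNodes` and `XmlException` come from System.Xml.

```csharp
using System;
using System.Collections.Generic;
using System.Xml;

public static class TableReader
{
    // Hypothetical helper: returns one string array per <tr>,
    // holding the trimmed text of its <td>/<th> cells.
    // Throws XmlException if the markup is not well-formed XML.
    public static List<string[]> ReadRows(string xhtml)
    {
        XmlDocument doc = new XmlDocument();
        doc.LoadXml(xhtml);

        List<string[]> rows = new List<string[]>();
        foreach (XmlNode tr in doc.SelectNodes("//tr"))
        {
            XmlNodeList cells = tr.SelectNodes("td|th");
            string[] values = new string[cells.Count];
            for (int i = 0; i < cells.Count; i++)
            {
                values[i] = cells[i].InnerText.Trim();
            }
            rows.Add(values);
        }
        return rows;
    }
}
```

Calling `TableReader.ReadRows("<table><tr><td>A</td><td>B</td></tr></table>")` gives a single row with the cells A and B. Feeding it typical tag soup, such as a cell containing an unclosed `<br>`, makes `LoadXml` throw an `XmlException`, which is exactly the failure described above.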

/claes
 
HtmlAgilityPack will take HTML (even malformed real-world HTML) and
return a DOM that you can query with XPath, much like an XML document.
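
As a rough sketch of what that looks like, assuming the HtmlAgilityPack package is referenced in the project: the `HtmlTableReader` class and `ReadRows` method are made-up names, and the XPath expressions assume you want every row of every table on the page, so narrow them for the real page.

```csharp
using System.Collections.Generic;
using HtmlAgilityPack;

public static class HtmlTableReader
{
    // Hypothetical helper: returns one string array per table row,
    // holding the decoded, trimmed text of its <th>/<td> cells.
    public static List<string[]> ReadRows(string html)
    {
        HtmlDocument doc = new HtmlDocument();
        doc.LoadHtml(html);

        List<string[]> rows = new List<string[]>();

        // SelectNodes returns null (not an empty list) when nothing matches.
        HtmlNodeCollection trs = doc.DocumentNode.SelectNodes("//table//tr");
        if (trs == null)
        {
            return rows;
        }

        foreach (HtmlNode tr in trs)
        {
            HtmlNodeCollection cells = tr.SelectNodes("th|td");
            if (cells == null)
            {
                continue; // row without cells
            }

            string[] values = new string[cells.Count];
            for (int i = 0; i < cells.Count; i++)
            {
                // DeEntitize turns entities such as &amp; back into characters.
                values[i] = HtmlEntity.DeEntitize(cells[i].InnerText).Trim();
            }
            rows.Add(values);
        }
        return rows;
    }
}
```

The HTML string passed to `ReadRows` would be whatever you downloaded from the remote page, for instance with `HttpWebRequest` and a `StreamReader`.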
 
Jim,

Try this one if you need the text from a particular tag.

You need a textbox on the form and a reference to Microsoft.mshtml.
(It is tested.)

\\\
using System;
using System.IO;
using System.Net;
using System.Text;
using System.Windows.Forms;

namespace WindowsApplication4
{
    public partial class Form1 : Form
    {
        public Form1()
        {
            InitializeComponent();
        }

        private void Form1_Load(object sender, EventArgs e)
        {
            // Only needed to display the sample output.
            textBox1.Multiline = true;
            textBox1.ScrollBars = ScrollBars.Both;

            // Download the page and write it into an mshtml document.
            mshtml.IHTMLDocument2 Doc = new mshtml.HTMLDocumentClass();
            HttpWebRequest wbReq =
                (HttpWebRequest)WebRequest.Create("http://msdn.microsoft.com/");
            using (HttpWebResponse wbResp = (HttpWebResponse)wbReq.GetResponse())
            using (StreamReader myReader = new StreamReader(wbResp.GetResponseStream()))
            {
                Doc.write(myReader.ReadToEnd());
                Doc.close();
            }

            // The part below is not complete for all tags. You will almost
            // certainly need to tailor it to your needs, for example by
            // testing for "td" or "th" to pick up table cells.
            StringBuilder sb = new StringBuilder();
            for (int i = 0; i < Doc.all.length; i++)
            {
                mshtml.IHTMLElement hElm =
                    (mshtml.IHTMLElement)Doc.all.item(i, i);
                string hE = hElm.tagName.ToLower();
                if (hE == "body" || hE == "html" || hE == "head")
                {
                    sb.Append(hElm.innerText + Environment.NewLine);
                }
            }
            textBox1.Text = sb.ToString();
        }
    }
}
///
I hope this helps,

Cor
 
Jim,

Before somebody else says it: if you only need the raw page and don't have to
examine the tags as in my previous answer, then it is much easier.

\\\
using System;
using System.IO;
using System.Net;
using System.Windows.Forms;

namespace WindowsApplication4
{
    public partial class Form1 : Form
    {
        public Form1()
        {
            InitializeComponent();
        }

        private void Form1_Load(object sender, EventArgs e)
        {
            // Download the raw HTML and show it in the textbox.
            HttpWebRequest myReq =
                (HttpWebRequest)WebRequest.Create("http://www.google.com");
            using (HttpWebResponse myResp = (HttpWebResponse)myReq.GetResponse())
            using (StreamReader myReader = new StreamReader(myResp.GetResponseStream()))
            {
                textBox1.Text = myReader.ReadToEnd();
            }
        }
    }
}
///

Cor
 