Parsing through HTML

  • Thread starter: BobRoyAce

BobRoyAce

I've got a WebBrowser control on a form, called WebBrowser1, and,
through code, I navigate to a web page. Now, on this web page, among
many other things, there is a <table> with several rows, each having
five columns, two of which I care about. I know that I could write
string parsing code to find the <table>, then keep parsing through the
text of WebBrowser1.Document to find each "cell" that I care about,
and pull the text contained therein. However...

If this were XML, I could nicely grab the items I care about with some
yet-to-be-determined code. I am wondering if there is also a way to do
this with HTML. A home run would be if I could at least do something
like this:

Pseudocode...

TheTable = GetTheTable

For Each Row in TheTable.GetRows
    sCol2Value = Row.Column(2)
    sCol5Value = Row.Column(5)
Next

If this isn't easy to do in code, are there any "components" out there
to facilitate something like this?
 
Parse the HTML code as XML. XDocument.Parse(HTMLString) should do the trick
you are looking for.

XDocument is in .NET 3.5
 
That's contained in System.Xml.Linq...will have to look more into
that. However, I can't add a reference to that assembly to my project;
it's grayed out. I do have a reference to System.Xml, though. Does that
keep me from adding a reference to System.Xml.Linq?
 
Most likely, you have the target framework set to something other than 3.5
(2.0 probably).

You can check by right-clicking on your project in Solution Explorer, selecting
Properties -> Compile, and then clicking the Advanced Compile Options button.
There is a drop-down on the dialog that lets you set the target framework.
 
Exactly right...was targeting 2.0. Changed that, and now I can add a
reference to it. Thanks.
 
Ryan S. Thiele said:
Parse the HTML code as XML. XDocument.Parse(HTMLString) should do the
trick you are looking for.

XDocument is in .NET 3.5

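This will only work with well-formed XHTML, not with HTML in general. HTML
is an SGML application, so a page can be valid HTML (unclosed tags, unquoted
attributes) and still be rejected by an XML parser.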
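
To illustrate the point, here is a minimal sketch (not a definitive
implementation) that feeds two made-up strings to XDocument.Parse. The
module name and the sample markup are assumptions for illustration only;
the project must target .NET 3.5 and reference System.Xml.Linq.

```vbnet
Imports System
Imports System.Collections.Generic
Imports System.Linq
Imports System.Xml
Imports System.Xml.Linq

Module XhtmlParseSketch
    Sub Main()
        ' Well-formed markup: every tag is closed, so an XML parser accepts it.
        Dim xhtml As String = "<table><tr><td>a</td><td>b</td></tr></table>"
        ' Ordinary HTML: the unclosed <br> is legal HTML but not well-formed XML.
        Dim html As String = "<table><tr><td>a<br></td><td>b</td></tr></table>"

        For Each markup As String In New String() {xhtml, html}
            Try
                Dim doc As XDocument = XDocument.Parse(markup)
                For Each row As XElement In doc.Descendants("tr")
                    ' Elements("td") yields the row's direct child cells in order.
                    Dim cells As List(Of XElement) = row.Elements("td").ToList()
                    Console.WriteLine("Parsed; second cell = " & cells(1).Value)
                Next
            Catch ex As XmlException
                ' Raised for the second string because </td> does not close <br>.
                Console.WriteLine("Not well-formed XML: " & ex.Message)
            End Try
        Next
    End Sub
End Module
```

Real pages tend to fail the same way, which is why a forgiving HTML
parser is usually the safer choice for scraped content.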
 
BobRoyAce said:
I've got a WebBrowser control on a form, called WebBrowser1, and,
through code, I navigate to a web page. Now, on this web page, among
many other things, there is a <table> with several rows, each having
five columns, two of which I care about. I know that I could write
string parsing code to find the <table>, then keep parsing through the
text of WebBrowser1.Document to find each "cell" that I care about,
and pull the text contained therein.

Take a look at MSHTML (provided by Microsoft) or another HTML parser:

Html Agility Pack
<URL:http://www.codeplex.com/htmlagilitypack>

SgmlReader
<URL:http://wiki.developer.mindtouch.com/Community/SgmlReader>
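
As a rough sketch of what the Html Agility Pack route could look like,
assuming the library is referenced in the project: the table id "prices"
and the sample markup below are hypothetical, and in the WebBrowser
scenario the string could come from WebBrowser1.DocumentText instead.

```vbnet
Imports System
Imports HtmlAgilityPack

Module AgilityPackSketch
    Sub Main()
        ' Hypothetical markup standing in for the downloaded page.
        Dim html As String = _
            "<table id='prices'>" & _
            "<tr><td>1</td><td>Widget</td><td>2</td><td>3</td><td>9.99</td></tr>" & _
            "<tr><td>2</td><td>Gadget</td><td>5</td><td>1</td><td>4.50</td></tr>" & _
            "</table>"

        ' Fully qualified because System.Windows.Forms also has an HtmlDocument.
        Dim doc As New HtmlAgilityPack.HtmlDocument()
        doc.LoadHtml(html)

        ' SelectNodes returns Nothing (not an empty collection) when nothing matches.
        Dim rows As HtmlNodeCollection = _
            doc.DocumentNode.SelectNodes("//table[@id='prices']//tr")
        If rows Is Nothing Then Return

        For Each row As HtmlNode In rows
            Dim cells As HtmlNodeCollection = row.SelectNodes("td")
            ' Skip rows without at least five <td> cells (e.g. header rows).
            If cells Is Nothing OrElse cells.Count < 5 Then Continue For

            ' Collections are zero-based: columns 2 and 5 are indexes 1 and 4.
            Dim sCol2Value As String = cells(1).InnerText.Trim()
            Dim sCol5Value As String = cells(4).InnerText.Trim()
            Console.WriteLine(sCol2Value & " | " & sCol5Value)
        Next
    End Sub
End Module
```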
 
BobRoyAce said:
TheTable = GetTheTable

For Each Row in TheTable.GetRows
    sCol2Value = Row.Column(2)
    sCol5Value = Row.Column(5)
Next

If this isn't easy to do in code, are there any "components" out there
to facilitate something like this?

Yes, there is, but there is a learning curve. Basically, most things that
appear in HTML are elements, such as tables, which contain rows and
cells. You can use getElementsByName/getElementById to get to the table
object quickly. Here are some links:

Document Object Model
http://msdn.microsoft.com/en-us/library/ms533043(VS.85).aspx

HTML Elements
http://msdn.microsoft.com/en-us/library/ms533029(VS.85).aspx

TABLE Element | table Object
http://msdn.microsoft.com/en-us/library/ms535901(VS.85).aspx
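
The DOM approach described above can be sketched with the managed wrappers
the WebBrowser control already exposes (HtmlDocument, HtmlElement), which
comes close to the pseudocode in the question. This is only a sketch under
assumptions: the table id "theTable" is hypothetical, and if the real table
has no id, Document.GetElementsByTagName("table") plus an index would be
needed instead.

```vbnet
Imports System.Diagnostics
Imports System.Windows.Forms

Public Class Form1
    ' Assumes the designer-created WebBrowser control named WebBrowser1
    ' from the question. The DOM is only safe to read once loading is done.
    Private Sub WebBrowser1_DocumentCompleted(sender As Object, _
            e As WebBrowserDocumentCompletedEventArgs) _
            Handles WebBrowser1.DocumentCompleted

        ' "theTable" is a hypothetical id; GetElementById returns Nothing if absent.
        Dim table As HtmlElement = WebBrowser1.Document.GetElementById("theTable")
        If table Is Nothing Then Return

        For Each row As HtmlElement In table.GetElementsByTagName("tr")
            ' Note: this also picks up cells of any tables nested inside the row.
            Dim cells As HtmlElementCollection = row.GetElementsByTagName("td")
            ' Skip rows without at least five <td> cells (e.g. <th> header rows).
            If cells.Count < 5 Then Continue For

            ' Collections are zero-based: columns 2 and 5 are indexes 1 and 4.
            Dim sCol2Value As String = cells(1).InnerText
            Dim sCol5Value As String = cells(4).InnerText
            Debug.WriteLine(sCol2Value & " | " & sCol5Value)
        Next
    End Sub
End Class
```

This works with the 2.0 target framework as well, since it needs neither
System.Xml.Linq nor a third-party parser.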
 