How to perform XPath queries on HTML?


Siegfried Heintze

JTidy is a Java library that populates an XML DOM from an HTML string,
and the XML DOM supports XPath. Is there a similar library for C# and VB.NET
programmers that would let me run XPath queries on HTML?

Also, what is the name of the HTTP client that will allow me to fetch the
HTML from a web site?

thanks,
Siegfried
 
You can only perform XPath operations on XML, so, by definition, it can't be
used with HTML. But if you have XHTML, you can simply load it into an XML DOM
document (XmlDocument in .NET) and run XPath queries on that.

-Scott
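
To illustrate the XHTML route described above, here is a minimal sketch that loads a well-formed XHTML string into an XmlDocument and selects link targets with XPath. The class name, method name, and sample markup are made up for the example. It assumes the input carries the standard XHTML namespace and has no DOCTYPE; a document with a DOCTYPE may need extra handling that is left out here.

```csharp
using System;
using System.Collections.Generic;
using System.Xml;

static class XhtmlXPath
{
    // Returns the href of every <a> element in a well-formed XHTML string.
    public static List<string> GetLinks(string xhtml)
    {
        var doc = new XmlDocument();
        doc.LoadXml(xhtml); // throws XmlException if the markup is not well-formed

        // XHTML elements live in a namespace, so the query needs a prefix
        // ("x" is an arbitrary choice) mapped to that namespace.
        var ns = new XmlNamespaceManager(doc.NameTable);
        ns.AddNamespace("x", "http://www.w3.org/1999/xhtml");

        var links = new List<string>();
        foreach (XmlNode a in doc.SelectNodes("//x:a[@href]", ns))
        {
            links.Add(a.Attributes["href"].Value);
        }
        return links;
    }

    static void Main()
    {
        // Hypothetical sample input.
        string xhtml =
            "<html xmlns=\"http://www.w3.org/1999/xhtml\"><body>" +
            "<p><a href=\"/one\">One</a> <a href=\"/two\">Two</a></p>" +
            "</body></html>";

        foreach (string href in GetLinks(xhtml))
        {
            Console.WriteLine(href);
        }
    }
}
```

Note that an unprefixed query such as //a would match nothing here, because the elements are in the XHTML namespace.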
 
Siegfried said:
JTidy is a Java library that populates an XML DOM from an HTML string,
and the XML DOM supports XPath. Is there a similar library for C# and VB.NET
programmers that would let me run XPath queries on HTML?

Use HtmlAgilityPack. It has a basic, lenient HTML parser and implements
IXPathNavigable and a basic DOM, so it can be searched using XPath.

http://www.codeplex.com/htmlagilitypack
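
As a rough sketch of what that looks like in practice, the following parses a deliberately sloppy HTML string with HtmlAgilityPack and runs an XPath query over the resulting tree. It assumes the HtmlAgilityPack package is referenced by the project; the class name, method name, and sample markup are invented for illustration.

```csharp
using System;
using System.Collections.Generic;
using HtmlAgilityPack; // third-party package, not part of the .NET base class library

static class HapXPath
{
    // Returns the href of every <a> element the lenient parser finds.
    public static List<string> GetLinks(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html); // does not throw on malformed markup

        var links = new List<string>();

        // SelectNodes returns null (not an empty collection) when nothing matches.
        HtmlNodeCollection anchors = doc.DocumentNode.SelectNodes("//a[@href]");
        if (anchors == null)
        {
            return links;
        }

        foreach (HtmlNode a in anchors)
        {
            links.Add(a.GetAttributeValue("href", ""));
        }
        return links;
    }

    static void Main()
    {
        // Hypothetical input: unquoted attribute, unclosed <p> and <b>.
        string html = "<html><body><p>See <a href=/one>one</a> and " +
                      "<b><a href='/two'>two</a></body></html>";

        foreach (string href in GetLinks(html))
        {
            Console.WriteLine(href);
        }
    }
}
```
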
Also, what is the name of the HTTP client that will allow me to fetch the
HTML from a web site?

WebRequest & WebResponse should be able to do this for you, no? Do you
have more specific questions about WebRequest.Create / etc?
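
For reference, a minimal sketch of fetching a page with WebRequest and WebResponse might look like the following. The class name, method name, and URL are placeholders. The sketch assumes the response body can be decoded with StreamReader's default encoding (UTF-8), which will not hold for every site, and it does no error handling. Newer .NET versions mark WebRequest as obsolete in favor of HttpClient.

```csharp
using System;
using System.IO;
using System.Net;

static class HtmlFetcher
{
    // Downloads the resource at the given URL and returns its body as a string.
    public static string Fetch(string url)
    {
        WebRequest request = WebRequest.Create(url);

        using (WebResponse response = request.GetResponse())
        using (Stream stream = response.GetResponseStream())
        using (var reader = new StreamReader(stream)) // assumes UTF-8 content
        {
            return reader.ReadToEnd();
        }
    }

    static void Main(string[] args)
    {
        // Placeholder URL; pass a real one as the first argument.
        string url = args.Length > 0 ? args[0] : "http://example.com/";
        string html = Fetch(url);
        Console.WriteLine("Fetched " + html.Length + " characters");
    }
}
```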

-- Barry
 
Scott said:
You can only perform XPath operations on XML, so, by definition, it can't be
used with HTML.

A handy thing about the XPathNavigator class in .NET is that if you can
implement it (i.e. derive and implement its abstract methods) for your
arbitrary tree-shaped data structure, then you can query it using XPath.
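
A full custom XPathNavigator is too long to show here, so the sketch below only illustrates the consuming side of that idea: the query code depends on nothing but IXPathNavigable, so any type that can hand out a navigator can be queried the same way. XmlDocument is used as a stand-in source; the class and method names are made up for the example.

```csharp
using System;
using System.Collections.Generic;
using System.Xml;
using System.Xml.XPath;

static class NavigatorQuery
{
    // Runs an XPath query against any source that can produce a navigator
    // and returns the string value of each selected node.
    public static List<string> SelectValues(IXPathNavigable source, string xpath)
    {
        XPathNavigator nav = source.CreateNavigator();
        XPathNodeIterator it = nav.Select(xpath);

        var values = new List<string>();
        while (it.MoveNext())
        {
            values.Add(it.Current.Value);
        }
        return values;
    }

    static void Main()
    {
        // XmlDocument implements IXPathNavigable; a custom tree with its own
        // XPathNavigator subclass could be passed in exactly the same way.
        var doc = new XmlDocument();
        doc.LoadXml("<root><item>a</item><item>b</item></root>");

        foreach (string value in SelectValues(doc, "/root/item"))
        {
            Console.WriteLine(value);
        }
    }
}
```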

-- Barry
 
But, since HTML may not be a well-formed tree structure, wouldn't you have
problems querying it?
 
Scott said:
But, since HTML may not be a well-formed tree structure, wouldn't you have
problems querying it?

As I said earlier, HtmlAgilityPack uses a very lenient but
deterministic HTML parser. It can build a tree out of just about any
source HTML. As long as the XPath query works on one instance of the
server's generated HTML (assuming it is generated; otherwise, why
automate the querying?), it should work on subsequent instances.

In other words, even if the HTML is malformed and results in a
non-compliant tree, the formation of the tree itself is deterministic
and so it ought to be consistently queryable.
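
One way to check this for a particular page is to parse the same malformed input twice and compare what a query returns each time. The sketch below does that with HtmlAgilityPack (assumed to be referenced by the project); the class name, method name, sample markup, and query are invented for illustration, and the exact tree shape the parser produces is not asserted here.

```csharp
using System;
using System.Text;
using HtmlAgilityPack; // third-party package, not part of the .NET base class library

static class DeterminismCheck
{
    // Parses the HTML and returns one line per matched node:
    // the node's position in the tree plus its trimmed text.
    public static string Snapshot(string html, string xpath)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        var sb = new StringBuilder();
        HtmlNodeCollection nodes = doc.DocumentNode.SelectNodes(xpath);
        if (nodes == null)
        {
            return ""; // SelectNodes returns null when nothing matches
        }

        foreach (HtmlNode node in nodes)
        {
            sb.AppendLine(node.XPath + " = " + node.InnerText.Trim());
        }
        return sb.ToString();
    }

    static void Main()
    {
        // Hypothetical malformed input: mismatched and unclosed tags.
        string html = "<div><span>first</div><span>second<div>third</span>";

        string a = Snapshot(html, "//span");
        string b = Snapshot(html, "//span");

        Console.Write(a);
        Console.WriteLine(a == b ? "Same result on both parses" : "Results differ");
    }
}
```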

-- Barry
 