How to perform XPath queries on HTML?

Siegfried Heintze · Dec 2, 2007

JTidy is a Java library that will populate an XML DOM from an HTML string.
The XML DOM supports XPath. Is there a similar library for C# and VB.NET
programmers that will allow me to perform XPath queries on HTML?

Also, what is the name of the HTTP client that will allow me to fetch the
HTML from a web site?

thanks,
Siegfried

Scott M. · Dec 2, 2007

You can only perform XPath operations on XML, so, by definition, it can't be
used with HTML. But if you have XHTML, you can simply load it into an
XmlDocument and run XPath queries against it.

-Scott

Barry Kelly · Dec 3, 2007

Siegfried said:
JTidy is a Java library that will populate an XML DOM from an HTML string.
The XML DOM supports XPath. Is there a similar library for C# and VB.NET
programmers that will allow me to perform XPath queries on HTML?

Use HtmlAgilityPack. It has a basic, lenient HTML parser and implements
IXPathNavigable and a basic DOM, so the parsed document can be queried with
XPath.

http://www.codeplex.com/htmlagilitypack
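
A minimal sketch of what that looks like in C#, assuming the HtmlAgilityPack
assembly is referenced. The LinkExtractor and ExtractHrefs names are made up
for illustration; HtmlDocument.LoadHtml, DocumentNode.SelectNodes and
GetAttributeValue are the library calls. One thing to watch for: SelectNodes
returns null rather than an empty collection when nothing matches.

```csharp
using System;
using System.Collections.Generic;
using HtmlAgilityPack;

public static class LinkExtractor
{
    // Hypothetical helper: returns the href value of every <a href="..."> in the markup.
    public static List<string> ExtractHrefs(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        var hrefs = new List<string>();

        // XPath query against the parsed HTML tree; null means "no matches".
        HtmlNodeCollection links = doc.DocumentNode.SelectNodes("//a[@href]");
        if (links == null)
        {
            return hrefs;
        }

        foreach (HtmlNode link in links)
        {
            hrefs.Add(link.GetAttributeValue("href", ""));
        }
        return hrefs;
    }
}
```

Calling LinkExtractor.ExtractHrefs(html) on a page's source should give back
the link targets in document order.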

Also, what is the name of the HTTP client that will allow me to fetch the
HTML from a web site?

WebRequest and WebResponse should be able to do this for you, no? Do you
have more specific questions about WebRequest.Create, etc.?
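
For reference, a rough sketch of fetching a page with those classes. The
PageFetcher and FetchHtml names are hypothetical, and the sketch assumes the
response body is UTF-8 text; a real client would probably want to check the
charset the server reports and handle WebException.

```csharp
using System;
using System.IO;
using System.Net;

public static class PageFetcher
{
    // Hypothetical helper: downloads the resource at the given URL and returns its body as text.
    public static string FetchHtml(string url)
    {
        WebRequest request = WebRequest.Create(url);

        // Dispose the response, stream and reader as soon as the body has been read.
        using (WebResponse response = request.GetResponse())
        using (Stream stream = response.GetResponseStream())
        using (StreamReader reader = new StreamReader(stream))
        {
            // StreamReader decodes as UTF-8 by default (assumption stated above).
            return reader.ReadToEnd();
        }
    }
}
```

The returned string can then be handed straight to an HTML parser.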

-- Barry

Barry Kelly · Dec 3, 2007

Scott said:
You can only perform XPath operations on XML, so, by definition, it can't be
used with HTML.

A handy thing about the XPathNavigator class in .NET is that if you can
implement it (i.e. derive and implement its abstract methods) for your
arbitrary tree-shaped data structure, then you can query it using XPath.
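
HtmlAgilityPack is an instance of this: its HtmlDocument implements
IXPathNavigable, so the framework's own XPath engine can walk the HTML tree
through a navigator. A small sketch from the consumer's side, assuming the
HtmlAgilityPack assembly is referenced (NavigatorDemo and CountMatches are
illustrative names, not part of any library):

```csharp
using System;
using System.Xml.XPath;
using HtmlAgilityPack;

public static class NavigatorDemo
{
    // Hypothetical helper: counts the nodes matching an XPath expression
    // using only the generic System.Xml.XPath API.
    public static int CountMatches(string html, string xpath)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        // CreateNavigator comes from IXPathNavigable; from here on nothing is HTML-specific.
        XPathNavigator navigator = doc.CreateNavigator();
        XPathNodeIterator matches = navigator.Select(xpath);
        return matches.Count;
    }
}
```

Writing your own navigator for a different tree type means overriding the
abstract members of XPathNavigator, which is considerably more code than
consuming one.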

-- Barry

Scott M. · Dec 4, 2007

But, since HTML may not be a well-formed tree structure, wouldn't you have
problems querying it?

Barry Kelly · Dec 5, 2007

Scott said:
But, since HTML may not be a well-formed tree structure, wouldn't you have
problems querying it?

Like I said earlier, HtmlAgilityPack uses a very lenient but
deterministic HTML parser. It can make a tree out of just about any
source HTML. As long as the XPath query works on one instance of the
server side's generated HTML (assuming it is generated; otherwise, why
automate the querying?), it should work on subsequent instances.

In other words, even if the HTML is malformed and results in a
non-compliant tree, the formation of the tree itself is deterministic
and so it ought to be consistently queryable.
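
A quick way to convince yourself of that, sketched under the assumption that
the HtmlAgilityPack assembly is referenced. DeterminismCheck and its methods
are hypothetical names, and comparing the serialized OuterHtml is only a
stand-in for comparing the two trees node by node.

```csharp
using System;
using HtmlAgilityPack;

public static class DeterminismCheck
{
    // Hypothetical helper: parses the markup and serializes the resulting tree back to text.
    public static string ParseAndSerialize(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);
        return doc.DocumentNode.OuterHtml;
    }

    // True when two independent parses of the same input serialize identically.
    public static bool ParsesConsistently(string html)
    {
        return ParseAndSerialize(html) == ParseAndSerialize(html);
    }
}
```

Feeding it something sloppy, such as a list with unclosed items or
overlapping inline tags, should still come back true, whatever shape the
parser decides to give the tree.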

-- Barry
