Parsing HTML files

Philip Townsend · Mar 3, 2004

Does anybody know of a way to parse HTML files when it is unknown what
the file will look like? I need to extract the <title> element from a
group of pages, where some pages may not be titled. There is no .NET
object available that I can see. Does anybody know of any controls
available for purchase? Thanks...

Itai Raz · Mar 3, 2004

How about just parsing the raw HTML and looking for the <title> tag?
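
The plain string search suggested above could be sketched roughly as follows. This is only an illustration, not code from the thread: the class and method names (TitleFinder, FindTitle) are hypothetical, and it assumes .NET 2.0 or later for the StringComparison overloads of IndexOf. It returns null for pages that have no title.

```csharp
using System;

public static class TitleFinder
{
    // Returns the text between <title> and </title>, or null if the page has no title.
    public static string FindTitle(string html)
    {
        // Case-insensitive search for the start of the opening tag.
        int open = html.IndexOf("<title", StringComparison.OrdinalIgnoreCase);
        if (open < 0) return null;

        // Skip to the end of the opening tag, which may carry attributes.
        int start = html.IndexOf('>', open);
        if (start < 0) return null;
        start++;

        // Find the closing tag that follows the opening tag.
        int close = html.IndexOf("</title", start, StringComparison.OrdinalIgnoreCase);
        if (close < 0) return null;

        return html.Substring(start, close - start).Trim();
    }
}
```

Something like TitleFinder.FindTitle(pageText) would then be called once per file. It does not try to cope with commented-out markup or script blocks that happen to contain the text "<title".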

Nicholas Paldino [.NET/C# MVP] · Mar 3, 2004

Philip,

If all you are looking for is the title, then I would recommend using a
regular expression. It avoids building a full document object model, so
it should simply be faster. If you need more information from the object
model, then I would use COM interop and create an instance of
MSHTML.HTMLDocument. This will allow you to load a document into the
object and access the DOM.

Hope this helps.

Philipp Sumi · Mar 3, 2004

Hi Philip

If it's only the title, I would just search for the <title> element as
Itai suggested. However, if you need more flexibility, there is a
library on gotdotnet that converts HTML to an XML DOM. With that, you
can easily use XPath (I don't have the link).

Regards, Philipp

mikeb · Mar 3, 2004

Philipp said:
Hi Philip

If it's only the title, I would just search for the <title> element as
Itai suggested. However, if you need more flexibility, there is a
library on gotdotnet that converts HTML to an XML DOM. With that, you
can easily use XPath (I don't have the link).

The project is SgmlReader by Chris Lovett (clovett). You can find it at:

http://www.gotdotnet.com/Community/...mpleGuid=B90FDDCE-E60D-43F8-A5C4-C3BD760564BC

Zeeshan · Mar 4, 2004

You have two options:
1. Use a regular expression.
2. Convert the HTML into XHTML, load that document into an XML DOM, and
check for the title tag.

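Regular expression:

// Requires: using System.Text.RegularExpressions;
// Singleline lets "." span line breaks; the lazy ".*?" stops at the first </title>.
Match title = Regex.Match(html, "<title[^>]*>(.*?)</title>",
RegexOptions.IgnoreCase | RegexOptions.Singleline);
// Pages without a title yield an empty string.
string strTitle = title.Success ? title.Groups[1].Value.Trim() : string.Empty;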

For a converter that turns HTML into XHTML, see the link below:
http://www.eggheadcafe.com/articles/20030317.asp

Use that library to convert your document into XHTML, then load the
converted document into an XML DOM and search for the title tag.

regards,
Zeeshan Anwar.
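
Once a page has been converted to well-formed XHTML, the second option might look roughly like the sketch below. It is an illustration only: XhtmlTitle and FindTitle are hypothetical names, the conversion step itself is not shown, and the input is assumed to contain no entity references that need an external DTD. The XPath matches on local-name() because XHTML elements normally sit in the http://www.w3.org/1999/xhtml namespace, where a plain "//title" would find nothing.

```csharp
using System;
using System.Xml;

public static class XhtmlTitle
{
    // Returns the title text of a well-formed XHTML string, or null if there is none.
    public static string FindTitle(string xhtml)
    {
        XmlDocument doc = new XmlDocument();
        doc.XmlResolver = null; // do not fetch external DTDs over the network
        doc.LoadXml(xhtml);

        // Match the element by local name so the XHTML namespace does not matter.
        XmlNode node = doc.SelectSingleNode("//*[local-name()='title']");
        return node == null ? null : node.InnerText.Trim();
    }
}
```

LoadXml throws an XmlException if the input is not well-formed, so the conversion has to succeed before this step is of any use.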

_Andy_ · Mar 21, 2004

Philip Townsend said:
Does anybody know of a way to parse HTML files when it is unknown what
the file will look like? I need to extract the <title> element from a
group of pages, where some pages may not be titled. There is no .NET
object available that I can see. Does anybody know of any controls
available for purchase? Thanks...

I guess this might be too late, but I've just released version 1.2 of
an HTML 4 parser, imaginatively named "HTML Parser v1.2".

http://www.planet-source-code.com/vb/scripts/ShowCode.asp?txtCodeId=2201&lngWId=10

Rgds,
