Parsing HTML files

  • Thread starter: Philip Townsend

Philip Townsend

Does anybody know of a way to parse HTML files when it is unknown what
the file will look like? I need to extract the <title> element from a
group of pages, where some pages may not be titled. There is no .NET
object available that I can see. Does anybody know of any controls
available for purchase? Thanks...
 
Philip,

If all you are looking for is the title, then I would recommend using
regular expressions; they should be faster than building a full document
object model. If you need more information from the object model, then I
would use COM interop and create an instance of MSHTML.HTMLDocument. This
will allow you to load a document into the object and access the DOM.

Hope this helps.
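
For illustration, here is a minimal sketch of the regular-expression approach that also copes with pages that have no title. The class name TitleExtractor and the method GetTitle are hypothetical names chosen for this example, not part of any library; the sketch assumes the page has already been read into a string.

```csharp
using System;
using System.Net;
using System.Text.RegularExpressions;

// Hypothetical helper, named for this example only.
public static class TitleExtractor
{
    // Tolerates attributes on <title>, mixed case, and titles spanning lines.
    private static readonly Regex TitleRegex = new Regex(
        @"<title\b[^>]*>(.*?)</title\s*>",
        RegexOptions.IgnoreCase | RegexOptions.Singleline);

    // Returns the decoded title text, or null when the page has no <title>.
    public static string GetTitle(string html)
    {
        Match m = TitleRegex.Match(html);
        if (!m.Success) return null;

        // Decode entities such as &amp; and collapse runs of whitespace.
        string text = WebUtility.HtmlDecode(m.Groups[1].Value);
        return Regex.Replace(text, @"\s+", " ").Trim();
    }
}
```

Returning null for untitled pages lets the caller tell "no title" apart from an empty title when looping over a group of files.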
 
Hi Philip

If it's only the title, I would just search for the <title> element as
Itai suggested. However, if you need more flexibility, there is a
library on gotdotnet that converts HTML to an XML DOM. With it, you can
easily use XPath (I don't have the link).

Regards, Philipp
 
You have two options:
1. Use a regular expression.
2. Convert the HTML into XHTML, load that document into an XML DOM, and
check for the title tag.

Regular Expression:

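// [^>]* allows attributes on the tag; (.*?) with Singleline matches any title text, even across lines.
Match title = Regex.Match(html, @"<title[^>]*>(.*?)</title>",
    RegexOptions.IgnoreCase | RegexOptions.Singleline);
// Pages without a <title> yield an empty string.
string strTitle = title.Success ? title.Groups[1].Value.Trim() : string.Empty;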

For a converter that turns HTML into XHTML, see the link below:
http://www.eggheadcafe.com/articles/20030317.asp

Use that library to convert your document into XHTML, then load the
converted document into an XML DOM and search for the title tag.
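
The second step (loading the converted document and looking up the title) might look roughly like the sketch below. It does not use the converter from the link; it assumes you already hold well-formed XHTML in a string, with no named entities (such as &nbsp;) that would need an external DTD to resolve. XhtmlTitle and GetTitle are hypothetical names for this example.

```csharp
using System;
using System.Xml;

// Hypothetical helper, named for this example only.
public static class XhtmlTitle
{
    // Returns the <title> text from well-formed XHTML, or null when absent.
    public static string GetTitle(string xhtml)
    {
        var doc = new XmlDocument();
        doc.XmlResolver = null; // do not fetch external DTDs over the network
        doc.LoadXml(xhtml);     // throws XmlException if the input is not well-formed

        // local-name() matches <title> whether or not the XHTML namespace is declared.
        XmlNode node = doc.SelectSingleNode("//*[local-name()='title']");
        return node == null ? null : node.InnerText.Trim();
    }
}
```

Matching on local-name() avoids setting up an XmlNamespaceManager, at the cost of also matching a title element from any other namespace, which is usually acceptable for this task.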

regards,
Zeeshan Anwar.



 
I guess this might be too late, but I've just released version 1.2 of
an HTML 4 parser, imaginatively named "HTML Parser v1.2".

http://www.planet-source-code.com/vb/scripts/ShowCode.asp?txtCodeId=2201&lngWId=10

Rgds,
 