HTML parser

  • Thread starter: Guest

I want to write a program in C# to parse an HTML file. How can I do this with system.mshtml? Or is there another way to do it? Please help me.
 
Consider the HTML file as simple XML.
You can do it by loading the HTML file into an XML document object:
XmlDocument doc = new XmlDocument();
doc.Load("yourfile");

Now you can parse through the doc object using the XmlDocument class library (SelectNodes etc.).

cheers
Aditya
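
A minimal sketch of this suggestion, assuming the input is already well-formed XHTML with no namespace declaration. The class name XhtmlLinks, the GetHrefs helper and the sample markup are made up for illustration; XmlDocument.LoadXml and SelectNodes are the standard System.Xml calls.

```csharp
using System;
using System.Collections.Generic;
using System.Xml;

public static class XhtmlLinks
{
    // Hypothetical helper: returns the href of every <a> element
    // in a well-formed XHTML string.
    public static List<string> GetHrefs(string xhtml)
    {
        XmlDocument doc = new XmlDocument();
        doc.LoadXml(xhtml); // throws XmlException if the markup is not well formed

        List<string> hrefs = new List<string>();
        foreach (XmlNode node in doc.SelectNodes("//a[@href]"))
        {
            hrefs.Add(node.Attributes["href"].Value);
        }
        return hrefs;
    }

    public static void Main()
    {
        // Sample markup invented for this sketch.
        string page = "<html><body><a href=\"a.html\">A</a>"
                    + "<p><a href=\"b.html\">B</a></p></body></html>";
        foreach (string href in GetHrefs(page))
        {
            Console.WriteLine(href);
        }
    }
}
```

If the document declares the XHTML namespace, the XPath query would also need an XmlNamespaceManager; that case is left out here.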

----- majid wrote: -----

I want to write a program in C# to parse an HTML file. How can I do this with system.mshtml? Or is there another way to do it? Please help me.
 
You can't just load HTML into an XmlDocument. There is no guarantee that
the HTML is well formed. For example, can the following be loaded as XML?
No, because the <img> element is never closed.

<html><body><img src="myimage.jpg"></body></html>
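
To see the failure concretely, here is a small sketch that tries to load the markup above and reports what the parser says. The class name WellFormedCheck and the TryLoad helper are hypothetical; the self-closed variant is added only for comparison.

```csharp
using System;
using System.Xml;

public static class WellFormedCheck
{
    // Hypothetical helper: returns null if the markup loads as XML,
    // otherwise the parser's error message.
    public static string TryLoad(string markup)
    {
        try
        {
            new XmlDocument().LoadXml(markup);
            return null;
        }
        catch (XmlException ex)
        {
            return ex.Message;
        }
    }

    public static void Main()
    {
        string html = "<html><body><img src=\"myimage.jpg\"></body></html>";
        string xhtml = "<html><body><img src=\"myimage.jpg\" /></body></html>";

        // The first call reports a parse error; the second loads cleanly.
        Console.WriteLine(TryLoad(html) ?? "loaded");
        Console.WriteLine(TryLoad(xhtml) ?? "loaded");
    }
}
```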


 
However, you can use the results of the load to reformat your HTML and reapply
it to the DOM. This type of massaging can be extremely useful. Applying some
default transformations to the HTML first enables most HTML to be loaded, for
instance terminating elements that don't have end tags, like <img> and <br>.
For the harder cases the exceptions help:

<li>Some stuff
<li>Some more stuff

The above can be a bit harder to write default transforms for, so the exception
generated by the parser might just help you create a transform for the
particular HTML you are trying to load.
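
One possible shape for such a default transform, as a rough sketch only: a regular expression that rewrites a handful of end-tag-less elements as self-closing before the load. The class name VoidTagFixer, the CloseVoidTags helper and the list of element names are assumptions for illustration. The approach is naive (it breaks on attribute values containing ">") and does nothing for unclosed <li> items, unquoted attributes or HTML entities.

```csharp
using System;
using System.Text.RegularExpressions;
using System.Xml;

public static class VoidTagFixer
{
    // Elements that HTML allows without an end tag (partial, illustrative list).
    private static readonly Regex VoidTag = new Regex(
        @"<(img|br|hr|input|meta|link)\b([^>]*?)\s*/?>",
        RegexOptions.IgnoreCase);

    // Rewrites <img ...> and similar tags as <img ... /> so an XML parser accepts them.
    public static string CloseVoidTags(string html)
    {
        return VoidTag.Replace(html, "<$1$2 />");
    }

    public static void Main()
    {
        string html = "<html><body><img src=\"myimage.jpg\"><br></body></html>";
        string fixedHtml = CloseVoidTags(html);
        Console.WriteLine(fixedHtml);

        // Loading the untransformed string would throw XmlException here.
        XmlDocument doc = new XmlDocument();
        doc.LoadXml(fixedHtml);
        Console.WriteLine(doc.SelectNodes("//img").Count);
    }
}
```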
 
Have a look at these links; they may help you get started.

This is an SGML to XML converter written by Chris Lovett (MSFT), and source
code is included. Once your HTML has been converted to (hopefully well-formed)
XML, you can parse it with the Xml classes of .NET. That's how I would probably
do it.
http://www.gotdotnet.com/Community/...mpleGuid=b90fddce-e60d-43f8-a5c4-c3bd760564bc

I don't know anything about this link, but I ran across it while trying to
find the above link again. It may help you. It should show you how to use the
MSHTML control.
http://www.itwriting.com/htmleditor/index.php

Hope that helps,
Mike Mayer - Visual C# MVP

 
True, but the overhead involved with exception handling might not be worth
the effort. When loading a document, only the first error is identified.
If you have a standard web page, there may be hundreds of tags that are not
well formed. Processing an exception for each of those instances would be
cumbersome and expensive.

It might be better to invest in a library that makes the HTML content
well formed, such as Tidy (http://sourceforge.net/projects/ntidy/).
 
Embed a hidden IE browser control in your application.

It has a Navigate method that takes a URL, and an event that fires
when the page has finished loading.

Once loading has finished, use the Document object to get the DOM. The
browser will have parsed it all for you.

That is 50 lines of code at most, and it parses any HTML the browser can
display.

Peter
 
// Requires a COM reference to "Microsoft Internet Controls" (SHDocVw).
SHDocVw.InternetExplorer IE = new SHDocVw.InternetExplorer();
IE.Visible = false;
object Dummy = System.Type.Missing;
IE.Navigate("http://www.google.com", ref Dummy, ref Dummy, ref Dummy, ref Dummy);
// This makes it spin until the page is loaded
// (so a separate thread might be a good idea).
while (IE.Busy || IE.ReadyState != SHDocVw.tagREADYSTATE.READYSTATE_COMPLETE)
{
    System.Threading.Thread.Sleep(100);
}
// It's loaded now; fire your event with IE.Document as data and do whatever
// you want with it (including parsing).

Yves

 