HTML Parser

  • Thread starter: wilk

wilk

Does anybody here know of any class in .NET that would help me parse HTML in
C#?
Or maybe you can even tell me how to do it?
 
I would think you could parse HTML as XHTML and use the System.Xml classes
to do it, as long as the markup is well-formed XML.
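
A minimal sketch of that idea, assuming the input really is well-formed XHTML (the sample markup and the "x" prefix are made up for illustration). It also shows what happens with ordinary HTML that does not close its tags: the XML parser rejects it.

```csharp
using System;
using System.Xml;

class XhtmlDemo
{
    static void Main()
    {
        // Well-formed XHTML: every tag is closed and attributes are quoted.
        string xhtml =
            "<html xmlns=\"http://www.w3.org/1999/xhtml\">" +
            "<head><title>Demo</title></head>" +
            "<body><p>Hello<br /><a href=\"page.html\">link</a></p></body></html>";

        XmlDocument doc = new XmlDocument();
        doc.LoadXml(xhtml);

        // XHTML elements live in a namespace, so XPath needs a prefix mapping.
        XmlNamespaceManager ns = new XmlNamespaceManager(doc.NameTable);
        ns.AddNamespace("x", "http://www.w3.org/1999/xhtml");

        XmlNode title = doc.SelectSingleNode("//x:title", ns);
        Console.WriteLine(title.InnerText);

        // Typical "tag soup" HTML (unclosed <br>) is not valid XML.
        try
        {
            new XmlDocument().LoadXml("<html><body><p>Hello<br></p></body></html>");
        }
        catch (XmlException ex)
        {
            Console.WriteLine("Not well-formed: " + ex.Message);
        }
    }
}
```

So this route only helps when you control the markup or can clean it up first.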
 
Do you mean encoding it so that it is not executable, i.e. filtering user
input before adding it to a message board post? If so, follow this link:

http://msdn.microsoft.com/library/default.asp?url=/library/en-us/dv_vbcode/html/vbtskCodeEncodingHTMLTextVisualBasic.asp

To parse or tokenize the tags in an HTML document, you may have to write
your own components. To get started, I found:

System.Web.UI.BaseParser (a class)
"Provides a base set of functionality for classes involved in parsing
ASP.NET page requests and server controls."

http://msdn.microsoft.com/library/default.asp?url=/library/en-us/cpref/html/frlrfsystemwebuibaseparserclasstopic.asp

Try looking on SourceForge.net and searching for "HTML parser". All the
ones I saw are in Java, but looking at the code for these may help you
write your own in C#.
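
As a starting point for writing your own, here is a rough sketch of a tag tokenizer built on a regular expression. The class name, pattern, and sample input are illustrative only, and the approach is deliberately naive: it does not handle comments, script bodies, or a '>' inside an attribute value.

```csharp
using System;
using System.Text.RegularExpressions;

class TagTokenizer
{
    // Matches an opening or closing tag and captures its name.
    // Naive on purpose: comments, <script> bodies and '>' inside
    // attribute values are not handled.
    static readonly Regex TagPattern =
        new Regex(@"<(?<close>/?)(?<name>[A-Za-z][A-Za-z0-9]*)[^>]*>");

    static void Main()
    {
        string html = "<p>See <a href=\"a.html\">this</a><br></p>";
        foreach (Match m in TagPattern.Matches(html))
        {
            string kind = m.Groups["close"].Value == "/" ? "end" : "start";
            Console.WriteLine(kind + " " + m.Groups["name"].Value.ToLowerInvariant());
        }
        // Prints: start p, start a, end a, start br, end p (one per line).
    }
}
```

A real parser would need a proper state machine, which is why the existing open-source parsers are worth reading first.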

Also, you may want to keep checking the feature lists for ASP.NET 2.0
(Whidbey). That may be one of the new features. Post your request on the
forums of this site, and maybe MS will publicly expose its own parser in
2.0.

http://www.asp.net/whidbey/

Michael Lang, MCSD
 
"Michael Lang" said:
> Do you mean encode it, so that it is not executable, i.e. filter user
> input before adding it to a message board post? If so, follow this link...

No, but thanks.

> To parse / tokenize HTML tags from an HTML document you may have to make
> your own components? To get started I found:

Yes, I need to parse tags.

Thank you very, very much for this useful and helpful answer.
 
wilk said:
I need to extract all URLs from an HTML file.

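// SgmlReader is a third-party library (namespace Sgml); my_html_file is a file path.
SgmlReader sr = new SgmlReader();
sr.DocType = "HTML";
sr.CaseFolding = CaseFolding.ToLower; // so "//a" also matches <A>
sr.InputStream = new StreamReader(my_html_file);
XmlDocument xdoc = new XmlDocument();
xdoc.Load(sr);
// Copy the matches first, because the loop below modifies the document.
List<XmlNode> links = new List<XmlNode>();
foreach (XmlNode n in xdoc.SelectNodes("//a")) links.Add(n);
foreach (XmlNode n in links)
{
    // Replaces each link with its text; the URL itself is in the href attribute.
    n.ParentNode.ReplaceChild(xdoc.CreateTextNode(n.InnerText), n);
}

//Didn't test it, but it's the way to go.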
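
If the goal is to collect the URLs rather than strip the links, something along these lines might work once the page is in an XmlDocument. This is only a sketch: ExtractLinks, the sample markup, and the base address are invented for illustration, and the document is loaded from a well-formed string here so the example runs without SgmlReader.

```csharp
using System;
using System.Collections.Generic;
using System.Xml;

class LinkExtractor
{
    // Returns the href value of every <a> element, resolved against baseUri.
    static List<Uri> ExtractLinks(XmlDocument doc, Uri baseUri)
    {
        List<Uri> links = new List<Uri>();
        foreach (XmlNode attr in doc.SelectNodes("//a/@href"))
        {
            Uri resolved;
            if (Uri.TryCreate(baseUri, attr.Value, out resolved))
                links.Add(resolved);
        }
        return links;
    }

    static void Main()
    {
        XmlDocument doc = new XmlDocument();
        doc.LoadXml(
            "<html><body>" +
            "<a href=\"/docs/index.html\">Docs</a>" +
            "<a href=\"http://example.org/\">External</a>" +
            "<a name=\"top\">No href here</a>" +
            "</body></html>");

        // Prints http://example.com/docs/index.html and http://example.org/
        foreach (Uri link in ExtractLinks(doc, new Uri("http://example.com/start.html")))
            Console.WriteLine(link);
    }
}
```

Anchors without an href are skipped by the XPath expression, and relative links are resolved against the page address so the result is a list of absolute URLs.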
 