HTML decoding in VB.Net

  • Thread starter: Greg Vereschagin

Greg Vereschagin

I am writing a small application that downloads a web page into a
string variable. Is there a .Net class that allows the HTML to be
easily parsed and navigated? For example, one of the things that I will
do is find the values of all the data elements in a table row.

Thank you in advance for any assistance.

Greg V
 
Hi Greg

You can include a reference to Microsoft.mshtml, which contains functions
that allow you to navigate the DOM. It's not for the faint-hearted though.
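
For anyone who wants to try this route, here is a minimal sketch of how the mshtml DOM might be used on a string that has already been downloaded. It assumes a project reference to Microsoft.mshtml and a Windows machine with Internet Explorer's MSHTML installed. The module and function names (MshtmlSketch, TableRows) are made up for illustration, and the sketch has not been checked against any particular mshtml version. Note that writing markup into the document may run scripts embedded in the page.

```vbnet
Imports System
Imports System.Collections
Imports mshtml   ' assumption: the project references Microsoft.mshtml

Module MshtmlSketch

    ' Loads an HTML string into an in-memory MSHTML document and returns
    ' one ArrayList of cell texts per <tr> element found in it.
    Function TableRows(ByVal html As String) As ArrayList
        Dim doc As New HTMLDocument()
        Dim writer As IHTMLDocument2 = DirectCast(doc, IHTMLDocument2)
        writer.write(html)    ' hand the markup to the parser
        writer.close()

        Dim result As New ArrayList()
        Dim reader As IHTMLDocument3 = DirectCast(doc, IHTMLDocument3)
        For Each row As IHTMLElement In reader.getElementsByTagName("tr")
            Dim cells As New ArrayList()
            For Each cell As IHTMLElement In DirectCast(row, IHTMLTableRow).cells
                cells.Add(cell.innerText)
            Next
            result.Add(cells)
        Next
        Return result
    End Function

    <STAThread()> _
    Sub Main()
        Dim html As String = "<table><tr><td>Alpha</td><td>42</td></tr></table>"
        For Each cells As ArrayList In TableRows(html)
            For Each text As Object In cells
                Console.WriteLine(text)
            Next
        Next
    End Sub

End Module
```

The casts to IHTMLDocument2 and IHTMLDocument3 are part of why this approach feels heavy: the document object exposes its members through several COM interfaces rather than one class.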

HTH

Charles
 
* Greg Vereschagin said:
I am writing a small application that downloads a web page into a
string variable. Is there a .Net class that allows the HTML to be
easily parsed and navigated?

No.
 
I've found that the 1.1 version of the Framework contains a "cross-site
scripting" module that, by default, does not allow certain characters to be
passed to the server. One of these characters is "<", and since all HTML
tags start with this, you'll have trouble unless you disable this feature.

I need to pass HTML to a database occasionally. To leave the security in
place but not get caught up with the problem, I substitute "<" with "[" and
">" with "]". I then just replace the characters on the server (once the
data has successfully arrived there).
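
The substitution described above can be sketched roughly as follows. The function names are hypothetical, and in the setup described the first step would presumably run in the browser before the form is posted; both directions are shown in VB.Net here only to make the mapping explicit. One caveat that is my own observation, not part of the post: any "[" or "]" already present in the text will also be turned into angle brackets on the way back, so the round trip is only safe for input that contains no square brackets.

```vbnet
Imports System

Module BracketEscapeSketch

    ' Swaps angle brackets for square brackets so the posted value
    ' contains no "<" or ">" characters.
    Function EscapeAngleBrackets(ByVal html As String) As String
        Return html.Replace("<", "[").Replace(">", "]")
    End Function

    ' Reverses the swap once the data has arrived on the server.
    ' Caution: this also converts square brackets that were in the
    ' original text.
    Function RestoreAngleBrackets(ByVal escaped As String) As String
        Return escaped.Replace("[", "<").Replace("]", ">")
    End Function

    Sub Main()
        Dim original As String = "<td>42</td>"
        Dim sent As String = EscapeAngleBrackets(original)
        Console.WriteLine(sent)                         ' [td]42[/td]
        Console.WriteLine(RestoreAngleBrackets(sent))   ' <td>42</td>
    End Sub

End Module
```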
 
Hi Greg,

As an addition to Charles's reply:

With mshtml, html.outertext gives you almost your complete document, with the
exception of the preceding doctype string, which is used more and more
nowadays.

I hope this helps,

Cor
 
Scott M. said:
I've found that the 1.1 version of the Framework contains a "cross-site
scripting" module that, by default, does not allow certain characters to be
passed to the server. One of these characters is "<", and since all HTML
tags start with this, you'll have trouble unless you disable this feature.

I need to pass HTML to a database occasionally. To leave the security in
place but not get caught up with the problem, I substitute "<" with "[" and
">" with "]". I then just replace the characters on the server (once the
data has successfully arrived there).


You want to check out HttpServerUtility.HtmlEncode / HtmlDecode for this,
Scott.
It takes care of the whole kit and caboodle. It's also fantastic when you're
persisting to an XML file.
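
As far as I know, the encoding helpers referred to here are the HtmlEncode and HtmlDecode methods. A small sketch, assuming a project reference to System.Web.dll; it uses the shared methods on HttpUtility so that it can run outside a page, whereas inside an ASP.NET page the same thing is available as Server.HtmlEncode and Server.HtmlDecode. The module name and sample string are made up.

```vbnet
Imports System
Imports System.Web   ' assumption: the project references System.Web.dll

Module HtmlEncodeSketch

    Sub Main()
        Dim raw As String = "<td>Fish & chips</td>"

        ' Markup characters become character entities.
        Dim encoded As String = HttpUtility.HtmlEncode(raw)
        Console.WriteLine(encoded)   ' &lt;td&gt;Fish &amp; chips&lt;/td&gt;

        ' Decoding restores the original string.
        Dim decoded As String = HttpUtility.HtmlDecode(encoded)
        Console.WriteLine(decoded = raw)   ' True
    End Sub

End Module
```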


hth
Richard

* May the Framework Be With You *
 
If you have any control over the web page, then you could ask for it to
be delivered as XML.

Otherwise, use a Regex on it.
Or else use that nasty mshtml business Cor was talking about.
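
For the table-row case from the original question, the regex route could look something like the sketch below. The pattern and the names (RegexTableSketch, CellValues) are my own illustration, not something from the thread. It only copes with simple, well-formed td elements: nested tables, th cells, and unclosed tags will trip it up, and any markup inside a cell is returned as-is.

```vbnet
Imports System
Imports System.Collections
Imports System.Text.RegularExpressions

Module RegexTableSketch

    ' Returns the trimmed contents of each <td>...</td> pair in one row.
    Function CellValues(ByVal rowHtml As String) As ArrayList
        Dim values As New ArrayList()
        ' \b stops the pattern from matching longer tag names;
        ' (.*?) lazily captures everything up to the next </td>.
        Dim pattern As String = "<td\b[^>]*>(.*?)</td>"
        Dim options As RegexOptions = RegexOptions.IgnoreCase Or RegexOptions.Singleline
        For Each m As Match In Regex.Matches(rowHtml, pattern, options)
            values.Add(m.Groups(1).Value.Trim())
        Next
        Return values
    End Function

    Sub Main()
        Dim row As String = "<tr><td>Alpha</td><td class=""n""> 42 </td></tr>"
        For Each value As Object In CellValues(row)
            Console.WriteLine(value)   ' prints Alpha, then 42
        Next
    End Sub

End Module
```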

hth
Richard
 
I am aware of Encode/Decode but I have found that the framework (1.1)
doesn't allow me to pass either "<" or "&lt;". Either way, the cross-site
scripting module balks.


spamfurnace said:
You want to check out HttpServerUtility.HtmlEncode / HtmlDecode for this,
Scott.
It takes care of the whole kit and caboodle. It's also fantastic when you're
persisting to an XML file.


hth
Richard

* May the Framework Be With You *
 
I've just released v1.2 of my HTML 4 parser - it produces the DOM
you're after and can also be used to generate XHTML.

http://www.planet-source-code.com/vb/scripts/ShowCode.asp?txtCodeId=2201&lngWId=10

Rgds,
 