Download HTML As Plain Text

  • Thread starter: Doominato

Doominato

Good day,

I was wondering how I can download a web page as plain text from a certain
web site. I have tried the OpenURL() method of the INET control in my VB.NET
app, but it returns elements such as <BR> within the plain text. Is there a
way to filter them out, or to simply download the page as plain text?

Any help would be greatly appreciated.
 
Doominato said:
I was just wondering how can I download a web page as plain text from a
certain web site. I have tried to use the OpenURL() method from INET
control in my VB.NET app, but it returns elements such as this <BR>
within the plain text. Is there a way to filter them or to simply
download the page as plain text?

No. Web pages are not plain text; they are HTML. If you download one, it
will always arrive in its original format, which is HTML.

To have it as plain text you will need to convert it.
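
A minimal sketch of one such conversion, written in Python rather than VB.NET purely for illustration. The helper name strip_tags and the sample string are hypothetical and not from this thread. It assumes the page is already in a string and that a simple pattern match is good enough; it does not handle comments, scripts, or a ">" inside an attribute value.

```python
import html
import re

# Matches anything that looks like a tag: "<", a run of non-">" characters, ">".
TAG_PATTERN = re.compile(r"<[^>]+>")


def strip_tags(markup: str) -> str:
    """Remove tag-like spans, then decode entities such as &amp;."""
    without_tags = TAG_PATTERN.sub("", markup)
    return html.unescape(without_tags)


# Hypothetical input, only to show the call.
print(strip_tags("<p>Max temp &amp; min temp: <b>21</b>, 9</p>"))
```

Entities are decoded after the tags are removed, so an escaped "&lt;" in the text is not mistaken for the start of a tag.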


--
Chad Z. Hower (a.k.a. Kudzu) - http://www.hower.org/Kudzu/
"Programming is an art form that fights back"

Make your ASP.NET applications run faster
http://www.atozed.com/IntraWeb/
 
Thanks for the reply,

I realize that, but I should have said that the page is in HTML format but
contains plain text (by the way, this is the type of page that I'm talking
about:
http://www.wunderground.com/history/station/71624/2004/5/18/DailyHistory.html?format=1).
If you look at it and its source, you will see that they look pretty much
the same, except that the source contains tags such as <BR>. So the
question is: how do I remove these tags and convert it to plain text?

Thanks
 
It seems really stupid of Weather Underground to provide that junk instead of an actual comma-delimited file. I'd try to find someone in their computer department who could create real CSV files (I don't know whose idea it was to do it like that).

Meanwhile, if you can get the entire document into a string, you can use Replace(wholeDoc, "<BR>", vbCrLf) and then write the result to a real CSV file.

Also, you've probably noticed that there are no line breaks separating the rows of data in the source, which makes replacing each <BR> with a CRLF even more critical!
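
The same replacement sketched in Python (not VB.NET), in case the page mixes variants such as <br> or <br /> that a literal replace of "<BR>" would miss. The function name br_to_crlf and the sample rows are made up for illustration; they are not taken from the Weather Underground page.

```python
import re

# <br>, <BR>, <br/> and <br /> all count as a line break.
BR_PATTERN = re.compile(r"<br\s*/?>", re.IGNORECASE)


def br_to_crlf(whole_doc: str) -> str:
    """Replace every <BR>-style tag with a CRLF pair."""
    return BR_PATTERN.sub("\r\n", whole_doc)


# Hypothetical data in the one-long-line shape described above.
sample = "Time,TemperatureC<BR>12:00 AM,9.0<br />1:00 AM,8.5<BR>"
print(br_to_crlf(sample))
```

When saving the result from Python, opening the file with newline="" keeps the CRLF pairs from being translated a second time.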

Good Luck!
--Michael
 
Hello,

I got the upper hand on this and was able to strip out all the tags, so now
I have a clean CSV file.

Thank you so much for your help.
 
Hi Doominato

In addition to the others:

In the HTML Document Object Model, every element has the properties
InnerText and OuterHTML.

InnerText is the text between the element's tags, without markup; OuterHTML
is the markup including the tags.

The OuterHTML of the HTML element is almost always the complete document,
including all tags, but without the declaration line (DOCTYPE) that, oddly
enough, now precedes more and more HTML pages and which, as far as I know,
is unreachable through the Document Object Model.
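
To illustrate the idea of reading only the text between the tags, here is a rough sketch using Python's standard html.parser module instead of the browser DOM. The names TextCollector and inner_text are hypothetical. Treating <br> as a newline is an assumption made here to suit the CSV case in this thread, not something the DOM property is claimed to do in exactly this way.

```python
from html.parser import HTMLParser


class TextCollector(HTMLParser):
    """Collects the character data between tags; <br> becomes a newline."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.parts = []

    def handle_starttag(self, tag, attrs):
        # Tag names arrive lowercased; <br/> is also routed through here.
        if tag == "br":
            self.parts.append("\n")

    def handle_data(self, data):
        self.parts.append(data)


def inner_text(markup):
    """Return the text content of the markup with all tags dropped."""
    parser = TextCollector()
    parser.feed(markup)
    parser.close()
    return "".join(parser.parts)


print(inner_text("<html><body>a,b<BR>1,2</body></html>"))
```

Unlike a pattern-based replace, a parser also decodes entities and does not care how the tags are capitalized.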

I hope this helps?

Cor
 
Doominato said:
I was just wondering how can I download a web page as plain text from a
certain web site. I have tried to use the OpenURL() method from INET control
in my VB.NET app, but it returns elements such as this <BR> within the plain
text. Is there a way to filter them or to simply download the page as plain
text?

Nice algorithm, implemented in VB6:

<URL:http://groups.google.com/groups?selm=ebXm3efoCHA.1976@TK2MSFTNGP10>
 
Cor Ligthert said:
Take a look at MSHTML sometime; this is very amateurish in my opinion.

I know that it's possible with MSHTML, but Olaf's algorithm, in VB6, is
/very/ fast, and it is often good enough. I am not sure whether it will work
with the "shorttag" option and similar features enabled.
 