reading text from .htm file

mp · Dec 17, 2010

I just made a quick app to browse one or more folders of .htm files.
I put a WebBrowser control on a form, along with a previous button, a next
button, and a save button.

my aim is to quickly browse through the files, and if i find a page that has
useful info on it i want to save the "text information" to a text
file (ASCII); that's what the "save" button will be for.

I found the property webBrowser1.DocumentText, which returns the html source
of that page as a string.
this string includes all the html tags etc., not just the written "text"
showing up on the page.

what i'm wondering is: do i need to write a routine to parse the html tags to
detect what is the actual "text" showing up on the page (as opposed to
attributes, formatting, etc.), or is there some kind of existing object that
can parse that text and return the actual "words" which appear on the page?

in other words, if i look at a page in the browser, i could manually
copy/paste the "text" that shows up on the page.
is there a built-in way to do that programmatically, or do i need to create my
own html parsing routine?

thanks
mark

Arne Vajhøj · Dec 18, 2010

If you need to parse all types of pages and not just some simple
standardized pages, then writing your own parser will be a lot
of work.

Many people have very bad experiences with the embedded browser
component.

Most people seem to like:
http://htmlagilitypack.codeplex.com/

Arne
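
A minimal sketch of what the Html Agility Pack route might look like, assuming the library is referenced in the project. The class name HtmlText and the method Extract are hypothetical names chosen for illustration, and dropping script and style elements before reading the text is an assumption about what counts as "visible".

```csharp
using System.Linq;
using HtmlAgilityPack; // third-party library, not part of the .NET Framework

static class HtmlText
{
    // Hypothetical helper: returns roughly the text a reader would see.
    public static string Extract(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        // Script and style contents are not visible text, so drop them first.
        var hidden = doc.DocumentNode.SelectNodes("//script|//style");
        if (hidden != null) // SelectNodes returns null when nothing matches
        {
            foreach (var node in hidden.ToList())
                node.Remove();
        }

        // InnerText concatenates the remaining text nodes;
        // DeEntitize turns entities such as &amp; back into characters.
        return HtmlEntity.DeEntitize(doc.DocumentNode.InnerText).Trim();
    }
}
```

With a WebBrowser on the form, this could be called as HtmlText.Extract(webBrowser1.DocumentText). Whitespace and line breaks in the result follow the markup, not the rendered layout, so some cleanup may still be needed.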

mp · Dec 18, 2010

Peter Duniho said:
[...]
what i'm wondering is do i need to write a routine to parse the html tags to
detect what is the actual "text" showing up on the page (as opposed to
attributes, formatting, etc)...or is there some kind of existing object that
can parse that text and return the actual "words" which appear on the page?

[...]
I would say that in general, you do not want to waste your time writing
your own HTML parser. If the HtmlDocument doesn't provide sufficient
capability for your needs, I would look at third-party libraries, such as
the Html Agility Pack.

Pete

thanks i'll check that out
mark
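
For the HtmlDocument that the WebBrowser control already exposes, a rough sketch of the save button's logic could look like the following. PageTextSaver, SaveVisibleText and SaveAsAscii are hypothetical names, and the output path is assumed to be chosen by the caller; whether Body.InnerText is close enough to a manual copy/paste depends on the pages.

```csharp
using System.IO;
using System.Text;
using System.Windows.Forms;

static class PageTextSaver
{
    // Hypothetical helper: call from the save button's Click handler.
    public static void SaveVisibleText(WebBrowser browser, string path)
    {
        // Document is null until a page has loaded; guard Body as well.
        HtmlDocument doc = browser.Document;
        if (doc == null || doc.Body == null)
            return;

        // InnerText returns the element's text without the markup.
        SaveAsAscii(doc.Body.InnerText ?? string.Empty, path);
    }

    public static void SaveAsAscii(string text, string path)
    {
        // Encoding.ASCII replaces characters outside 7-bit ASCII with '?'.
        File.WriteAllText(path, text, Encoding.ASCII);
    }
}
```

A handler might then call PageTextSaver.SaveVisibleText(webBrowser1, somePath) once the page has finished loading, for example after the DocumentCompleted event has fired.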

mp · Dec 18, 2010

thanks, i'll check that out
mark
