c#.NET get text between body tags of an html file

rhitam · May 5, 2009

Hi all,

I am trying to read an HTML file and retrieve only the text between
the body tags of that file. For reading a string between two
strings, I already have a function:

http://www.mycsharpcorner.com/Post.aspx?postID=15

But the problem is that the body tag might have some attributes. In
that case I don't know how to exclude them and get only the text
between the tags, i.e. something like this:

<body style="margin:0;padding:0">
...
..
..
..
</body>

Any ideas?

Regards,
Rhitam
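
For the attribute problem described above, one option is a regular expression that tolerates anything inside the opening tag. This is only a minimal sketch, not the function from the linked post: the names BodyExtractor and GetBodyInnerHtml are made up for illustration, and it assumes the file has a single body element and that no attribute value contains a '>' character.

```csharp
using System;
using System.Text.RegularExpressions;

public static class BodyExtractor
{
    // Matches <body ...> with any attributes, then lazily captures
    // everything up to the first closing </body> tag.
    private static readonly Regex BodyPattern = new Regex(
        @"<body\b[^>]*>(?<inner>.*?)</body\s*>",
        RegexOptions.IgnoreCase | RegexOptions.Singleline);

    // Returns the markup between the body tags, or null if no body is found.
    public static string GetBodyInnerHtml(string html)
    {
        if (html == null) return null;
        Match m = BodyPattern.Match(html);
        return m.Success ? m.Groups["inner"].Value : null;
    }
}
```

Calling BodyExtractor.GetBodyInnerHtml(File.ReadAllText(path)) would return whatever sits between the tags, markup included. A regex does not understand HTML, so malformed pages or a literal "</body>" inside a script can fool it; a real parser is more robust.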

Cor Ligthert[MVP] · May 5, 2009

Be aware that what you ask is almost impossible, because there is usually
not only text between the body tags, but also images, Flash, JavaScript, etc.

But to get the things between the body tags you need MSHTML (the namespace
around the DOM). How you use it depends on how you retrieve the page.

Cor

rhitam · May 5, 2009

All the HTML pages I need to parse are already located on the same
machine as the server. Actually, I am trying to create a Word document
using XML and an XSL transform with C#. That part is done, and only
to provide a set of offline content, I have to append the HTML
contents of a set of web pages at the end of the document. Do you still
think MSHTML is the only way? I tried searching for htmlcontainerclass
but could not find any useful code sample. Maybe someone could provide
a code sample? I will be using the C# code in a DLL which will be
called from classic ASP.

-Rhitam

Cor Ligthert[MVP] · May 5, 2009

http://msdn.microsoft.com/en-us/library/bb498651(VS.85).aspx

Have a look at this one as well; the document there is an MSHTML document:
http://msdn.microsoft.com/en-us/library/system.windows.forms.webbrowser.document.aspx

Cor

rhitam · May 5, 2009

That was helpful, but I am still a little stuck. I wrote the
following code in a simple C#.NET console application using Visual C#
2005 Express Edition:

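// Requires a reference to Microsoft.mshtml, plus: using System.IO; using mshtml;
string TopLinkHtml;
using (StreamReader TopLinkStream = new StreamReader(FilePath))
{
    TopLinkHtml = TopLinkStream.ReadToEnd(); // read the whole file
}
IHTMLDocument2 doc = new HTMLDocumentClass();
doc.write(new object[] { TopLinkHtml }); // -- throws error here
doc.close(); // finish writing so the DOM is complete
HTMLDocumentClass domdoc = (HTMLDocumentClass)doc;
string BodyElem = domdoc.body.innerHTML;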

The debugger stops with an error at the line indicated, i.e.

doc.write(new object[] { TopLinkHtml });

At this point IE shows an error saying 'Object expected'. I then
click 'No', and debugging proceeds. The innerHTML is also
loaded correctly in the 'doc' variable. How do I avoid that error?

Regards,

Rhitam
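
'Object expected' is a script runtime error, so one possible cause (an assumption, not something confirmed in this thread) is that script blocks in the page run while MSHTML parses the string passed to doc.write. A minimal sketch of a workaround is to remove script elements from the string before writing it. The names ScriptStripper and RemoveScripts are hypothetical, and inline handlers such as onload="..." are not touched.

```csharp
using System;
using System.Text.RegularExpressions;

public static class ScriptStripper
{
    // Matches a whole <script ...>...</script> element, case-insensitively,
    // with '.' allowed to span line breaks.
    private static readonly Regex ScriptBlock = new Regex(
        @"<script\b[^>]*>.*?</script\s*>",
        RegexOptions.IgnoreCase | RegexOptions.Singleline);

    // Returns the HTML with every script element removed.
    public static string RemoveScripts(string html)
    {
        return html == null ? null : ScriptBlock.Replace(html, string.Empty);
    }
}
```

In the code above, that would mean passing ScriptStripper.RemoveScripts(TopLinkHtml) to doc.write instead of TopLinkHtml. If the error persists, the script is probably coming from an attribute or an external resource rather than a script element.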
