C# .NET: get text between body tags of an HTML file

rhitam

Hi all,

I am trying to read an HTML file and retrieve only the text between
the body tags of that file. For reading a string between two other
strings, I already have a function:

http://www.mycsharpcorner.com/Post.aspx?postID=15

The problem is that the body tag might have attributes. In
that case I don't know how to exclude them and get only the text
between the tags, i.e. something like this:

<body style="margin:0;padding:0">
...
..
..
..
</body>

Any ideas?

Regards,
Rhitam
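
One way to handle a body tag that carries attributes is a regular expression that skips everything up to the closing ">" of the opening tag. The following is only a minimal sketch, not the function from the link above: the names BodyExtractor and GetBodyInnerHtml are hypothetical, and it assumes that no attribute value contains a ">" character and that the file has a single body element.

```csharp
using System;
using System.Text.RegularExpressions;

public static class BodyExtractor
{
    // "<body", then any attributes up to the first ">", then a lazy capture
    // of everything up to the closing </body> tag.
    // Assumption: attribute values do not contain a ">" character.
    private static readonly Regex BodyPattern = new Regex(
        @"<body\b[^>]*>(?<content>.*?)</body\s*>",
        RegexOptions.IgnoreCase | RegexOptions.Singleline);

    // Returns the markup between the body tags,
    // or null when no body element is found.
    public static string GetBodyInnerHtml(string html)
    {
        if (html == null)
        {
            return null;
        }

        Match match = BodyPattern.Match(html);
        return match.Success ? match.Groups["content"].Value : null;
    }
}
```

Calling BodyExtractor.GetBodyInnerHtml on the text read from the file returns the markup inside the body element, whether or not the opening tag has a style or other attribute. A regular expression is not a full HTML parser, so badly formed pages may need a DOM-based approach instead.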
 
Cor Ligthert[MVP]

Be aware that what you ask is almost impossible in general, because there is
usually not only text between the body tags, but also images, Flash,
JavaScript, etc.

But to get what is between the body tags you need MSHTML (the namespace
that wraps the DOM); how you use it depends on how you retrieve the page.

Cor
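
To illustrate the point about non-text content: if only the visible text is wanted, the script and style blocks and the remaining tags have to be removed from whatever sits between the body tags. Below is a rough sketch of that idea using regular expressions rather than MSHTML. The names HtmlTextExtractor and ToPlainText are hypothetical, and System.Net.WebUtility assumes .NET Framework 4.0 or later (on older versions System.Web.HttpUtility.HtmlDecode plays the same role).

```csharp
using System;
using System.Net;
using System.Text.RegularExpressions;

public static class HtmlTextExtractor
{
    // Reduces a fragment of HTML to its visible text.
    // This is a heuristic, not a real HTML parser.
    public static string ToPlainText(string bodyHtml)
    {
        if (string.IsNullOrEmpty(bodyHtml))
        {
            return string.Empty;
        }

        // Drop script and style elements together with their contents.
        string text = Regex.Replace(
            bodyHtml,
            @"<(script|style)\b[^>]*>.*?</\1\s*>",
            " ",
            RegexOptions.IgnoreCase | RegexOptions.Singleline);

        // Drop HTML comments.
        text = Regex.Replace(text, @"<!--.*?-->", " ", RegexOptions.Singleline);

        // Replace every remaining tag (img, object, b, p, ...) with a space.
        text = Regex.Replace(text, @"<[^>]+>", " ");

        // Turn entities such as &amp; back into characters.
        text = WebUtility.HtmlDecode(text);

        // Collapse runs of whitespace into single spaces.
        return Regex.Replace(text, @"\s+", " ").Trim();
    }
}
```

Images and Flash objects simply disappear from the result, which may or may not be acceptable depending on what the offline content is supposed to contain.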
 
rhitam

All the HTML pages I need to parse are already located on the same
machine as the server. I am actually trying to create a Word document
using XML and an XSL transform with C#. That part is done, and to
provide a set of offline content I have to append the HTML contents
of a set of web pages at the end of the document. Do you still
think MSHTML is the only way? I tried searching for htmlcontainerclass
but could not find any useful code sample. Maybe someone could provide
one? I will be using the C# code in a DLL which will be
called from classic ASP.

-Rhitam
 
Cor Ligthert[MVP]

http://msdn.microsoft.com/en-us/library/bb498651(VS.85).aspx

Have a look at this one; the document is an MSHTML document:
http://msdn.microsoft.com/en-us/library/system.windows.forms.webbrowser.document.aspx


Cor
 
rhitam

That was helpful, but I am still a little stuck. I wrote the
following code in a simple C# .NET console application using Visual C#
2005 Express Edition:


StreamReader TopLinkStream = new StreamReader(FilePath);
string TopLinkHtml = TopLinkStream.ReadToEnd();
IHTMLDocument2 doc = new HTMLDocumentClass();
doc.write(new object[] { TopLinkHtml }); // -- throws error here
HTMLDocumentClass domdoc = (HTMLDocumentClass)doc;
string BodyElem = domdoc.body.innerHTML;


The debugger reports an error at the line indicated, i.e.:


doc.write(new object[] { TopLinkHtml });


At this point IE shows an error saying 'Object expected'. I then just
click 'No' and debugging proceeds. The innerHTML is also
loaded correctly in the 'doc' variable. How do I avoid that error?


Regards,

Rhitam
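
'Object expected' is a script runtime error, so one possible cause (an assumption, not something confirmed in this thread) is that script blocks inside the page run when the HTML is written into the MSHTML document. A sketch of one workaround is to remove the script elements from the string before it is handed to doc.write. The names ScriptStripper and RemoveScripts are hypothetical.

```csharp
using System;
using System.Text.RegularExpressions;

public static class ScriptStripper
{
    // Matches a whole script element: opening tag with any attributes,
    // its contents, and the closing tag.
    private static readonly Regex ScriptPattern = new Regex(
        @"<script\b[^>]*>.*?</script\s*>",
        RegexOptions.IgnoreCase | RegexOptions.Singleline);

    // Returns the HTML with all script elements removed.
    // Inline handlers such as onload="..." are NOT removed by this sketch.
    public static string RemoveScripts(string html)
    {
        if (string.IsNullOrEmpty(html))
        {
            return html;
        }

        return ScriptPattern.Replace(html, string.Empty);
    }
}
```

With this in place the string passed to doc.write would be ScriptStripper.RemoveScripts(TopLinkHtml) instead of TopLinkHtml. If the dialog still appears, the script is probably coming from somewhere else, for example an inline event handler attribute.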
 
