HTML parsing

  • Thread starter: Sunny

Hi all,
I need to fetch the HTML content of a given URL, inspect its links and
images, change a few things in it, and save the result. I already have the
download part working:
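System.Net.WebClient source = new System.Net.WebClient();
StreamReader mr = null;
try
{
    // Download the page and read the whole response into a string.
    mr = new StreamReader(source.OpenRead(sUrl));
    sWebPage = mr.ReadToEnd();
}
catch
{
    // On any failure, count the page as done and give up on it.
    oParent.PagesDone++;
    return;
}
finally
{
    // Always release the reader (and the underlying stream).
    if (mr != null)
        mr.Close();
}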

After that I make a lot of IndexOf and Replace calls to change the things I
want, and it is sooooo ugly :( and slow.

So I decided to see if I can use MSHTML's IHTMLDocument2 interface to find
and change just the values I need, but so far I couldn't figure out how to
put the string into an IHTMLDocument2 object, or how to get it back out
afterwards in order to save it.

Any clues will be highly appreciated.

Thanks

Sunny
 
Well, I'm no mshtml expert, but you might be able to get away with just
using IHTMLDocument4.createDocumentFromUrl to load the document directly
from the URL into an HTML document object (any given HTML document class
should support IHTMLDocument through IHTMLDocument5). I'm not sure if it'll
work, but you can try.

If there is an mshtml-specific newsgroup, you should be able to ask there, as
the managed mshtml is just a PIA wrapper around the COM objects. I'm sure
the talent pool in relation to mshtml would be far greater there.
 
Thanks Daniel,
but I'm a little confused, because all the examples I have found so far use
the IHTMLDocument2 interface, so I'm worried about compatibility, i.e. on
which systems is 5 available, and on which 2?
Also, you mention a PIA. Are there official PIAs? Maybe I'm not searching
right, but I cannot find any. I just added a reference to mshtml.dll and
VS .NET generated an interop assembly, but since MS provides official PIAs
for some products (I know for sure there are some for Office XP), I wondered
whether there are any for mshtml.
I'll search again and also post this question in other groups, but one
little :) push would be very helpful :)

Thanks again
Sunny

P.S. If there is any other way to solve my problem, I'm open to any
suggestion :)
 
Well, I can't remember if there is an mshtml interop DLL for VS.NET 2002, but
I am very sure there is one for VS.NET 2003. In your Add Reference dialog,
scroll down and look for Microsoft.mshtml.dll; it should be there.
Likewise for Borland C# Builder, there is a Borland.mshtml.dll included in
the install folder (just in case you or anyone who reads this is curious).

Anyway, as for the IHTMLDocument interfaces, use of:
IHTMLDocument and IHTMLDocument2 requires IE 4.
IHTMLDocument3 requires IE 5.
IHTMLDocument4 requires IE 5.5.
IHTMLDocument5 requires IE 6.
I consider it a pretty safe bet that IE 5.5+ is installed, and 6 has been
out a while and Windows Update pushes it.

IHTMLDocument2 provides most of the functionality you need, hence it gets
most of the exposure in examples. Because of the way COM works, each
individual interface is separate. As such, you will need both an
IHTMLDocument2 and an IHTMLDocument4 typed reference to your document object.
It is a pain, but it is part of the price of COM interop.
So, something like

IHTMLDocument2 myDocument = <get your document>;
IHTMLDocument4 myDocument4 = (IHTMLDocument4)myDocument;

should do the job of giving you both typed references.
 
Hi Daniel,
I have solved the problem. I'll post the solution in case someone needs it
(it was a little bit hard for me to find).

So, I am reading the URL with the code already posted at the beginning.
Now sWebPage holds the text content of the page. This text is placed in
an HTMLDocumentClass object as follows:

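HTMLDocumentClass myDoc;
try
{
    // write() takes an object array, so wrap the page text in one.
    object[] oPageText = {sWebPage};
    myDoc = new HTMLDocumentClass();
    // write() is declared on IHTMLDocument2, so cast to that interface.
    IHTMLDocument2 oMyDoc = (IHTMLDocument2)myDoc;
    oMyDoc.write(oPageText);
}
catch
{
    //handle
}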

And now you can do whatever you want to the document. When you are done, the
HTML text is in myDoc.documentElement.outerHTML.
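
As a rough sketch of that "do whatever you want" step, the snippet below walks the links and images of the loaded document and then saves the result. It is only an illustration under some assumptions: the method name RewriteAndSave, the outputPath parameter and the alt-text change are hypothetical, the call to close() after write() is an addition that is not in the code above, and the calling thread is assumed to be an STA thread (MSHTML objects are apartment-threaded). It needs a reference to Microsoft.mshtml.dll and only runs on Windows.

```csharp
using System;
using System.IO;
using mshtml;

class PageRewriter
{
    // Hypothetical helper: loads the page text, lists its links,
    // tweaks its images and writes the resulting HTML to outputPath.
    public static void RewriteAndSave(string sWebPage, string outputPath)
    {
        HTMLDocumentClass myDoc = new HTMLDocumentClass();
        IHTMLDocument2 doc = (IHTMLDocument2)myDoc;
        doc.write(new object[] { sWebPage });
        // Assumption: closing the write stream lets the parser finish.
        doc.close();

        // doc.links contains the <a href> and <area href> elements.
        foreach (IHTMLElement el in doc.links)
        {
            IHTMLAnchorElement anchor = el as IHTMLAnchorElement;
            if (anchor != null)
                Console.WriteLine("link: " + anchor.href);
        }

        // doc.images contains the <img> elements.
        foreach (IHTMLImgElement img in doc.images)
        {
            Console.WriteLine("image: " + img.src);
            // Hypothetical change: give images without alt text a default.
            if (img.alt == null || img.alt.Length == 0)
                img.alt = "image";
        }

        // Read the modified markup back out and save it.
        string html = myDoc.documentElement.outerHTML;
        using (StreamWriter writer = new StreamWriter(outputPath))
        {
            writer.Write(html);
        }
    }
}
```

If the collections come back empty, the document may not have finished parsing yet; checking doc.readyState before reading them is one thing to try.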

This link was very useful in solving that problem (Thanks Alex):

http://www.csharphelp.com/archives/archive146.html

Sunny
 
Ahh, much cleaner solution in your particular case. Hopefully the next time
someone asks this question (and someone will), I may actually remember the
solution you found.

Daniel O'Connell
 