HTML links

Aristotelis Pitaridis · Jul 3, 2006

I am trying to extract all the links and image URLs from an HTML file. I
tried reading the file byte by byte to detect the URLs, but that did not
work because JavaScript and other content in the HTML file caused problems.
Is there a class that handles this kind of data extraction well?

Aristotelis

Jared Parsons [MSFT] · Jul 3, 2006

Hello Aristotelis,

You could load the page into System.Windows.Forms.HtmlDocument. It has a
GetElementsByTagName method that you can use to get all of the links.

Aristotelis Pitaridis · Jul 3, 2006

I tried it but the System.Windows.Forms.HtmlDocument object does not have a
constructor. How can I set the URL of the page in order to collect the
various information?

Aristotelis

Jared Parsons [MSFT] · Jul 3, 2006

Hello Aristotelis,

It looks like you'll have to create an instance of the WebBrowser control.
That will give you access to the underlying HtmlDocument which you can then
query.

Cor Ligthert [MVP] · Jul 3, 2006

Aristotelis,

They (we and others, including me) use that JavaScript to prevent things
such as spamming. You want us to give you a method to get around it and post
it on this board. Even if I knew one, my answer would be: no way.

:-)

Cor

Aristotelis Pitaridis · Jul 4, 2006

I think there will be a problem with JavaScript. If I load a page that
contains a JavaScript alert message box, the whole process will stop and the
user will see that window on the screen. Is there a way to disable
JavaScript execution for a WebBrowser control?

Aristotelis

intolerance · Jul 4, 2006

http://www.regular-expressions.net/examples.html

This has a great tutorial about grabbing HTML tags.

-Allen

Herfried K. Wagner [MVP] · Jul 4, 2006

I suggest using an HTML parser instead of regular expressions for this
purpose:

Parsing an HTML file:

MSHTML Reference
<URL:http://msdn.microsoft.com/library/default.asp?url=/workshop/browser/mshtml/reference/reference.asp>

- or -

.NET Html Agility Pack: How to use malformed HTML just like it was
well-formed XML...
<URL:http://blogs.msdn.com/smourier/archive/2003/06/04/8265.aspx>

Download:

<URL:http://www.codefluent.com/smourier/download/htmlagilitypack.zip>

- or -

SgmlReader 1.4
<URL:http://www.gotdotnet.com/Community/...mpleGuid=B90FDDCE-E60D-43F8-A5C4-C3BD760564BC>
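
As a rough illustration of the parser-based approach (not one of the .NET
libraries listed above), here is a minimal sketch using Python's standard
html.parser module. The class name LinkAndImageCollector and the helper
extract_links_and_images are made up for this example. Because this parser
treats the contents of script elements as raw text, markup that appears
inside JavaScript strings is not reported as tags.

```python
from html.parser import HTMLParser


class LinkAndImageCollector(HTMLParser):
    """Collects href values from <a> tags and src values from <img> tags."""

    def __init__(self):
        super().__init__()
        self.links = []
        self.images = []

    def handle_starttag(self, tag, attrs):
        # The parser lower-cases tag and attribute names before calling this.
        attributes = dict(attrs)
        if tag == "a" and attributes.get("href"):
            self.links.append(attributes["href"])
        elif tag == "img" and attributes.get("src"):
            self.images.append(attributes["src"])


def extract_links_and_images(html_text):
    """Return (links, images) found in html_text, in document order."""
    collector = LinkAndImageCollector()
    collector.feed(html_text)
    collector.close()
    return collector.links, collector.images
```

Calling extract_links_and_images(html_text) returns two lists: the href
values and the src values. Relative URLs come back as written; resolving
them against a base URL is left out of this sketch.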

If the file is in XHTML format, you can use the classes contained in the
'System.Xml' namespace for reading information from the file.
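
For the XHTML case, a similar sketch with Python's xml.etree.ElementTree,
standing in here for the System.Xml classes. It assumes the document is
well-formed and declares the standard XHTML namespace; the function name
extract_from_xhtml is hypothetical. Named entities such as &nbsp; may need
extra handling with a plain XML parser, so treat this as a starting point
only.

```python
import xml.etree.ElementTree as ET

# Assumption: the document uses the standard XHTML namespace.
XHTML_NS = "{http://www.w3.org/1999/xhtml}"


def extract_from_xhtml(xhtml_text):
    """Return (links, images) from a well-formed XHTML string."""
    root = ET.fromstring(xhtml_text)
    # iter() walks the whole tree, so nested elements are included.
    links = [a.get("href") for a in root.iter(XHTML_NS + "a") if a.get("href")]
    images = [img.get("src") for img in root.iter(XHTML_NS + "img") if img.get("src")]
    return links, images
```

An XML parser rejects input that is not well-formed, so this route only
fits files that really are valid XHTML.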
