HTML links

  • Thread starter: Aristotelis Pitaridis

Aristotelis Pitaridis

I am trying to extract all the links and image URLs from an HTML file. I
tried reading the file byte by byte to detect the URLs, but that did not
work because JavaScript and other content in the HTML file caused problems.
Is there a class that handles this kind of data extraction well?



Aristotelis
 
Hello Aristotelis,
I am trying to extract all the links and the image URLs from an HTML
file. I tried to read byte - byte the information in order to detect
the URLs but it did not work because some JavaScript or other
information in the HTML file caused problems. Is there any class which
works ok with this kind of data extraction?

You could load the page into System.Windows.Forms.HtmlDocument. It has a
GetElementsByTagName method that you can use to get all of the links.
 
I tried it, but System.Windows.Forms.HtmlDocument does not have a public
constructor. How can I set the URL of the page so that I can collect the
information I need?



Aristotelis



Jared Parsons said:
Hello Aristotelis,
I am trying to extract all the links and the image URLs from an HTML
file. I tried to read byte - byte the information in order to detect
the URLs but it did not work because some JavaScript or other
information in the HTML file caused problems. Is there any class which
works ok with this kind of data extraction?

You could load the page into System.Windows.Forms.HtmlDocument. It has a
GetElementsByTagName method that you can use to get all of the links.

--
Jared Parsons [MSFT]
(e-mail address removed)
All opinions are my own. All content is provided "AS IS" with no
warranties, and confers no rights.
 
Hello Aristotelis,
I tried it but the System.Windows.Forms.HtmlDocument object does not
have a constructor. How can I set the URL of the page in order to
collect the various information?

It looks like you'll have to create an instance of the WebBrowser control.
That will give you access to the underlying HtmlDocument which you can then
query.
 
Aristotelis,

They (we and others, including me) use that JavaScript to prevent things
such as spamming. You are asking us to give you a method to get around it
and to post that on this board. Even if I knew one, my answer would be: no
way.

:-)

Cor
 
I think there will be a problem with JavaScript. If I load a page that
contains a JavaScript alert message box, the whole process will stop and the
user will see that window on the screen. Is there a way to disable
JavaScript execution for a WebBrowser control?

Aristotelis

Jared Parsons said:
Hello Aristotelis,
I tried it but the System.Windows.Forms.HtmlDocument object does not
have a constructor. How can I set the URL of the page in order to
collect the various information?

It looks like you'll have to create an instance of the WebBrowser control.
That will give you access to the underlying HtmlDocument which you can
then query.
--
Jared Parsons [MSFT]
(e-mail address removed)
All opinions are my own. All content is provided "AS IS" with no
warranties, and confers no rights.
 
Aristotelis Pitaridis said:
I am trying to extract all the links and the image URLs from an HTML file.
I tried to read byte - byte the information in order to detect the URLs but
it did not work because some JavaScript or other information in the HTML
file caused problems. Is there any class which works ok with this kind of
data extraction?

I suggest using an HTML parser instead of regular expressions for this
purpose:

Parsing an HTML file:

MSHTML Reference
<URL:http://msdn.microsoft.com/library/default.asp?url=/workshop/browser/mshtml/reference/reference.asp>

- or -

.NET Html Agility Pack: How to use malformed HTML just like it was
well-formed XML...
<URL:http://blogs.msdn.com/smourier/archive/2003/06/04/8265.aspx>

Download:

<URL:http://www.codefluent.com/smourier/download/htmlagilitypack.zip>

- or -

SgmlReader 1.4
<URL:http://www.gotdotnet.com/Community/...mpleGuid=B90FDDCE-E60D-43F8-A5C4-C3BD760564BC>
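
To illustrate why a parser copes better than byte-by-byte scanning, here is
a minimal, language-neutral sketch. It is written in Python with the
standard library's html.parser module, not with any of the .NET libraries
listed above, and the names LinkCollector, extract_urls, SAMPLE and the
example.com URLs are made up for the illustration. The point is only that a
parser treats the body of a script element as plain text, so markup that
appears inside JavaScript strings is not mistaken for real links.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkCollector(HTMLParser):
    """Collects href values from <a> tags and src values from <img> tags."""

    def __init__(self, base_url=""):
        super().__init__()
        self.base_url = base_url  # used to resolve relative URLs
        self.links = []
        self.images = []

    def handle_starttag(self, tag, attrs):
        # The parser lowercases tag and attribute names before calling this.
        attributes = dict(attrs)
        if tag == "a" and attributes.get("href"):
            self.links.append(urljoin(self.base_url, attributes["href"]))
        elif tag == "img" and attributes.get("src"):
            self.images.append(urljoin(self.base_url, attributes["src"]))


def extract_urls(html_text, base_url=""):
    """Returns (links, images) found in html_text (hypothetical helper)."""
    collector = LinkCollector(base_url)
    collector.feed(html_text)
    collector.close()
    return collector.links, collector.images


# Hypothetical page: the script body contains text that looks like a link.
SAMPLE = """<html><head>
<script>document.write("<a href='http://spam.example/fake'>");</script>
</head><body>
<a href="/about.html">About</a>
<img src="logo.png" alt="logo">
<A HREF="http://example.org/page">Other</A>
</body></html>"""

links, images = extract_urls(SAMPLE, "http://example.com/dir/index.html")
print(links)
print(images)
```

Only the two anchors in the body and the one image should be reported; the
anchor-like text inside the script element is passed to the parser as data
and ignored. The same idea applies to the MSHTML, Html Agility Pack and
SgmlReader options, although their APIs differ.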

If the file read is in XHTML format, you can use the classes contained in
the 'System.Xml' namespace for reading information from the file.
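
As a rough sketch of the XHTML case, the following uses Python's
xml.etree.ElementTree as a stand-in for the 'System.Xml' classes (it is not
the .NET API). It assumes the document is well-formed and declares the
standard XHTML namespace; extract_xhtml_urls and XHTML_SAMPLE are
hypothetical names used only for this example.

```python
import xml.etree.ElementTree as ET

# Elements in an XHTML document live in this namespace.
XHTML_NS = "{http://www.w3.org/1999/xhtml}"


def extract_xhtml_urls(xhtml_text):
    """Returns (links, images) from a well-formed XHTML string."""
    root = ET.fromstring(xhtml_text)
    links = [a.get("href") for a in root.iter(XHTML_NS + "a") if a.get("href")]
    images = [i.get("src") for i in root.iter(XHTML_NS + "img") if i.get("src")]
    return links, images


XHTML_SAMPLE = """<html xmlns="http://www.w3.org/1999/xhtml">
<head><title>Sample</title></head>
<body>
<p><a href="page1.html">One</a> <img src="pic.gif" alt="pic" /></p>
<p><a href="http://example.org/">Two</a></p>
</body>
</html>"""

xhtml_links, xhtml_images = extract_xhtml_urls(XHTML_SAMPLE)
print(xhtml_links)
print(xhtml_images)
```

An XML parser rejects anything that is not well-formed, so this approach
only suits pages that really are valid XHTML; for ordinary HTML found on
the web, one of the tolerant parsers above is the safer choice.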
 