Find all href from a html file

Hemant

Hi,
I want to find all the href values from the anchor tags in an HTML file.
I have read the file into a string, but I can't work out how to get the URL
from the href of each anchor tag.
I need to get all the URLs from the anchor tags.
Thanks,
Hemant
 

Use regular expressions.

If your link is

<a name="label">Any content</a>

and you need to get "label", use

using System.Text.RegularExpressions;

// Matches the text between name=" and the next double quote.
Regex regex = new Regex(
    @"(?<=name="").*?(?="")",
    RegexOptions.IgnoreCase
    | RegexOptions.Multiline
    | RegexOptions.IgnorePatternWhitespace
    | RegexOptions.Compiled
);

If you need to get a named anchor from a URL like

<a href="http://www.site.com/page.htm#tips">Jump to Tips</a>

then use the following:

// Matches from the # up to (not including) the next double quote, e.g. #tips.
Regex regex = new Regex(
    @"[#].*?(?="")",
    RegexOptions.IgnoreCase
    | RegexOptions.Multiline
    | RegexOptions.IgnorePatternWhitespace
    | RegexOptions.Compiled
);
 
Hi,
Thanks for your reply.
I have done as you said:

System.IO.StreamReader rdr = new System.IO.StreamReader("c:\\test.html");
string inputString = rdr.ReadToEnd();

Regex regex = new Regex(
    @"(?<=href="").*?(?="")",
    RegexOptions.IgnoreCase
    | RegexOptions.Multiline
    | RegexOptions.IgnorePatternWhitespace
    | RegexOptions.Compiled
);

MatchCollection col = regex.Matches(inputString);
foreach (Match match in col)
{
    Console.WriteLine("href = " + match.Groups["href"].Value);
}
Console.ReadLine();

But the only output I get is href = ""

I want to get what is inside the href. Am I going wrong anywhere?
Please advise.

Thanks,
Hemant


Hi Hemant,

This is because you are reading the value of a group named "href"
(match.Groups["href"].Value), but there is no such group in your
search pattern.

Either use

Console.WriteLine("href = " + match.Value);

or change your pattern to

@"(?<=href="")(?<href>.*?)(?="")"

where ?<href> creates a named group.

Hope this helps.
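
For illustration, here is a minimal sketch of both fixes side by side. The class name HrefGroupDemo and the sample HTML string are made up for this example and do not come from the thread. Note that the pattern matches any double-quoted href attribute, not only those on anchor tags.

```csharp
using System;
using System.Text.RegularExpressions;

public static class HrefGroupDemo
{
    // Hypothetical sample input, not from the original thread.
    public const string SampleHtml =
        "<a href=\"http://www.site.com/page.htm\">Page</a> <a href=\"/tips.htm\">Tips</a>";

    public static void Main()
    {
        // Option 1: no named group. The whole match is the URL, because the
        // lookbehind and lookahead are not part of the matched text.
        Regex plain = new Regex(@"(?<=href="").*?(?="")", RegexOptions.IgnoreCase);
        foreach (Match match in plain.Matches(SampleHtml))
        {
            Console.WriteLine("href = " + match.Value);
        }

        // Option 2: the named group "href" exists, so Groups["href"] has a value.
        Regex named = new Regex(@"(?<=href="")(?<href>.*?)(?="")", RegexOptions.IgnoreCase);
        foreach (Match match in named.Matches(SampleHtml))
        {
            Console.WriteLine("href = " + match.Groups["href"].Value);
        }
    }
}
```

Both loops should print the same two URLs; the only difference is whether the value is read from the whole match or from the named group.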
 
Hi,
Thanks, Alexey.
I have one more question.
I have solved my problem with regex, but if I want to use .NET 3.5, is
there another way to solve this problem?
Thanks,
Hemant

Why would you need another method in .NET 3.5? This approach should
work.
 
Hi,
You are right.
I have done my work with regex.
I want to learn ASP.NET 3.5 and use its new features; that is why I
am asking.
Thanks,
Hemant
To see the differences between versions of ASP.NET, please take a look
here:

http://msdn.microsoft.com/en-us/library/s57a598e.aspx
http://aspnet.4guysfromrolla.com/articles/112107-1.aspx

The most significant advances in .NET 3.5 are improved support for
developing AJAX-enabled Web sites and support for Language-Integrated
Query (LINQ).
 
Hi,
Thanks for your reply.
OK, then can I use LINQ for the same problem?
Thanks,
Hemant
Try something like this, using the Html Agility Pack:

using System.Collections.Generic;
using System.IO;
using HtmlAgilityPack;

HtmlDocument doc = new HtmlDocument();
doc.Load(new StringReader(@"<html><head><title>...</body></html>"));
HtmlNode root = doc.DocumentNode;
// This XPath keeps only links whose href ends in ".txt";
// use "//a[@href]" to select every anchor that has an href.
HtmlNodeCollection links = root.SelectNodes(
    "//a[@href['.txt' = substring(., string-length(.) - 3)]]");
IList<string> fileStrings;
if (links != null) {
    fileStrings = new List<string>(links.Count);
    foreach (HtmlNode link in links)
        fileStrings.Add(link.GetAttributeValue("href", null));
} else {
    // SelectNodes returns null when nothing matches.
    fileStrings = new List<string>(0);
}

Source: http://stackoverflow.com/questions/907563/parsing-html-document-regular-expression-or-linq
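
On the LINQ question itself: LINQ to Objects can also be layered over the regex approach from earlier in the thread. The following is only a sketch under assumptions. The class HrefLinqDemo, the helper ExtractHrefs and the sample HTML are invented for illustration, and the pattern only handles double-quoted href attributes. On .NET 3.5, MatchCollection implements only the non-generic IEnumerable, so Cast<Match>() is needed before the other LINQ operators.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

public static class HrefLinqDemo
{
    private static readonly Regex HrefRegex =
        new Regex(@"(?<=href="")(?<href>.*?)(?="")", RegexOptions.IgnoreCase);

    // Hypothetical helper: returns the distinct, non-empty href values in the HTML text.
    public static List<string> ExtractHrefs(string html)
    {
        return HrefRegex.Matches(html)
            .Cast<Match>()                        // MatchCollection is non-generic on .NET 3.5
            .Select(m => m.Groups["href"].Value)  // read the named group
            .Where(url => url.Length > 0)         // skip empty href="" attributes
            .Distinct()
            .ToList();
    }

    public static void Main()
    {
        // Hypothetical sample input, not from the original thread.
        string html = "<a href=\"/a.txt\">A</a><a href=\"/b.htm\">B</a><a href=\"/a.txt\">A again</a>";
        foreach (string url in ExtractHrefs(html))
        {
            Console.WriteLine(url);
        }
    }
}
```

The same Select/Where/Distinct chain could be applied to the node collection returned by SelectNodes in the Html Agility Pack example, after checking it for null.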
 