Help with regular expression?

  • Thread starter: Bradley Plett

Bradley Plett

I'm hopeless at regular expressions (I just don't use them often
enough to gain/maintain knowledge), but I need one now and am looking
for help. I need to parse through a document to find a URL, and then
reconstruct another URL based on it. For example, I need to scan a
web page looking for something like <a
href="some_dir/list_20050815100225.csv">. I don't know in advance
what the date/time in the file name will be. I need to take the
result of that and construct a URL out of it so that I can automate
the download of this file on a regular basis. The replace can be done
by replacing "<token>" in
"http://www.whatever.com/some_dir/list_<token>" with the result from
above. However, I would like the directory information included in
the search result so that I don't have to hard-code it (i.e. I'd
rather look for a URL with "list_<datetime>.csv" in it).

I have a regular expression that comes close:
"href=""some_dir/list_(?:(?<1>[^""]*)""|(?<1>\S+))". I got that by
tweaking the example at
http://msdn.microsoft.com/library/d...uide/html/cpconexamplechangingdateformats.asp.
If I can't find a cleaner sample, that will have to do. However,
there are two minor problems with this expression: 1) I would rather
be returning the complete URL in the href (to make it easier to
capture variable subdirectories, for example), and 2) it would require
a two-step process (the match followed by the replace). Is it
possible to have a single regular expression do both? That would
simplify configuration of my program, since the intent is that none of
this be hard-coded.

Any help would be appreciated.

Thanks!
Brad.

P.S. If there's a better place to post this kind of question, I'd
love to hear about it. I was tempted to cross-post, but.... :-)
 
This regex will identify absolute URLs (http, https, or mailto):

(http|https|mailto):([a-zA-Z0-9$_.+!*(),;/?:@&~=%-])+#*([a-zA-Z0-9$_.+!*(),;/?:@&~=%-])
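
A quick way to see what this pattern does and does not catch is to run it against both an absolute URL and the relative href from the original question. This is only a sketch: the class and method names (UrlPatternCheck, FindAbsoluteUrl) are made up for the example, and the sample inputs are adapted from the question.

```csharp
using System;
using System.Text.RegularExpressions;

public static class UrlPatternCheck
{
    // The pattern from the post above, unchanged.
    private const string Pattern =
        @"(http|https|mailto):([a-zA-Z0-9$_.+!*(),;/?:@&~=%-])+#*([a-zA-Z0-9$_.+!*(),;/?:@&~=%-])";

    // Returns the first absolute URL found in the text, or null if there is none.
    public static string FindAbsoluteUrl(string text)
    {
        Match match = Regex.Match(text, Pattern);
        return match.Success ? match.Value : null;
    }

    public static void Main()
    {
        string absolute = "<a href=\"http://www.whatever.com/some_dir/list_20050815100225.csv\">";
        string relative = "<a href=\"some_dir/list_20050815100225.csv\">";

        // The absolute link is found in full.
        Console.WriteLine(FindAbsoluteUrl(absolute) ?? "(no match)");
        // The relative link has no scheme, so the pattern finds nothing.
        Console.WriteLine(FindAbsoluteUrl(relative) ?? "(no match)");
    }
}
```

Because the pattern requires a scheme such as http: at the start, it skips relative href values like the one in the question.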
 
The example that I cited is actually closer to what I need, but
thanks!

Brad.

 
I'll put this in C#.

using System;
using System.Text.RegularExpressions;

// The named group "url" captures the relative path inside the href attribute.
Regex regex = new Regex("href=\\\"(?'url'some_dir\\/list_[^\\\"]*)\\\"",
    RegexOptions.IgnoreCase | RegexOptions.Singleline |
    RegexOptions.ExplicitCapture);
string form = "<a href=\"some_dir/list_20050815100225.csv\">";
Match match = regex.Match(form);

if (match.Success)
{
    // Prepend the site root to the captured relative path.
    Console.WriteLine("success: " + "http://www.whatever.com/" +
        match.Groups["url"].Value);
}
else
{
    Console.WriteLine("failed.");
}

and it gets this result:

success: http://www.whatever.com/some_dir/list_20050815100225.csv


Bruce Dunwiddie
www.csvreader.com


 
Yes, if I tweak the regular expression you provided just slightly (by
replacing "'url'some_dir" with "'url'[^\\\"]*"), that works well and
includes the directory information even if it changes. Now it would
be nice if I could include the ["http://www.whatever.com/" +
match.Groups["url"].Value] in the same regular expression, but that
may be asking too much! :-)

Thanks!
Brad.
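
For reference, the tweaked pattern can be sketched against a page containing several links. This is a minimal illustration, not the poster's actual program: the names ListLinkFinder and FindListUrls and the sample HTML are assumptions, and new Uri(baseUri, relative) stands in for the plain string concatenation used earlier in the thread.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class ListLinkFinder
{
    // The tweak described above: the directory part is no longer hard-coded.
    // Note that the pattern still expects at least one "/" before "list_".
    private static readonly Regex HrefPattern = new Regex(
        "href=\\\"(?'url'[^\\\"]*\\/list_[^\\\"]*)\\\"",
        RegexOptions.IgnoreCase | RegexOptions.ExplicitCapture);

    // Returns every matching href, resolved against the address of the page.
    public static List<string> FindListUrls(string html, Uri baseUri)
    {
        var results = new List<string>();
        foreach (Match match in HrefPattern.Matches(html))
        {
            // Resolve the captured relative path against the base address.
            results.Add(new Uri(baseUri, match.Groups["url"].Value).AbsoluteUri);
        }
        return results;
    }

    public static void Main()
    {
        // Hypothetical page content with two list links and one unrelated link.
        string html =
            "<a href=\"some_dir/list_20050815100225.csv\">today</a> " +
            "<a href=\"other/index.html\">home</a> " +
            "<a href=\"archive/2005/list_20050814100225.csv\">yesterday</a>";

        foreach (string url in FindListUrls(html, new Uri("http://www.whatever.com/")))
        {
            Console.WriteLine(url);
        }
    }
}
```

With the sample input, only the two hrefs containing "/list_" are returned, each with its own directory preserved.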

 
Hi Bradley,

As far as I know, a regular expression can only match text in a string;
it cannot concatenate strings. So I think you have to do the string
operations in the C# code. HTH.

Kevin Yu
=======
"This posting is provided "AS IS" with no warranties, and confers no
rights."
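
One option that may fit the "nothing hard-coded" goal: .NET replacement patterns can refer to named groups, so the URL template can sit in configuration next to the pattern, and Match.Result expands that template for a single match. The string building still happens in .NET code rather than inside the regular expression itself. The sketch below assumes both strings are read from configuration; the names ConfiguredUrlBuilder and BuildUrl are hypothetical.

```csharp
using System;
using System.Text.RegularExpressions;

public static class ConfiguredUrlBuilder
{
    // Expands the template for the first match, or returns null when nothing matches.
    public static string BuildUrl(string html, string pattern, string template)
    {
        Match match = Regex.Match(html, pattern, RegexOptions.IgnoreCase);
        return match.Success ? match.Result(template) : null;
    }

    public static void Main()
    {
        // Both strings would normally come from configuration; they are inlined here.
        string pattern = @"href=""(?<url>[^""]*/list_[^""]*)""";
        string template = "http://www.whatever.com/${url}";  // ${url} refers to the named group

        string html = "<a href=\"some_dir/list_20050815100225.csv\">";
        Console.WriteLine(BuildUrl(html, pattern, template) ?? "(no match)");
    }
}
```

With this arrangement, changing the site or the file naming scheme means editing two configuration strings rather than the code.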
 