C# Regular expression to find all instances and add a prefix


maziar.aflatoun

Hi,

My knowledge of regular expressions is really rusty now. Can someone please
help me out with this problem?

I make a request to our other server, which runs PHP, and grab the
contents of a directory in HTML format. Next I want to find every
instance of SRC="???" in markup like this:

<IMG SRC="/icons/back.gif" ALT="[DIR]"> <A HREF="/updates/">Parent Directory</A> 06-Apr-2009 01:06 -
<IMG SRC="/icons/dir.gif" ALT="[DIR]"> <A HREF="ad/">ad/</A> 06-Apr-2009 01:12 -
<IMG SRC="/icons/dir.gif" ALT="[DIR]"> <A HREF="beta/">beta/</A> 10-Mar-2009 20:47 -
<IMG SRC="/icons/unknown.gif" ALT="[ ]"> <A HREF="something">something..&gt;</A> 10-Apr-2008 00:13 94.9M

and replace each value with http://www.ourdomain.com/???, so
SRC="/icons/back.gif" would become SRC="http://www.ourdomain.com/icons/back.gif".
I need this to work for any image name, including ones added in the future.

I have this so far

string regex = "src=\"([^\"]+)"; // pulls /icons/image.gif out of the page
Regex r = new Regex(regex, RegexOptions.IgnoreCase);

return r.Replace(html, "$1").Replace(?????????) // now how do I apply the prefix part?

Can you please help me out or show me a simpler way?

Thanks
M.
 
Another option could be to use a base href tag:

http://www.drostdesigns.com/base-href-tag/

--
Patrice


Thanks, but that only works in a browser. Others need to access this
through code, so <base href ..> is not going to work.
 
Hello (e-mail address removed),

I'd have a look at the HTML Agility Pack (search CodePlex). It lets you
read the HTML as if it were XML, which makes it very simple to replace the
src values you're after. This would be the best solution. If the HTML had
been valid XHTML, you could have loaded it directly into an XML DOM
document.
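To illustrate the "load it into an XML DOM" idea only: the sketch below uses `XDocument` from the .NET base class library and therefore assumes the markup is well-formed XHTML, which the directory listing in the question is not (its IMG tags are never closed). For real-world HTML the HTML Agility Pack remains the suggestion; its API is not shown here. The class name `XhtmlSrcRewriter`, the method name, and the prefix value are made up for this example.

```csharp
using System;
using System.Linq;
using System.Xml.Linq;

public static class XhtmlSrcRewriter
{
    // Prefixes every root-relative src attribute on <img> elements.
    // Assumption: xhtml is well-formed XML with a single root element.
    public static string PrefixImgSrc(string xhtml, string prefix)
    {
        XDocument doc = XDocument.Parse(xhtml);

        var images = doc.Descendants().Where(e =>
            string.Equals(e.Name.LocalName, "img", StringComparison.OrdinalIgnoreCase));

        foreach (XElement img in images)
        {
            XAttribute src = img.Attributes().FirstOrDefault(a =>
                string.Equals(a.Name.LocalName, "src", StringComparison.OrdinalIgnoreCase));

            // Only touch paths such as "/icons/back.gif"; leave everything else alone.
            if (src != null && src.Value.StartsWith("/", StringComparison.Ordinal))
            {
                src.Value = prefix + src.Value;
            }
        }

        // DisableFormatting keeps the original whitespace instead of re-indenting.
        return doc.ToString(SaveOptions.DisableFormatting);
    }
}
```

Because the code walks elements rather than text, an href or a src on some other element is never touched by accident.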

With Regex you have a few alternatives:

1) Use a lookbehind:
Expression: (?<=src=")[^"]+
Replacement: http://www.domain.com$0

2) Capturing:
Expression: src="([^"]+)
Replacement: src="http://www.domain.com$1

3) Capturing alternative:
Expression: (src=")([^"]+)
Replacement: $1http://www.domain.com$2

4) Inserting just the part you need, by matching the insertion point:
Expression: (?<=src=")
Replacement: http://www.domain.com

They all basically come down to the same thing.
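As a rough sketch, the four alternatives could be written in C# as follows. The class and method names are invented for this example, and the prefix is the placeholder domain from the question. One difference worth knowing: with RegexOptions.IgnoreCase, alternative 2 re-types src=" in the replacement, so an upper-case SRC comes out as lower-case src, while the other three leave the attribute name as it was.

```csharp
using System;
using System.Text.RegularExpressions;

public static class SrcPrefixer
{
    // Placeholder prefix. No trailing slash, because the paths in the
    // listing already start with "/".
    public const string Prefix = "http://www.ourdomain.com";

    private const RegexOptions Options = RegexOptions.IgnoreCase;

    // 1) Lookbehind: match only the value and re-emit it via $0.
    public static string Lookbehind(string html) =>
        Regex.Replace(html, "(?<=src=\")[^\"]+", Prefix + "$0", Options);

    // 2) Capturing: the literal src=" is re-typed in the replacement, so an
    //    upper-case SRC in the input comes out as lower-case src.
    public static string Capturing(string html) =>
        Regex.Replace(html, "src=\"([^\"]+)", "src=\"" + Prefix + "$1", Options);

    // 3) Two groups: ${1} puts back the attribute name exactly as written.
    public static string TwoGroups(string html) =>
        Regex.Replace(html, "(src=\")([^\"]+)", "${1}" + Prefix + "${2}", Options);

    // 4) Zero-width match at the insertion point, right after src=".
    public static string InsertionPoint(string html) =>
        Regex.Replace(html, "(?<=src=\")", Prefix, Options);
}
```

Calling, say, `SrcPrefixer.TwoGroups(html)` on the first line of the listing turns SRC="/icons/back.gif" into SRC="http://www.ourdomain.com/icons/back.gif" and leaves the HREF alone, since none of the patterns look at href.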

Keep in mind that if the HTML has src= attributes elsewhere, these will be
replaced as well. So use this only if you control the content you're
reading, or make sure you regularly check that it still behaves correctly.
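One way to narrow the match, sketched here under a few assumptions, is to require that src= sits inside an img tag and to rewrite only values that start with a single "/". The class name, method name, and prefix parameter are hypothetical, and the pattern is still a heuristic: an attribute value containing ">" before src would defeat it.

```csharp
using System;
using System.Text.RegularExpressions;

public static class ImgSrcRewriter
{
    // Group 1: "<img ... src=" up to and including the opening quote.
    // Group 2: the attribute value.
    // \s before src keeps attributes such as data-src from matching.
    private static readonly Regex ImgSrc = new Regex(
        "(<img\\b[^>]*?\\ssrc=\")([^\"]+)",
        RegexOptions.IgnoreCase);

    public static string PrefixRootRelative(string html, string prefix)
    {
        return ImgSrc.Replace(html, m =>
        {
            string value = m.Groups[2].Value;

            // Rewrite "/icons/dir.gif"; skip "http://..." and "//host/...".
            bool rootRelative =
                value.StartsWith("/", StringComparison.Ordinal) &&
                !value.StartsWith("//", StringComparison.Ordinal);

            return rootRelative
                ? m.Groups[1].Value + prefix + value
                : m.Value;
        });
    }
}
```

Skipping values that are already absolute also makes the rewrite safe to run twice on the same text.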
 