Regexp help

Steffan A. Cline · Mar 18, 2009

I am doing some maintenance coding and am stuck. The program parses web
pages and downloads files, somewhat like a crawler.

The code looks like this in its original form.

// get form view state and event validation
int viewstateIndex = content.IndexOf("id=\"__VIEWSTATE\"") + 24;
int viewstateEndIndex = content.IndexOf("\" />", viewstateIndex);

_viewstate = HttpUtility.UrlEncodeUnicode(content.Substring(viewstateIndex,
viewstateEndIndex - viewstateIndex));

I am thinking it should look more like this to be reliable but I am unsure
of how to use the $1 syntax to extract the contents of the match.

Match _viewstateValue = Regex.Match(content, "<input .*?id=\"__VIEWSTATE\" .*?value=\"([^\"].*)");
_viewstate = HttpUtility.UrlEncodeUnicode(_viewstateValue.ToString());

int eventIndex = content.IndexOf("id=\"__EVENTVALIDATION\"") + 30;
int eventEndIndex = content.IndexOf("\" />", eventIndex);

_eventvalidation =
HttpUtility.UrlEncodeUnicode(content.Substring(eventIndex, eventEndIndex -
eventIndex));

And again... I tried but I need to use the $1 syntax to get the value in the
middle and not the whole string.

Match _eventvalidationValue = Regex.Match(content, "<input .*?id=\"__EVENTVALIDATION\" .*?value=\"([^\"].*)");
_eventvalidation = HttpUtility.UrlEncodeUnicode(_eventvalidationValue.ToString());

Obviously the string chopping works, but a regexp seems more robust if the
markup changes. Does anyone have any suggestions on these?

Thanks,
Steffan

Steffan A. Cline · Mar 21, 2009

No Takers?

Steffan A. Cline · Mar 23, 2009

I've been playing with this for a while and am getting nowhere. Is it
required to replace the text rather than just extract it?

Thanks,
Steffan

Jesse Houwing · Mar 30, 2009

Hello Steffan,

You can use capturing groups to extract values from a Match:

Regex rx = new Regex("<input.*?id=\"__EVENTVALIDATION\" .*?value=\"(?<Value>[^\"]*)\"");

Match m = rx.Match(content);
if (m.Success)
{
    // the string you want is in the named group
    string value = m.Groups["Value"].Value;
}
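
On the `$1` question: in .NET, `$1` is only meaningful inside a replacement string passed to `Regex.Replace`. When extracting, the same group is read from the `Match` as `m.Groups[1]`, so no replacing is needed. Below is a minimal sketch of that idea. The `HiddenFieldReader` class and `TryExtract` method are hypothetical names made up for illustration, and the pattern assumes `id` appears before `value` inside the tag, as in the snippets above.

```csharp
using System;
using System.Text.RegularExpressions;

public static class HiddenFieldReader
{
    // Hypothetical helper: looks for <input ... id="{id}" ... value="..."> and
    // hands back the text between the quotes of the value attribute.
    // Assumption: id comes before value inside the tag.
    public static bool TryExtract(string content, string id, out string value)
    {
        string pattern =
            "<input[^>]*?id=\"" + Regex.Escape(id) + "\"[^>]*?value=\"([^\"]*)\"";
        Match m = Regex.Match(content, pattern);

        // Groups[0] is the whole match; Groups[1] is the first parenthesized
        // group, which is what "$1" would refer to in a replacement string.
        value = m.Success ? m.Groups[1].Value : string.Empty;
        return m.Success;
    }
}
```

With something like this in place, the original two blocks could call `HiddenFieldReader.TryExtract(content, "__VIEWSTATE", out _viewstate)` and then URL-encode the result as before.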

Alternatively, you could use lookarounds to match only the value you are
looking for:

Regex rx = new Regex("(?<=<input.*?id=\"__EVENTVALIDATION\" .*?value=\")[^\"]*(?=\")");

and use

Match m = rx.Match(content);
if (m.Success)
{
    // the string you want is the whole match
    string value = m.Value;
}

In your regexes, make sure not to use too many .*'s, because .* can run past
the end of the attribute or tag. In many cases you're better off using [^"]*
or [^>]* when matching inside tag contents.
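
To illustrate the difference, here is a small sketch comparing a greedy `.*` group with a `[^"]*` group. The `GreedyDemo` class and its sample markup are made up for this example; the sample puts two hidden inputs on one line, which is an assumption about what the page might look like.

```csharp
using System;
using System.Text.RegularExpressions;

public static class GreedyDemo
{
    // Made-up sample: two hidden inputs on the same line.
    public const string Sample =
        "<input id=\"__VIEWSTATE\" value=\"abc\" />" +
        "<input id=\"__EVENTVALIDATION\" value=\"xyz\" />";

    // Greedy: ".*" runs to the end of the line and backtracks only to the
    // last quote, so the group also swallows the start of the second tag.
    public static string Greedy(string content)
    {
        return Regex.Match(content, "id=\"__VIEWSTATE\" value=\"(.*)\"").Groups[1].Value;
    }

    // Negated class: "[^\"]*" cannot cross the closing quote of the attribute.
    public static string Bounded(string content)
    {
        return Regex.Match(content, "id=\"__VIEWSTATE\" value=\"([^\"]*)\"").Groups[1].Value;
    }
}
```

On this sample, `Bounded` yields only the first value, while `Greedy` continues through the second tag up to its final quote.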

Even better would be to load the HTML into a parser and extract the value
from there; that would be much more stable and safer, depending on how much
the generated code might change. Look for the HTML Agility Pack on CodePlex:
http://www.codeplex.com/htmlagilitypack
