Regexp help

Steffan A. Cline

I am working on some maintenance coding and am stuck. The existing program
parses web pages and downloads files, somewhat like a crawler.

The code looks like this in its original form.

// get form view state and event validation
int viewstateIndex = content.IndexOf("id=\"__VIEWSTATE\"") + 24;
int viewstateEndIndex = content.IndexOf("\" />", viewstateIndex);

_viewstate = HttpUtility.UrlEncodeUnicode(content.Substring(viewstateIndex,
viewstateEndIndex - viewstateIndex));

I am thinking it should look more like this to be reliable but I am unsure
of how to use the $1 syntax to extract the contents of the match.

Match _viewstateValue = Regex.Match(content, "<input .*?id=\"__VIEWSTATE\" .*?value=\"([^\"].*)");
_viewstate = HttpUtility.UrlEncodeUnicode(_viewstateValue.ToString());


int eventIndex = content.IndexOf("id=\"__EVENTVALIDATION\"") + 30;
int eventEndIndex = content.IndexOf("\" />", eventIndex);

_eventvalidation =
HttpUtility.UrlEncodeUnicode(content.Substring(eventIndex, eventEndIndex -
eventIndex));

And again... I tried but I need to use the $1 syntax to get the value in the
middle and not the whole string.

Match _eventvalidationValue = Regex.Match(content, "<input .*?id=\"__EVENTVALIDATION\" .*?value=\"([^\"].*)");
_eventvalidation = HttpUtility.UrlEncodeUnicode(_eventvalidationValue.ToString());

Obviously the string chopping works, but a regexp seems more robust and
flexible. Does anyone have any suggestions on these?

Thanks,
Steffan
 
No Takers?
 
I've been playing with this for a while and am getting nowhere. Is it
required to replace the text rather than just extract it?

Thanks,
Steffan
 
Hello Steffan,

You can use capturing groups to extract values from a Match:

Regex rx = new Regex("<input.*?id=\"__EVENTVALIDATION\" .*?value=\"(?<Value>[^\"]*)");

Match m = rx.Match(content);
if (m.Success)
{
    // the string you want is in the named group
    string value = m.Groups["Value"].Value;
}
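
To make the capturing-group approach concrete, here is a minimal sketch that pulls both hidden fields out of a page. The helper name ExtractHiddenField and the sample markup are made up for illustration, and the pattern assumes the id attribute comes before the value attribute inside the tag. Note that $1 is substitution syntax for Regex.Replace; for plain extraction you read m.Groups[1].Value or, with a named group, m.Groups["Value"].Value, so no replacing is needed.

```csharp
using System;
using System.Text.RegularExpressions;

static class HiddenFieldDemo
{
    // Hypothetical helper: returns the value attribute of the <input> with
    // the given id, or null when no such input is found.
    public static string ExtractHiddenField(string content, string fieldId)
    {
        // [^>]*? keeps the search inside a single tag; [^"]* stops at the
        // closing quote of the value attribute.
        string pattern = "<input[^>]*?id=\"" + Regex.Escape(fieldId)
                       + "\"[^>]*?value=\"(?<Value>[^\"]*)\"";
        Match m = Regex.Match(content, pattern);
        return m.Success ? m.Groups["Value"].Value : null;
    }

    static void Main()
    {
        // Made-up markup in roughly the shape an ASP.NET page emits.
        string content =
            "<input type=\"hidden\" name=\"__VIEWSTATE\" id=\"__VIEWSTATE\" value=\"dDwtMTIz\" />\n" +
            "<input type=\"hidden\" name=\"__EVENTVALIDATION\" id=\"__EVENTVALIDATION\" value=\"wEWAgKc\" />";

        Console.WriteLine(ExtractHiddenField(content, "__VIEWSTATE"));       // dDwtMTIz
        Console.WriteLine(ExtractHiddenField(content, "__EVENTVALIDATION")); // wEWAgKc
    }
}
```

The result can then be passed to HttpUtility.UrlEncodeUnicode exactly as in the original code.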

Alternatively, you could use lookarounds so that the match itself contains
only the value you are looking for:

Regex rx = new Regex("(?<=<input.*?id=\"__EVENTVALIDATION\" .*?value=\").*?(?=\")");

and use

Match m = rx.Match(content);
if (m.Success)
{
    // the string you want is the whole match
    string value = m.Value;
}

In your regexes, make sure not to use too many .*'s. In many cases you're
better off using [^"]* or [^>]* when matching inside tag contents.
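
As a rough illustration of why, the sketch below runs the original ([^\"].*) capture and a negated-class capture against the same made-up line of markup. The class name GreedyDemo and the sample input are assumptions for the demo only.

```csharp
using System;
using System.Text.RegularExpressions;

static class GreedyDemo
{
    // Returns the first capture group of the first match, or null.
    public static string Capture(string content, string pattern)
    {
        Match m = Regex.Match(content, pattern);
        return m.Success ? m.Groups[1].Value : null;
    }

    static void Main()
    {
        // Two inputs on one line, so the greedy dot has room to overshoot.
        string content =
            "<input id=\"__VIEWSTATE\" value=\"abc\" /><input id=\"other\" value=\"xyz\" />";

        // Greedy .* runs to the end of the line and drags the rest of the
        // markup into the capture.
        Console.WriteLine(Capture(content, "id=\"__VIEWSTATE\" value=\"([^\"].*)"));

        // [^"]* cannot cross a quote, so it stops at the end of the value.
        Console.WriteLine(Capture(content, "id=\"__VIEWSTATE\" value=\"([^\"]*)"));
    }
}
```

The first call captures everything from abc to the end of the line; the second captures only abc.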

Even better would be to load the HTML into a parser and extract the value
from there. That would be much more stable and safer, depending on how much
the generated code might change. Look for the HtmlAgilityPack on CodePlex: http://www.codeplex.com/htmlagilitypack
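
A parser-based version might look roughly like the sketch below. It assumes the HtmlAgilityPack assembly is referenced by the project (it is a third-party library, not part of the framework); the helper name ExtractHiddenField and the sample markup are made up for illustration.

```csharp
using System;
using HtmlAgilityPack; // third-party library; must be referenced separately

static class ParserDemo
{
    // Hypothetical helper: returns the value attribute of the <input> with
    // the given id, or null when the page has no such element.
    public static string ExtractHiddenField(string html, string fieldId)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        // SelectSingleNode returns null when nothing matches the XPath.
        HtmlNode node = doc.DocumentNode.SelectSingleNode("//input[@id='" + fieldId + "']");
        return node == null ? null : node.GetAttributeValue("value", "");
    }

    static void Main()
    {
        // Attribute order no longer matters, unlike in the regex versions.
        string html =
            "<form><input value=\"dDwtMTIz\" type=\"hidden\" id=\"__VIEWSTATE\" /></form>";

        Console.WriteLine(ExtractHiddenField(html, "__VIEWSTATE"));
    }
}
```

Because the lookup goes through the parsed document, it keeps working if the attributes are reordered or the tag is split across lines.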
 