Regex help


JKJ

I need help with a regular expression that will pull the
title and all the meta tags held in the head section of an
HTML file (including the head tags themselves). I want to
exclude everything else, such as link tags, script tags, etc.
I have a pretty big process that pulls this stuff now using
simple regular expressions, but I know I'm not using them
to their full potential.
 
I think you need to explain what you want a little more. What exactly is
the input to the regular expression, and what are you expecting as the
output? Perhaps a simple example and/or a sample of what you're doing now?
 
Here is an example of what I have done, with an explanation:

using System.Text.RegularExpressions;

// Assumed: both methods live in HttpUtilities, as the call below implies.
public static class HttpUtilities
{
    // Returns every <matchstr ...>...</matchstr> element found in input.
    public static MatchCollection HtmlMatchCollection(string input, string matchstr)
    {
        string expression = HttpUtilities.FullHtmlExpression(matchstr);

        // Note: Multiline only changes how ^ and $ behave; "." still stops at
        // newlines unless RegexOptions.Singleline is also set.
        MatchCollection mc = Regex.Matches(input, expression,
            RegexOptions.Multiline |
            RegexOptions.IgnoreCase |
            RegexOptions.IgnorePatternWhitespace);

        return mc;
    }

    public static string FullHtmlExpression(string str)
    {
        /* Example
         * <td colspan=2><img src="/images/b.gif" alt="" width="1" height="25"></td>
         *
         * data1 ==> colspan=2 (with its leading space)
         * data2 ==> <img src="/images/b.gif" alt="" width="1" height="25">
         */
        string expression =
            "<" + str +          // Match the character sequence <str
            "(?<data1>.*?)" +    // Capture the characters between <str and >
            ">" +                // Match the > character
            "(?<data2>.*?)" +    // Capture the characters between <str ...> and </str>
            "</" + str + ">";    // Match the closing </str>

        return expression;
    }
}

Cheers,
Dave
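
For the original question (head tags, title, and meta tags only), one possible approach is to work in two passes: isolate the head section first, then pick out just the title element and the meta tags inside it. The sketch below is only an illustration under a few assumptions: the class name HeadTagExtractor and the method Extract are hypothetical and not part of the code posted above, attribute values are assumed not to contain a literal ">", and commented-out tags or script bodies containing "<meta" inside the head are not filtered out. Note that a meta tag has no closing tag, so an open/close pattern like the one in FullHtmlExpression would not match it.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

// Hypothetical helper for illustration; not part of the code posted in the thread.
public static class HeadTagExtractor
{
    // Singleline lets "." cross newlines; IgnoreCase handles <HEAD>, <Title>, etc.
    private const RegexOptions Options =
        RegexOptions.IgnoreCase | RegexOptions.Singleline;

    // Returns the opening head tag, the title element, each meta tag,
    // and the closing head tag, in document order.
    public static List<string> Extract(string html)
    {
        var result = new List<string>();

        // Pass 1: isolate the head section, keeping its opening and closing tags.
        // \b keeps <head from also matching <header.
        Match head = Regex.Match(
            html,
            @"(?<open><head\b[^>]*>)(?<body>.*?)(?<close></head\s*>)",
            Options);
        if (!head.Success)
            return result;

        result.Add(head.Groups["open"].Value);

        // Pass 2: inside the head, match either a whole title element
        // or a single meta tag. Link, script, and other tags are skipped.
        foreach (Match m in Regex.Matches(
            head.Groups["body"].Value,
            @"<title\b[^>]*>.*?</title\s*>|<meta\b[^>]*>",
            Options))
        {
            result.Add(m.Value);
        }

        result.Add(head.Groups["close"].Value);
        return result;
    }
}
```

Calling HeadTagExtractor.Extract on the full page text and joining the returned strings gives a stripped-down head section. For HTML that is not well formed, a real HTML parser is likely to be more dependable than any regular expression.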