Regex help!

  • Thread starter: Rob

Rob

Hi,
I've written a small VB application that parses an HTML document and
removes code I don't need and re-writes the file. I'm looking for the
regex pattern that will remove the following code:

<head>
<meta http-equiv=Content-Type content="text/html; charset=windows-1252">
<!--[if !mso]>
<style>
v\:* {behavior:url(#default#VML);}
o\:* {behavior:url(#default#VML);}
w\:* {behavior:url(#default#VML);}
.shape {behavior:url(#default#VML);}
</style>
<![endif]-->
<title>T1 CRR Online Manual</title>
<style>
<!--

-->
</style>
</head>

...of course the page continues after the </head> tag but I want to
remove everything within the head tags including the head tags. This has
to work on any HTML file that I parse so the contents within the head
tags may be different. This is what I've got so far:

pattern = "<head>[.|\s|\n]*<\/head>"
returntext = Regex.Replace(returntext, pattern, "")

...but this doesn't work. Anyone out there with a solution?

Thanks
Rob
 
*** Sent via Developersdex http://www.developersdex.com ***

How about:

Regex.Replace(returnText, "<head>.*</head>", "",
RegexOptions.Singleline)
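
If the files vary a lot, a slightly more defensive variant may be worth trying. The sketch below is only an illustration under a few assumptions: the helper name StripHead and the sample input are made up for this example, the lazy .*? is used so the match stops at the first closing tag, and IgnoreCase plus [^>]* are there in case some files use <HEAD> or put attributes on the tag.

```vb
Imports System
Imports System.Text.RegularExpressions
Imports Microsoft.VisualBasic

Module HeadStripper
    ' StripHead is a hypothetical helper name, not something from the original application.
    Function StripHead(ByVal html As String) As String
        ' <head\b[^>]*>  opening tag, with optional attributes
        ' .*?            lazy: stop at the first closing tag rather than the last
        ' </head\s*>     closing tag, tolerating whitespace before ">"
        Dim pattern As String = "<head\b[^>]*>.*?</head\s*>"
        Return Regex.Replace(html, pattern, "", _
            RegexOptions.Singleline Or RegexOptions.IgnoreCase)
    End Function

    Sub Main()
        ' Made-up sample input spanning several lines.
        Dim html As String = "<html><HEAD>" & vbCrLf & _
                             "<title>T1</title>" & vbCrLf & _
                             "</HEAD><body>Hi</body></html>"
        Console.WriteLine(StripHead(html))   ' prints <html><body>Hi</body></html>
    End Sub
End Module
```

For anything much more irregular than this, an HTML parser is probably a safer bet than a regex.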

Thanks,

Seth Rowe
 
Thanks Seth,

That works...but only with the RegexOptions.Singleline parameter. I
didn't have that before.

Rob
 

That's because the Singleline option tells the period to also match the
newline character.
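
A small test may make the difference easier to see. This is just a sketch with a made-up input string. It also shows why the original character class did not help: inside [...] the "." and "|" are literal characters, so [.|\s|\n] only matches dots, pipes and whitespace. A class like [\s\S] is one common way to match any character without setting the option.

```vb
Imports System
Imports System.Text.RegularExpressions
Imports Microsoft.VisualBasic

Module SinglelineDemo
    Sub Main()
        ' Made-up three-line input.
        Dim text As String = "<head>" & vbLf & "<title>x</title>" & vbLf & "</head>"

        ' Default: "." matches any character except \n, so the match cannot span lines.
        Console.WriteLine(Regex.IsMatch(text, "<head>.*</head>"))                            ' False

        ' Singleline: "." also matches \n.
        Console.WriteLine(Regex.IsMatch(text, "<head>.*</head>", RegexOptions.Singleline))   ' True

        ' The original class: "." and "|" are literals here, so letters and tags never match.
        Console.WriteLine(Regex.IsMatch(text, "<head>[.|\s|\n]*<\/head>"))                   ' False

        ' [\s\S] matches any character, newline included, with no option needed.
        Console.WriteLine(Regex.IsMatch(text, "<head>[\s\S]*</head>"))                       ' True
    End Sub
End Module
```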

Thanks,

Seth Rowe
 