Regex Matching Question

  • Thread starter: George Durzi

George Durzi

Consider this excerpt from some HTML. (This is a copy from View->Source,
except for the comment)

<TABLE WIDTH=100% CELLPADDING=0 CELLSPACING=0 border=0>
<?xml version="1.0" encoding="UTF-16"?>
<!-- need to extract whatever is here -->
</TABLE>

I need to extract all the HTML that would be in the <!-- need to extract
whatever is here --> section. So I did the following.

1. Retrieve the HTML into a string variable
Interesting observation: when I look at the contents of the string, every
double quote has been escaped, so they all show as \" instead of "

2. Remove carriage returns and newlines from the string
ResultHtml = ResultHtml.Replace("\r", string.Empty);
ResultHtml = ResultHtml.Replace("\n", string.Empty);

3. Use a Regex to try and find a match

string sFind = "<TABLE WIDTH=100% CELLPADDING=0 CELLSPACING=0 border=0><?xml version=\"1.0\" encoding=\"UTF-16\"?>" + "((.|\n)*?)" + "</TABLE>";
Regex rx = new Regex(sFind, RegexOptions.IgnoreCase | RegexOptions.IgnorePatternWhitespace);
Match m1 = rx.Match(ResultHtml);
if (m1.Success)
{
    // do something
}


I never get a match. I tried this with some simpler HTML, and there the regex
works fine for retrieving what is between two table tags.

I also tried stripping all double quotes from ResultHtml, and then trying:

string sFind = "<TABLE WIDTH=100% CELLPADDING=0 CELLSPACING=0 border=0><?xml version=1.0 encoding=UTF-16?>" + "((.|\n)*?)" + "</TABLE>";

Still no match.

The string in my HTML which I'm trying to match exists exactly as in sFind.

Any idea?
 
George, try this


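using System.Text.RegularExpressions;

// Matches the text between "<!--" and "-->" without consuming the delimiters.
Regex regex = new Regex(
    @"(?<=.*?<!--\s*)(.*?)(?=\s*-->)",
    RegexOptions.IgnoreCase
    | RegexOptions.Singleline   // lets '.' match newlines, so multi-line comments work
    | RegexOptions.IgnorePatternWhitespace
    | RegexOptions.Compiled
);

Match m = regex.Match(ResultHtml);
if (m.Success)
{
    string inner = m.Value.Trim();   // text inside the comment
}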
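
As for why the original pattern never matches, two things look likely.
RegexOptions.IgnorePatternWhitespace makes the engine drop unescaped spaces
from the pattern, so "TABLE WIDTH" is read as "TABLEWIDTH". And the ? in
"<?xml" and "?>" is treated as a quantifier rather than a literal question
mark. Below is a minimal sketch of one way around both, assuming the literal
text is exactly what the question shows: run the fixed parts through
Regex.Escape and capture whatever sits between them. The names TableExtractor
and ExtractInner are made up for illustration.

```csharp
using System;
using System.Text.RegularExpressions;

static class TableExtractor
{
    // Literal markers copied from the question; adjust if the real page differs.
    const string TableTag = "<TABLE WIDTH=100% CELLPADDING=0 CELLSPACING=0 border=0>";
    const string XmlDecl = "<?xml version=\"1.0\" encoding=\"UTF-16\"?>";
    const string CloseTag = "</TABLE>";

    // Hypothetical helper: returns the text between the XML declaration and
    // the first closing </TABLE>, or null when the markers are not found.
    public static string ExtractInner(string html)
    {
        // Regex.Escape turns '?', '.', spaces and other metacharacters into literals.
        // \s* tolerates the line break between the two markers, whether or not
        // \r and \n were stripped beforehand.
        string pattern = Regex.Escape(TableTag) + @"\s*" + Regex.Escape(XmlDecl)
                       + "(.*?)" + Regex.Escape(CloseTag);

        // Singleline lets '.' match newlines; IgnorePatternWhitespace is left out on purpose.
        Match m = Regex.Match(html, pattern,
            RegexOptions.IgnoreCase | RegexOptions.Singleline);
        return m.Success ? m.Groups[1].Value : null;
    }

    static void Main()
    {
        string html = TableTag + "\r\n" + XmlDecl + "<TR><TD>hello</TD></TR>" + CloseTag;
        Console.WriteLine(ExtractInner(html));   // prints <TR><TD>hello</TD></TR>
    }
}
```

With this approach the Replace("\r", ...) and Replace("\n", ...) step becomes
optional, and the escaped quotes seen in the debugger need no special
handling, since that is only how the debugger displays a string literal.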


Alexey
 