regex expression

GS

What is a good general regular expression for an HTML <img ....> tag?
I tried
"<img [/:.a-z =0-9\""_;&]*\->", RegexOptions.IgnoreCase)
but it is not quite working.

Thank you for your time.
 
Hi,

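When you use "." you have to add the lazy modifier "?" so that the "*" quantifier doesn't run past the first trailing ">". In your example the character class doesn't contain ">", so the "*" can't consume the trailing ">" anyway. I think it's the "\-" that is causing you problems, because it requires a literal "-" directly before the ">".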

Try the following expression:

Regex re = new Regex(
@"<img .*?(/>|</img>)", // this pattern accounts for the XHTML as well as HTML standards
RegexOptions.IgnoreCase | RegexOptions.ExplicitCapture);
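
A minimal sketch of why the "\-" is the likely culprit. It runs the pattern from the first post against two made-up tags; the sample strings and the class name OriginalPatternCheck are illustrative assumptions, not taken from the post.

```csharp
using System;
using System.Text.RegularExpressions;

public static class OriginalPatternCheck
{
    // The pattern from the first post, written as a C# verbatim string.
    // "\-" demands a literal hyphen directly before the closing ">".
    public static readonly Regex Original = new Regex(
        @"<img [/:.a-z =0-9\""_;&]*\->", RegexOptions.IgnoreCase);

    public static void Main()
    {
        // Hypothetical sample tags.
        string plain = "<img src=\"test.bmp\" border=0>";
        string withHyphen = "<img src=\"test.bmp\" border=0->";

        // An ordinary tag has no "->" at the end, so it is not matched.
        Console.WriteLine(Original.IsMatch(plain));
        // Only the artificial tag ending in "->" is matched.
        Console.WriteLine(Original.IsMatch(withHyphen));
    }
}
```

Dropping the "\-" from the original pattern would remove that requirement, although the character class would still reject any tag containing characters it does not list.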
 
Thank you.

However, I found that the new expression failed to find the <img> tag in
<P><IMG height=168 src="test.bmp" width=235
border=0>&nbsp;&nbsp;&nbsp;&nbsp; Brought to you by <FONT size=4>Test ABC
Inc.</FONT></P>

So I removed the slash before the >, thus:
myregex = New Regex("<img .*?(>|</img>)", RegexOptions.IgnoreCase Or
RegexOptions.ExplicitCapture) ' this is in VB

What problems might I encounter with the modified expression? Please bear with
my lack of knowledge of XML.


BTW, what about <object ... type="image/png">? Is there any chance of that being mixed
with non-image content such as scripts or applets? At the moment the <object ..>
tags for images seem to be a nest of hornets.

 
I have actually now changed the grouping:

myregex = New Regex("<img .*?(>|(/>|(</img>)))",
RegexOptions.IgnoreCase Or
RegexOptions.ExplicitCapture) ' this is in VB

I hope it does catch the XML <img .../> tags.
I have not gotten around to dealing with <image .../>; I don't expect to run into
them in my application for the next year or more.
 
Hi,

You are correct that it doesn't properly handle img tags commonly found in HTML documents. Sorry about that.

If you don't have to account for a closing </img> tag then the following should work for HTML and most standard XHTML documents:

"<img .*?>"

The expression in your other recent post will work essentially the same as the one above.
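
As a rough check of that claim, the sketch below runs both forms against the sample from the earlier post. The sample is joined onto one line and shortened, which is an assumption about the original page; the class name SimpleImgPattern is made up, and the alternation form is written with its parentheses balanced.

```csharp
using System;
using System.Text.RegularExpressions;

public static class SimpleImgPattern
{
    private const RegexOptions Options =
        RegexOptions.IgnoreCase | RegexOptions.ExplicitCapture;

    // The short form suggested in this reply.
    public static readonly Regex Simple = new Regex("<img .*?>", Options);

    // The alternation form from the other post, parentheses balanced.
    public static readonly Regex Alternation =
        new Regex("<img .*?(>|(/>|(</img>)))", Options);

    public static void Main()
    {
        // Sample from the earlier post, joined onto a single line (assumption).
        string html = "<P><IMG height=168 src=\"test.bmp\" width=235 border=0>"
                    + "&nbsp; Brought to you by <FONT size=4>Test ABC Inc.</FONT></P>";

        // Both patterns stop at the first ">" after "<IMG ".
        Console.WriteLine(Simple.Match(html).Value);
        Console.WriteLine(Alternation.Match(html).Value);
    }
}
```

Because the lazy ".*?" stops as soon as any alternative succeeds, and every "/>" or "</img>" ends in ">", the extra alternatives should not change the matched text on input like this.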
As for <object ... type="image/png">: if you need to match that as well then you'll have to use a more complex expression:

"<object .*?type=\"image/.*?\".*?(/>|</object>)"

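A small sketch of how the <object> expression behaves, using made-up markup (the file names, the Flash MIME type and the class name ObjectImagePattern are assumptions for illustration). It also shows one caveat that seems worth knowing: when a non-image <object> precedes an image one on the same line, the lazy ".*?" can start the match at the first <object and run into the second.

```csharp
using System;
using System.Text.RegularExpressions;

public static class ObjectImagePattern
{
    // The <object> expression from this reply.
    public static readonly Regex ImageObject = new Regex(
        "<object .*?type=\"image/.*?\".*?(/>|</object>)",
        RegexOptions.IgnoreCase | RegexOptions.ExplicitCapture);

    public static void Main()
    {
        // Hypothetical markup, all on one line.
        string imageOnly = "<object data=\"logo.png\" type=\"image/png\"></object>";
        string mixed =
            "<object data=\"a.swf\" type=\"application/x-shockwave-flash\"></object>"
            + imageOnly;

        // A lone image object is matched as a whole.
        Console.WriteLine(ImageObject.Match(imageOnly).Value);

        // In the mixed input the match begins at the first <object,
        // so the non-image object is swallowed into the match as well.
        Match m = ImageObject.Match(mixed);
        Console.WriteLine(m.Index);
        Console.WriteLine(m.Length == mixed.Length);
    }
}
```

If that overrun matters, one option is to replace the first two ".*?" with "[^>]*?" so the search for the type attribute stays inside a single opening tag.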
HTH

--
Dave Sexton

 
Wonderful explanation and help. Thank you.

BTW, I initially tried <img .*>, which had a greedy propensity for
gobbling up everything to the last > in the line of HTML.

So is it the ? before the > that makes it a non-greedy expression?

 
Hi,

Glad I could help.

? is a lazy modifier to the * quantifier. On its own, * is greedy: it matches as much as it can and only backs off as far as needed for the remainder of the expression to match, which is why <img .*> ran to the last > on the line. With the lazy modifier, the RegEx matches as little as possible, up to the first occurrence of the remainder of the expression, and then matches the remainder once (the trailing > in this case), but no more.

The lazy modifier is, in that respect, loosely like a positive look-ahead assertion on the remainder of the expression, except that it will keep extending the match up to the end of the input string if necessary.
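
To make the difference concrete, here is a minimal sketch comparing the greedy and lazy forms on one made-up line of HTML. The sample line, the class name GreedyVsLazy and the helper FirstMatch are assumptions for illustration.

```csharp
using System;
using System.Text.RegularExpressions;

public static class GreedyVsLazy
{
    // Returns the text of the first match, or an empty string if there is none.
    public static string FirstMatch(string pattern, string input)
    {
        return Regex.Match(input, pattern, RegexOptions.IgnoreCase).Value;
    }

    public static void Main()
    {
        // Hypothetical line with several tags after the image.
        string line = "<p><img src=\"a.gif\"> text <b>bold</b></p>";

        // Greedy: runs to the last ">" on the line.
        Console.WriteLine(FirstMatch("<img .*>", line));

        // Lazy: stops at the first ">" after "<img ".
        Console.WriteLine(FirstMatch("<img .*?>", line));
    }
}
```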

HTH

--
Dave Sexton

 
It looks like you've already had a working answer, but I still want to
comment on a few issues.

By default, the . does not match newlines, so image tags like these

<img
src = "http://.../img.gif"
/>

won't be matched if your expression is <img.*>. Adding the ? to make it
<img.*?> doesn't change that. There is an option to allow . to match
newlines (RegexOptions.Singleline in .NET), but that option is potentially
very resource intensive (if your input is 2MB, a greedy .* will match all
2MB and start backtracking from there).
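
A short sketch of the newline behaviour, assuming a made-up three-line tag (the file name and the class name DotAndNewlines are illustrative). It also tries RegexOptions.Multiline, which is easy to confuse with Singleline but only changes how ^ and $ behave.

```csharp
using System;
using System.Text.RegularExpressions;

public static class DotAndNewlines
{
    // Hypothetical image tag spread over three lines.
    public const string Html = "<img\nsrc=\"test.gif\"\n/>";

    // Reports whether <img.*?> finds the tag under the given options.
    public static bool Matches(RegexOptions options)
    {
        return Regex.IsMatch(Html, "<img.*?>", options);
    }

    public static void Main()
    {
        // Default: "." stops at the newline, so the tag is missed.
        Console.WriteLine(Matches(RegexOptions.None));

        // Singleline: "." also matches "\n", so the tag is found.
        Console.WriteLine(Matches(RegexOptions.Singleline));

        // Multiline affects ^ and $ only, so the tag is still missed.
        Console.WriteLine(Matches(RegexOptions.Multiline));
    }
}
```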

A safer expression would be the following: <img[^>]*>. This matches
everything between <img and > that is not a > itself, newlines included.
This will work in most cases. There's one problem though: > is allowed
within quotes if you follow the standards. This can also be caught in regex:

<img("[^"]*"|'[^']*'|[^>])*>
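
The difference between the two expressions can be sketched as follows, using a made-up tag whose alt text contains a ">" (the tag, the class name QuotedAngleBracket and the helper FirstMatch are assumptions for illustration).

```csharp
using System;
using System.Text.RegularExpressions;

public static class QuotedAngleBracket
{
    // Stops at the first ">" regardless of quoting.
    public const string Naive = "<img[^>]*>";

    // Consumes quoted attribute values as a unit before looking for ">".
    public const string QuoteAware = @"<img(""[^""]*""|'[^']*'|[^>])*>";

    // Returns the text of the first match, or an empty string if there is none.
    public static string FirstMatch(string pattern, string input)
    {
        return Regex.Match(input, pattern).Value;
    }

    public static void Main()
    {
        // Hypothetical tag with a ">" inside a quoted attribute value.
        string tag = "<img alt=\"a > b\" src=\"x.gif\">";

        // The naive pattern is cut short inside the alt attribute.
        Console.WriteLine(FirstMatch(Naive, tag));

        // The quote-aware pattern returns the complete tag.
        Console.WriteLine(FirstMatch(QuoteAware, tag));
    }
}
```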

If you want to catch the corresponding </img> tag as well, things get
harder, though this is still possible to a certain degree.

First we match everything up to the end of the tag
<img("[^"]*"|'[^']*'|[^>])*

and then we match either /> or >......</img>

(/>|>.*?</img>)

As you can see I added the lazy modifier again, but this will suffer the
same issues as before. So is there a better solution, you might ask?
Of course there is :).

By using a negative look-ahead we can match everything that is not the
start of </img as follows:

((?!</img).)*

Combine this with what we already had and you get this:

<img("[^"]*"|'[^']*'|[^>])*(/>|>((?!</img).)*</img>)
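
A minimal sketch of the combined expression at this stage (before whitespace in the closing tag is considered), run on a made-up single line with one <img>...</img> pair and one self-closed tag. The sample markup, the class name TemperedDot and the helper FindAll are assumptions for illustration.

```csharp
using System;
using System.Text.RegularExpressions;

public static class TemperedDot
{
    // Opening tag with quote-aware attributes, then either "/>" or
    // ">" + anything that does not start "</img" + "</img>".
    public static readonly Regex Combined = new Regex(
        @"<img(""[^""]*""|'[^']*'|[^>])*(/>|>((?!</img).)*</img>)");

    // Collects the text of every match in document order.
    public static string[] FindAll(string html)
    {
        MatchCollection matches = Combined.Matches(html);
        string[] values = new string[matches.Count];
        for (int i = 0; i < matches.Count; i++)
        {
            values[i] = matches[i].Value;
        }
        return values;
    }

    public static void Main()
    {
        // Hypothetical markup on a single line.
        string html = "<img src=\"a.gif\">inner</img><img src=\"b.gif\"/>";

        foreach (string value in FindAll(html))
        {
            Console.WriteLine(value);
        }
    }
}
```

Note that the "." inside the look-ahead group still does not cross newlines by default, so content between <img ...> and </img> has to sit on one line for this form to match.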

Only one issue left to tackle. The </img> tag does not necessarily have
the closing > directly after the tagname. Whitespace is allowed in the
closing tag. This can easily be added:

<img("[^"]*"|'[^']*'|[^>])*(/>|>((?!</img).)*</img\s+>)

Kind regards,

Jesse Houwing
 
Great answer with plenty of detail to learn from. Thank you, and keep up the good work.

 
Hi Jesse,

Thanks for bringing up those points, but I wouldn't worry about performance or memory consumption issues related to the Singleline flag (the option that lets . match newlines) when matching patterns in an HTML document. Pattern matching is slow by nature, and in this case it might not be executed in a batch process where performance would really be a concern. Also, almost any expression will probably perform well when executed against a standard-sized HTML document.

I think my solution with the addition of the Singleline option should be fine. If the user experiences performance issues due to the expression, only then would I recommend that a more complex expression be used. A more complex expression is much harder to write and debug, but it may perform better. Therefore, the user must make a trade-off decision, but I wouldn't recommend sacrificing ease of writing and debugging (and therefore understanding) to address performance concerns that aren't real. Once it's known that the expression is not going to perform well, the trade-off can be made.

Anyway, I followed your post and your points seemed to make perfect sense, but your expression didn't work when I tested it on the
following document:

string html = @"<html>
<head></head>
<body>
<img src=""test.jpg""></img>
</body>
</html>
";

0 matches.

And didn't work on this document either:

string html = @"<html>
<head></head>
<body>
<img src=""test.jpg""></img>
<img src=""test.jpg"" />
<img src=""test.jpg""></img>
</body>
</html>
";

1 match, but it's invalid:

{<img src="test.jpg"></img>
<img src="test.jpg" />}


Here's the code I used to test your expression:

System.Text.RegularExpressions.Regex re = new System.Text.RegularExpressions.Regex(
@"<img(""[^""]*""|'[^']*'|[^>])*(/>|>((?!</img).)*</img\s+>)");

foreach (System.Text.RegularExpressions.Match match in re.Matches(html))
{
match.GetType(); // break point in debugger
}

I didn't even attempt to do any debugging of my own :)

--
Dave Sexton
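
One likely explanation for these results is the "\s+" in the closing tag: it requires at least one whitespace character between "</img" and ">", so a plain "</img>" can never satisfy it. The sketch below compares the expression as posted with a variant using "\s*"; that variant is an assumption about what was intended, not something stated in the thread. The two documents are the ones from the test above, written with explicit "\n" line breaks, and the class name ClosingTagWhitespace is made up.

```csharp
using System;
using System.Text.RegularExpressions;

public static class ClosingTagWhitespace
{
    // The expression as posted: whitespace before ">" is mandatory.
    public const string WithPlus =
        @"<img(""[^""]*""|'[^']*'|[^>])*(/>|>((?!</img).)*</img\s+>)";

    // Assumed intent: whitespace before ">" is optional.
    public const string WithStar =
        @"<img(""[^""]*""|'[^']*'|[^>])*(/>|>((?!</img).)*</img\s*>)";

    public const string Doc1 =
        "<html>\n<head></head>\n<body>\n"
        + "<img src=\"test.jpg\"></img>\n"
        + "</body>\n</html>\n";

    public const string Doc2 =
        "<html>\n<head></head>\n<body>\n"
        + "<img src=\"test.jpg\"></img>\n"
        + "<img src=\"test.jpg\" />\n"
        + "<img src=\"test.jpg\"></img>\n"
        + "</body>\n</html>\n";

    // Counts how many matches a pattern finds in a document.
    public static int Count(string pattern, string html)
    {
        return Regex.Matches(html, pattern).Count;
    }

    public static void Main()
    {
        Console.WriteLine(Count(WithPlus, Doc1)); // as posted, first document
        Console.WriteLine(Count(WithPlus, Doc2)); // as posted, second document
        Console.WriteLine(Count(WithStar, Doc1)); // "\s*" variant, first document
        Console.WriteLine(Count(WithStar, Doc2)); // "\s*" variant, second document
    }
}
```

With "\s+" the only way to succeed on the second document is through "/>", and the quoted-string alternative can span from one tag's closing quote to the next tag's opening quote, which would account for the single oversized match. With "\s*" each of the three tags should be matched separately.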

 