GS said:
what is a good general regex expression for html <img ....> tag?
I tried
"<img [/:.a-z =0-9\""_;&]*\->", RegexOptions.IgnoreCase)
but it is not quite working
thank you for your time
It looks like you already have a working answer, but I still want to
comment on a few issues.
By default, the . does not match newlines, so image tags like these
<img
src = "
http://.../img.gif
/>
won't be matched if your expression is <img.*>. Adding or removing
the ? to make it <img.*?> doesn't change that. There is an option to
allow . to match newlines (RegexOptions.Singleline), but combined with
a greedy .* that option is potentially very resource intensive (if your
input is 2MB, .* will first consume all 2MB and start backtracking from
there).
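As a quick illustration, here is a small sketch. It uses Python's re
module rather than .NET (an assumption made only for the example), where
re.DOTALL plays the role of RegexOptions.Singleline. The sample markup
and file name are made up.

```python
import re

# Hypothetical sample input: an image tag spread over several lines.
html = '<p>before</p>\n<img\nsrc = "picture.gif"\n/>\n<p>after</p>'

# Without any option "." stops at a newline, so neither form matches.
greedy_default = re.search(r"<img.*>", html)
lazy_default = re.search(r"<img.*?>", html)

# With re.DOTALL "." also matches newlines.
greedy_dotall = re.search(r"<img.*>", html, re.DOTALL)
lazy_dotall = re.search(r"<img.*?>", html, re.DOTALL)

print(greedy_default, lazy_default)  # None None
print(repr(greedy_dotall.group()))   # runs on to the last ">" in the input
print(repr(lazy_dotall.group()))     # stops at the first ">" after "<img"
```

The greedy version with the option switched on swallows everything up
to the final > of the input, which is where the backtracking cost comes
from on large documents.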
A safer expression would be the following: <img[^>]*>. This matches
everything from <img up to the first > that follows. This will work in
most cases. There's one problem though: > is allowed within quoted
attribute values if you follow the standards. This can also be handled
in the regex:
<img("[^"]*"|'[^']*'|[^>])*>
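To see the difference between the two expressions, here is a minimal
sketch (again in Python's re module as a stand-in for .NET; the names
SIMPLE and QUOTE_AWARE and the sample tag are invented for the example):

```python
import re

SIMPLE = re.compile(r"<img[^>]*>", re.IGNORECASE)
QUOTE_AWARE = re.compile(r"""<img("[^"]*"|'[^']*'|[^>])*>""", re.IGNORECASE)

# Hypothetical tag with a ">" inside a quoted attribute value.
tag = '<img src="a.gif" alt="1 > 0">'
html = tag + " trailing text"

simple_match = SIMPLE.search(html).group()
quote_aware_match = QUOTE_AWARE.search(html).group()

print(simple_match)       # <img src="a.gif" alt="1 >
print(quote_aware_match)  # <img src="a.gif" alt="1 > 0">
```

The simple version stops at the > inside the alt text, while the
quote-aware version skips over each quoted value as a unit and only
stops at the > that really closes the tag.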
If you want to catch a corresponding </img> tag as well, things get
harder, though this is still possible to a certain degree.
First we match everything up to the end of the tag
<img("[^"]*"|'[^']*'|[^>])*
and then we match either /> or >......</img>
(/>|>.*?</img>)
As you can see, I added the lazy modifier again, but this suffers from
the same issues as before. Is there a better solution, you might ask?
And of course there is :)
By using a negative look-ahead we can match any run of characters that
stops just before the start of </img, as follows:
((?!</img).)*
Combine this with what we already had and you get this:
<img("[^"]*"|'[^']*'|[^>])*(/>|>((?!</img).)*</img>)
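A small sketch of how the look-ahead behaves, first on its own and then
inside the combined expression. It is written with Python's re module
as a stand-in for .NET, and it assumes the option that lets . match
newlines (re.DOTALL here) is switched on so that content spanning
several lines is covered. The names and sample strings are made up.

```python
import re

FLAGS = re.IGNORECASE | re.DOTALL

# The look-ahead fragment alone: consume characters until "</img" is next.
TEMPERED = re.compile(r"((?!</img).)*", FLAGS)
consumed = TEMPERED.match("first caption</img> second caption</img>").group()

# The combined expression from the text.
COMBINED = re.compile(
    r"""<img("[^"]*"|'[^']*'|[^>])*(/>|>((?!</img).)*</img>)""", FLAGS
)
self_closing = COMBINED.search('<img src="a.gif"/> and some text').group()
with_close = COMBINED.search(
    '<img src="b.gif">two\nlines</img> more </img>'
).group()

print(repr(consumed))      # 'first caption'
print(repr(self_closing))  # '<img src="a.gif"/>'
print(repr(with_close))    # '<img src="b.gif">two\nlines</img>'
```

In the last case the match ends at the first </img>, because the
look-ahead refuses to step over it.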
Only one issue left to tackle. The </img> tag does not necessarily have
the closing > directly after the tag name. Whitespace is allowed in the
closing tag. This can easily be added with \s* (zero or more whitespace
characters, so that a plain </img> still matches):
<img("[^"]*"|'[^']*'|[^>])*(/>|>((?!</img).)*</img\s*>)
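Here is a hedged sketch that exercises the final expression. It uses
Python's re module in place of .NET, with re.DOTALL assumed so that .
can cross newlines. It allows whitespace with \s* rather than \s+, since
\s+ would require at least one whitespace character and reject a plain
</img>. IMG_ELEMENT, WITH_PLUS, find_img_elements and the sample inputs
are names invented for this example.

```python
import re

FLAGS = re.IGNORECASE | re.DOTALL

IMG_ELEMENT = re.compile(
    r"""<img("[^"]*"|'[^']*'|[^>])*(/>|>((?!</img).)*</img\s*>)""", FLAGS
)
# Same expression with \s+ for comparison.
WITH_PLUS = re.compile(
    r"""<img("[^"]*"|'[^']*'|[^>])*(/>|>((?!</img).)*</img\s+>)""", FLAGS
)


def find_img_elements(html):
    """Return the text of every IMG_ELEMENT match in html."""
    return [m.group() for m in IMG_ELEMENT.finditer(html)]


tight = find_img_elements('<IMG src="a.gif">caption</IMG>')
spaced = find_img_elements('<img src="a.gif">caption</img  >')
self_closing = find_img_elements('<p><img alt="a > b" src="a.gif" /></p>')
plus_on_tight = WITH_PLUS.search('<IMG src="a.gif">caption</IMG>')

# Caveat: a self-closing tag followed later by a closing tag is matched
# as one long element, because [^>] first consumes the "/" before ">".
run_on = find_img_elements(
    '<img src="a.gif"/> text <img src="b.gif">caption</img>'
)

print(tight, spaced, self_closing, plus_on_tight, run_on, sep="\n")
```

The last case is worth keeping in mind: on input that mixes both styles,
the expression can return one match that spans two images.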
Kind regards,
Jesse Houwing