Regular expression for cleaning HTML safely

  • Thread starter: Steve B.

Steve B.

Hi,

I'm building a web site that renders HTML submitted by users.
The problem is that this HTML cannot be trusted, so I need to make sure it does
not contain script injection attacks.
That's why I'd like to define a set of allowed tags and remove all the
others.

I'm thinking of using regular expressions. I was able to find some regex
samples that remove a set of untrusted tags (script, iframe, etc.), but I'd
rather allow only a set of safe tags, because those regexes can only remove
"well formed" tags: a <script> without a matching </script> won't be removed.

So does anyone have a regex that removes any content between tags that are
not in a safe list?
And is it also possible to remove any attribute that could be
dangerous, e.g. <span onload="javascript:attack(...)">?

Thanks in advance
 
You may give www.regexlib.com a shot.
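
If a pure regex turns out to be too fragile for the unclosed <script> case, the allowlist idea from the question can also be sketched with a tolerant HTML parser instead. The following is only a minimal illustration in Python using the standard library's html.parser, not a vetted sanitizer. The names ALLOWED_TAGS, ALLOWED_ATTRS, SAFE_URL_PREFIXES, DROP_CONTENT_TAGS and sanitize are hypothetical and chosen for this example. It assumes that text inside unknown tags should be kept (escaped) while the tags themselves are dropped, and that everything inside script, style and iframe should be discarded.

```python
from html import escape
from html.parser import HTMLParser

# Hypothetical allowlists for illustration only; adjust to the site's needs.
ALLOWED_TAGS = {"a", "b", "i", "em", "strong", "p", "br", "ul", "ol", "li", "span"}
ALLOWED_ATTRS = {"a": {"href", "title"}}       # attributes kept per tag
SAFE_URL_PREFIXES = ("http://", "https://", "mailto:")
VOID_TAGS = {"br"}                             # tags that take no closing tag
DROP_CONTENT_TAGS = {"script", "style", "iframe"}  # tag AND its content removed


class AllowlistSanitizer(HTMLParser):
    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.out = []
        self.skip_depth = 0  # > 0 while inside a DROP_CONTENT_TAGS element

    def handle_starttag(self, tag, attrs):
        if tag in DROP_CONTENT_TAGS:
            self.skip_depth += 1
            return
        if self.skip_depth or tag not in ALLOWED_TAGS:
            return  # unknown tag: drop the tag, keep later text
        kept = []
        for name, value in attrs:
            # Anything not explicitly allowed (onload, onclick, style...) is dropped.
            if name not in ALLOWED_ATTRS.get(tag, set()):
                continue
            value = value or ""
            if name == "href" and not value.strip().lower().startswith(SAFE_URL_PREFIXES):
                continue  # rejects javascript: and other unexpected schemes
            kept.append(f' {name}="{escape(value, quote=True)}"')
        self.out.append(f"<{tag}{''.join(kept)}>")

    def handle_endtag(self, tag):
        if tag in DROP_CONTENT_TAGS:
            if self.skip_depth:
                self.skip_depth -= 1
            return
        if self.skip_depth or tag not in ALLOWED_TAGS or tag in VOID_TAGS:
            return
        self.out.append(f"</{tag}>")

    def handle_data(self, data):
        # Text is re-escaped so it can never be read back as markup.
        if not self.skip_depth:
            self.out.append(escape(data, quote=False))


def sanitize(html_text):
    parser = AllowlistSanitizer()
    parser.feed(html_text)
    parser.close()
    return "".join(parser.out)


if __name__ == "__main__":
    print(sanitize('<p onclick="x()">Hi <b>there</b></p><script>attack()'))
```

Because the output is rebuilt only from allowed tags and attributes, a <script> with no closing </script> simply causes everything after it to be discarded rather than slipping through. The sketch does not balance tags or validate nesting, so a maintained sanitizing library would be a safer choice for production use.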
