Regular expression to remove all html tags except for p and br

Guest · Feb 21, 2004

Hi all

Can someone help me out with a regex to remove all HTML tags except for <p>, </p>, <br>, and </br> from a string?

Thanks

Jim

Gary Chang · Feb 21, 2004

Hi jim,

Thanks for posting in the community.

Currently I am looking for somebody who could help you on it. We will reply
here with more information as soon as possible.
If you have any more concerns on it, please feel free to post here.

Thanks!

Best regards,

Gary Chang
Microsoft Online Partner Support

Get Secure! - www.microsoft.com/security
This posting is provided "AS IS" with no warranties, and confers no rights.
--------------------

Guest · Feb 24, 2004

Hi Gary

I'm just following up to see if you have had any luck with this.

Thanks

Jim

Tian Min Huang · Feb 24, 2004

Hello Jim,

Thanks for your post. I wrote the following pattern, which will remove all
HTML tags except for <p>, <br>, and their closing tags:

<[^/bp][^>]*>|<p[a-z][^>]*>|<b[^r][^>]*>|<br[a-z][^>]*>|</[^bp]+>|</p[a-z]+>|</b[^r]+>|</br[a-z]+>

Please check it on your side and let me know your result.
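
One quick way to check the pattern is a minimal sketch like the one below. It uses Python's `re` module rather than .NET, on the assumption that the two engines treat this particular pattern the same way. The function name `strip_tags_huang` and the sample strings are made up for illustration.

```python
import re

# The pattern from this post, joined onto one line.
HUANG_PATTERN = re.compile(
    r"<[^/bp][^>]*>|<p[a-z][^>]*>|<b[^r][^>]*>|<br[a-z][^>]*>"
    r"|</[^bp]+>|</p[a-z]+>|</b[^r]+>|</br[a-z]+>"
)

def strip_tags_huang(html: str) -> str:
    """Replace every match of the pattern with an empty string."""
    return HUANG_PATTERN.sub("", html)

# Simple case: <div> goes, <p> and <br> stay.
print(strip_tags_huang("<div><p>Hello</p><br></div>"))  # <p>Hello</p><br>

# Edge case: in "<b[^r][^>]*>" the [^r] consumes the first ">", so the
# match runs on to the next ">" and the text between the tags is lost.
print(repr(strip_tags_huang("<b>bold</b>")))  # ''

# Edge case: "</[^bp]+>" cannot cross the "p" in "span", so this closing
# tag is left in place.
print(strip_tags_huang("text</span>"))  # text</span>
```

In this sketch the simple case comes out as intended, but the two edge cases suggest the pattern needs more testing against real input before it is relied on.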

Have a nice day!

Regards,

HuangTM
Microsoft Online Partner Support
MCSE/MCSD

Eric Gunnerson [MS] · Feb 25, 2004

With negative lookahead in .NET regular expressions, you can write this in a
much simpler form:

<(?!br|/br|p|/p).+?>

That will match every tag whose contents do not start with br, /br, p, or
/p, and you can use that to replace all those tags with an empty string.
This is also more robust, as you don't have to make sure you list every tag
you want removed. I noticed that <script> is absent from the earlier
pattern, which could possibly lead to a security exploit (somebody enters
script code, and when it gets echoed back, it executes on a user's computer).
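
As a rough illustration of the replace step, here is a minimal sketch in Python's `re` module, which accepts the same negative-lookahead syntax as .NET for this pattern. It assumes the lookahead group is closed as `(?!br|/br|p|/p)`. The function name `strip_disallowed_tags` and the sample input are hypothetical.

```python
import re

# Negative lookahead: match "<...>" unless the text right after "<"
# starts with br, /br, p, or /p.
TAG_PATTERN = re.compile(r"<(?!br|/br|p|/p).+?>")

def strip_disallowed_tags(html: str) -> str:
    """Remove every tag the lookahead does not exempt."""
    return TAG_PATTERN.sub("", html)

sample = "<div><p>Hello <b>world</b></p><br><script>alert(1)</script></div>"
print(strip_disallowed_tags(sample))
# <p>Hello world</p><br>alert(1)
```

Note that only the tags are removed: the text that was inside `<script>` stays behind as plain text. Also, because the lookahead only checks how the tag starts, tags such as `<pre>` or `<param>` begin with `p` and are kept as well.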

You will want to use a case-insensitive match or you won't allow the
uppercase versions of the strings.
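
To show what the case-insensitive option changes, here is a small sketch, again in Python's `re` as a stand-in (in .NET the corresponding option is `RegexOptions.IgnoreCase`). `STRICT_PATTERN` is a hypothetical tighter variant that is not from this thread: it adds `\b` so that only tags named exactly `p` or `br` are exempt.

```python
import re

# The lookahead pattern with its group closed.
PATTERN = r"<(?!br|/br|p|/p).+?>"

# Hypothetical stricter variant (an assumption, not from the thread):
# \b requires the tag name to end right after "br" or "p".
STRICT_PATTERN = r"<(?!/?(?:br|p)\b)[^>]+>"

sample = "<P>Hi</P><BR><PRE>x</PRE><Div>y</Div>"

case_sensitive = re.sub(PATTERN, "", sample)
case_insensitive = re.sub(PATTERN, "", sample, flags=re.IGNORECASE)
strict = re.sub(STRICT_PATTERN, "", sample, flags=re.IGNORECASE)

print(case_sensitive)    # Hixy
print(case_insensitive)  # <P>Hi</P><BR><PRE>x</PRE>y
print(strict)            # <P>Hi</P><BR>xy
```

Without the flag, the uppercase `<P>` and `<BR>` are stripped along with everything else. With it, they survive, and the stricter variant additionally drops `<PRE>`.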

--
Eric Gunnerson

Visit the C# product team at http://www.csharp.net
Eric's blog is at http://weblogs.asp.net/ericgu/
