Hello Michal,
The page on 4guysfromrolla.com (suggested by Ravikanth) and the RegEx approach (suggested by another developer) could work for you.
However, there are some other issues. Even after you strip out all the <htmltags>, you may be left with HTML-encoded
strings (character entities), which you will also want to decode. These are easily handled with
System.Web.HttpUtility.HtmlDecode()
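To make the strip-then-decode order concrete, here is a minimal sketch. It is written in Python only so that it can be run as-is; in .NET the same two steps would be a Regex replace followed by HttpUtility.HtmlDecode. The pattern and the function name strip_tags_and_decode are my own illustration, not something from the 4guysfromrolla page.

```python
import html
import re

# Deliberately naive pattern (an assumption for this sketch):
# anything from '<' up to the next '>' counts as a tag.
TAG_PATTERN = re.compile(r"<[^>]+>")


def strip_tags_and_decode(markup: str) -> str:
    """Remove tags first, then decode character entities."""
    without_tags = TAG_PATTERN.sub("", markup)
    # Decoding comes second on purpose: text such as "&lt;today&gt;" would
    # otherwise turn into "<today>" and be stripped as if it were a tag.
    return html.unescape(without_tags)


print(strip_tags_and_decode("<p>Fish &amp; chips &lt;today&gt;</p>"))
```

This handles only the simple case; it does nothing about line breaks, scripts or tables, which is what the rest of this post is about.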
And now, the long explanation of why this won't be good enough. Many issues remain unresolved (others have posted
about them before):
1) Rendered line feeds versus actual line feeds. The line feeds in the HTML source are generally NOT the ones that are
rendered. Elements such as BR and P determine where lines break on the rendered page.
2) What you are going to do with any elements outside the BODY tag, and with text that is left over inside elements
such as OBJECT or SCRIPT.
3) Complex pages that have multiple DIV, LAYER or SPAN elements - some of which are only displayed conditionally
based on things such as browser version or client-side events.
4) TABLEs. Even though the HTML source for a table is entered in a left-to-right fashion, rows and columns can be spanned,
so two words that are rendered together on the page may not be next to each other in the source code.
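Points 1 and 2 can be sketched with a small event-driven parser instead of a regex. Again this is Python purely for illustration, using its standard html.parser module. The class name TextExtractor, the function html_to_text and the two tag sets are assumptions of mine; you would adjust them to the pages you actually have. Points 3 and 4 (conditional content, spanned tables) are not addressed at all.

```python
import re
from html.parser import HTMLParser

# Assumed for this sketch: tags that start or end a rendered line.
BLOCK_TAGS = {"br", "p", "div", "tr", "li"}
# Assumed for this sketch: tags whose content is never shown as page text.
SKIPPED_TAGS = {"script", "style", "head", "object"}


class TextExtractor(HTMLParser):
    def __init__(self):
        super().__init__(convert_charrefs=True)  # entities arrive decoded
        self.parts = []
        self.skip_depth = 0  # > 0 while inside a skipped element

    def handle_starttag(self, tag, attrs):
        if tag in SKIPPED_TAGS:
            self.skip_depth += 1
        elif tag in BLOCK_TAGS and self.skip_depth == 0:
            self.parts.append("\n")

    def handle_endtag(self, tag):
        if tag in SKIPPED_TAGS:
            self.skip_depth = max(0, self.skip_depth - 1)
        elif tag in BLOCK_TAGS and self.skip_depth == 0:
            self.parts.append("\n")

    def handle_data(self, data):
        if self.skip_depth == 0:
            # Line feeds in the source are not rendered: fold them into spaces.
            self.parts.append(re.sub(r"\s+", " ", data))


def html_to_text(markup: str) -> str:
    parser = TextExtractor()
    parser.feed(markup)
    parser.close()
    lines = "".join(parser.parts).split("\n")
    cleaned = [re.sub(r" +", " ", line).strip() for line in lines]
    # Blank lines are dropped, so paragraphs end up one per line.
    return "\n".join(line for line in cleaned if line)
```

Note that a skipped element that is never closed (a HEAD without its end tag, for instance) would swallow the rest of the page in this sketch, so it is only suitable for reasonably well-formed input.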
Basically, you need to decide, in advance, what you are looking for and what your end result is going to be. If you're just
trying to parse a simple HTML page with a reasonably predictable format then a simple regex will do the trick. If you are
looking for specific elements with some important text then running a regex and looping through the matches with
for...next would be in order.
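As an illustration of the "regex plus loop through the matches" approach, the sketch below pulls hyperlinks out of a page, since you mentioned wanting to keep them. It is Python only so that it runs as written; the .NET counterpart would be Regex.Matches with a For Each loop. The pattern, LINK_PATTERN and extract_links are hypothetical names of mine, and the pattern assumes quoted href values.

```python
import html
import re

# Assumption: href values are quoted and anchors are not nested.
LINK_PATTERN = re.compile(
    r'<a\s[^>]*?href\s*=\s*["\']([^"\']*)["\'][^>]*>(.*?)</a>',
    re.IGNORECASE | re.DOTALL,
)
TAG_PATTERN = re.compile(r"<[^>]+>")


def extract_links(markup: str) -> list:
    """Return (label, url) pairs for every anchor the pattern finds."""
    links = []
    for match in LINK_PATTERN.finditer(markup):
        url = html.unescape(match.group(1))
        # The label may contain inline tags such as <b>; strip them first.
        label = html.unescape(TAG_PATTERN.sub("", match.group(2))).strip()
        links.append((label, url))
    return links


page = '<p>See <a href="http://example.com/">the <b>example</b> site</a>.</p>'
for label, url in extract_links(page):
    print(f"{label} [{url}]")
```

Each pair could then be written into the plain text as "label [url]", roughly what a mail client does when it converts HTML to plain text.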
Best regards,
Yanhong Huang
Microsoft Online Partner Support
Get Secure! - www.microsoft.com/security
This posting is provided "AS IS" with no warranties, and confers no rights.
--------------------
!From: "Michal A. Valasek" <[email protected]>
!Subject: How to strip HTML markup from string?
!Date: Sat, 9 Aug 2003 04:48:20 +0200
!Newsgroups: microsoft.public.dotnet.framework.aspnet
!
!Hello,
!
!I want to transform text with HTML markup to plain text. Is there some
!simple way how to do it?
!
!I can surely write my own function, which would simply strip everything with
!< and >. But if someone has already written something similar for .NET, I
!would prefer more clever solution, which would try to retain original
!layout, at least paragraphs, hyperlinks etc - something like Outlook does
!when changing HTML to plain text.
!
!
!--
!Michal A. Valasek, Altair Communications, http://www.altaircom.net
!Please do not reply to this e-mail, for contact see http://www.rider.cz
!Keeping Freedom safe from Democracy
!
!
!