HTML And Regular Expression

Ori · Jan 7, 2004

Hi, I'm working with C#.NET and I'm looking for the following.

I have the content of a web page and I want to pull out all the text that
appears on the page, without the HTML tags.

I know that there is a way to do it with a regular expression.

Does anyone know how to do it?

Please help.

Thanks,

Ori.

Nick Harris · Jan 7, 2004

Yeah, use regular expressions. ;-)

Identify a unique pattern in the web page. Say this is the stream from the
page:

<html>
<head>
<table>
<tr>
<td> Dentist </td>
<td>Phone</td>
</tr>
</table>
</head>
</html>

In this case you want to find the 2nd <td> tag and its contents.

/.*<td>.*<td>(.*)<\/td>/

The expression above says: give me all characters until you find the <td>
tag, then give me everything until you find another one. When you do, store
the value between the second <td> tag and the first </td> tag you find and put
it in a variable for me.
That special variable is $1. This is vi syntax; it might be a bit different in M$ land.
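
For the C# side, here is a rough sketch of the same idea, not tested against a real page. The class name SecondCellDemo and the method SecondCell are made up for illustration. In .NET the captured value is read from Groups[1] on the Match object rather than from $1, and RegexOptions.Singleline is assumed here because the sample spans several lines and "." does not match a newline by default.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical helper class, named for this example only.
public static class SecondCellDemo
{
    // The sample page from the post above, one tag per line.
    public const string Page =
        "<html>\n<head>\n<table>\n<tr>\n<td> Dentist </td>\n<td>Phone</td>\n</tr>\n</table>\n</head>\n</html>";

    // Returns the text captured by group 1 (the contents of the second <td>),
    // or null when the pattern does not match.
    public static string SecondCell(string html)
    {
        // Singleline lets "." match newlines, which the multi-line sample needs.
        Match m = Regex.Match(html, @".*<td>.*<td>(.*)</td>", RegexOptions.Singleline);
        return m.Success ? m.Groups[1].Value : null;
    }

    public static void Main()
    {
        Console.WriteLine(SecondCell(Page)); // prints "Phone"
    }
}
```

This only lines up with "the second cell" because the sample has exactly two <td> tags. On a page with more cells the greedy ".*" parts will skip ahead to later ones, so the pattern would need tightening.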

Hope this helps.

Nick Harris, MCSD
http://www.VizSoft.net

Peter Rilling · Jan 7, 2004

Regular expressions are greedy by default. If you try to get everything
between the open and close tags, the engine will look for the last closing
tag and grab everything in between. ".*" will consume as much as it can
before returning, so be careful with ".*". For example, if you have
<td>dsafdsfsf</td><td>fdsafdsfs</td>, then a regular expression like
<td>.*</td> will match the entire content, even though you would
logically want two matches.
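
To make the difference concrete, here is a small sketch that counts matches for the greedy and the lazy form of the pattern on the example string above. The class name GreedyVsLazy and the helper CountMatches are invented for this illustration.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical demo class, named for this example only.
public static class GreedyVsLazy
{
    // The two-cell example string from the paragraph above.
    public const string Row = "<td>dsafdsfsf</td><td>fdsafdsfs</td>";

    // Counts how many non-overlapping matches the pattern finds in the input.
    public static int CountMatches(string input, string pattern)
    {
        return Regex.Matches(input, pattern).Count;
    }

    public static void Main()
    {
        Console.WriteLine(CountMatches(Row, "<td>.*</td>"));  // prints 1 (greedy)
        Console.WriteLine(CountMatches(Row, "<td>.*?</td>")); // prints 2 (lazy)

        // With a lazy capture group, each cell comes back separately.
        foreach (Match m in Regex.Matches(Row, "<td>(.*?)</td>"))
        {
            Console.WriteLine(m.Groups[1].Value);
        }
    }
}
```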

.NET has non-greedy modifiers for its quantifiers. You might want to try
something like "\<html\>(.*?)\</html\>"

NOTE: Since most pages will only have one set of HTML tags, you might be
able to use something simple like "\<html\>(.*)\</html\>"

I have not actually tried this to make sure the syntax is correct, but it
should give you an idea. This is all from memory. Also, the .NET
documentation describes all the quantifiers that are supported.

BMermuys · Jan 8, 2004

Hi,

Ori said:
Hi, I'm working with C#.NET and I'm looking for the following.

I have the content of a web page and I want to pull out all the text that
appears on the page, without the HTML tags.

If you don't mind losing all formatting, and keeping the blank spaces and CRLFs
which are ignored in HTML, then you can do it with Regex.Replace:

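// Requires: using System.Text.RegularExpressions;
string html = "..........";
// Lazily match each tag, from "<" to the nearest ">", and replace it with nothing.
string text = Regex.Replace(html, "\\<.*?\\>", "");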
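
If the leftover blank spaces and CRLFs are a problem, one possible follow-up is a second Regex.Replace that collapses whitespace. This is only a sketch: the names HtmlText and StripTags are made up, Singleline is an assumption to cover tags that wrap across lines, and each tag is replaced with a space (not an empty string as above) so that neighbouring cells do not run together.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical helper class, named for this example only.
public static class HtmlText
{
    // Removes anything that looks like a tag, then collapses whitespace runs.
    public static string StripTags(string html)
    {
        // Singleline lets "." match newlines, so a tag split over two lines is still removed.
        string noTags = Regex.Replace(html, "<.*?>", " ", RegexOptions.Singleline);

        // Turn every run of spaces, tabs and line breaks into a single space.
        return Regex.Replace(noTags, @"\s+", " ").Trim();
    }

    public static void Main()
    {
        string html = "<tr>\n<td> Dentist </td>\n<td>Phone</td>\n</tr>";
        Console.WriteLine(StripTags(html)); // prints "Dentist Phone"
    }
}
```

Like the one-liner above, this treats everything between "<" and ">" as a tag, so the contents of script and style blocks are left in the text, and a ">" inside an attribute value or comment will cut a tag short.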

HTH,
greetings
