Parsing / processing a stream of HTML

Mark Rae · Feb 23, 2004

Hi,

I'm using HttpWebRequest and HttpWebResponse to return a stream of HTML.
Looking for advice as to the accepted / easiest / most efficient way to
process this HTML in the background i.e. I don't want to display it all to
the user, just pull out certain pieces of it.

Specifically, I'm looking to evaluate the tabledefs it contains - walk
through their rows and columns etc.

Any assistance gratefully received as ever.

Best regards,

Mark Rae

Jay Douglas · Feb 23, 2004

Mark,
I would seriously consider using regular expressions to extract the
content you are looking for out of your html string.

http://www.regular-expressions.info/dotnet.html
http://www.ondotnet.com/pub/a/dotnet/2002/03/11/regex2.html

Mark Rae · Feb 23, 2004

Jay Douglas said:
Mark,
I would seriously consider using regular expressions to extract the
content you are looking for out of your html string.

http://www.regular-expressions.info/dotnet.html
http://www.ondotnet.com/pub/a/dotnet/2002/03/11/regex2.html

Thanks for the reply. Will that, e.g. allow me to extract all the text
between "<table" and "</table>"?

Alternatively, is there a way to reference a stream of HTML and treat it as
if it were an HTML document from which I could evaluate the tabledefs
collection etc?

Mark

Jay Douglas · Feb 23, 2004

Mark,

With regular expressions, you can extract text from all sorts of
different patterns including text in-between table tags.
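
To make that concrete, here is a minimal sketch of the regular-expression approach. It is written in Python rather than .NET purely so it can be run standalone; the pattern, the `extract_tables` helper, and the sample markup are illustrative assumptions, not taken from the linked articles. It also assumes the tables are not nested.

```python
import re

# Non-greedy match from "<table ...>" to the nearest "</table>".
# DOTALL lets "." span newlines; IGNORECASE also accepts <TABLE>.
TABLE_RE = re.compile(r"<table\b[^>]*>(.*?)</table>", re.DOTALL | re.IGNORECASE)


def extract_tables(html):
    """Return the inner markup of each (non-nested) table in html."""
    return [m.group(1) for m in TABLE_RE.finditer(html)]


# Hypothetical sample input.
sample = (
    "<p>intro</p><table id='a'><tr><td>1</td></tr></table>"
    "<TABLE>\n<tr><td>2</td></tr>\n</TABLE>"
)
tables = extract_tables(sample)
```

The non-greedy `.*?` matters here: a greedy `.*` would run from the first opening tag to the last closing tag and merge both tables into one match.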

Now about changing attributes and elements of the html string... I've
seen some examples where the html is first transformed into an xml string,
the attributes of certain elements are modified, and the result is then
converted back to an html string.

Here's a link to start your research with:

http://www.fawcette.com/vsm/2002_03/online/online_eprods/c_wagner_03_18/
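
As a rough illustration of the HTML-to-XML round trip described above (not the method from the linked article), the sketch below parses markup as XML, changes an attribute, and serializes it again. It is in Python for the sake of a runnable example, and it assumes the input is already well-formed XML; real-world HTML usually is not, so a conversion step would be needed first. The `set_table_border` helper and the sample markup are hypothetical.

```python
import xml.etree.ElementTree as ET


def set_table_border(xhtml, border):
    """Parse well-formed markup as XML, set border on every table, serialize it back."""
    root = ET.fromstring(xhtml)          # raises ParseError if the markup is not well-formed
    for table in root.iter("table"):     # every <table> element, at any depth
        table.set("border", border)
    return ET.tostring(root, encoding="unicode")


# Hypothetical sample input: a well-formed fragment.
sample = '<div><table border="0"><tr><td>1</td></tr></table></div>'
result = set_table_border(sample, "1")
```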

Mark Rae · Feb 24, 2004

Jay,

With regular expressions, you can extract text from all sorts of
different patterns including text in-between table tags.

Now about changing attributes and elements of the html string... I've
seen some examples where the html is first transformed into an xml string,
the attributes of certain elements are modified, and the result is then
converted back to an html string.

Here's a link to start your research with:

http://www.fawcette.com/vsm/2002_03/online/online_eprods/c_wagner_03_18/

Thanks for this. I looked at it, and found that it was more than I needed.

In the end, I extracted the various <tr>...</tr> lines out of the HTML
stream, and then processed them with the standard Substring() and
IndexOf() methods of the String object.
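
For anyone wanting to try the same thing, the approach can be sketched roughly as follows. This is not the actual code used, just an illustration in Python, where `str.find` and slicing stand in for IndexOf() and Substring(). The `between` and `rows_and_cells` helpers are hypothetical names, and the sketch assumes lowercase tags with no attributes and no nesting.

```python
def between(text, start_tag, end_tag, pos=0):
    """Return (inner_text, next_pos) for the first start_tag...end_tag pair
    at or after pos, or (None, -1) if there is no complete pair."""
    start = text.find(start_tag, pos)
    if start == -1:
        return None, -1
    start += len(start_tag)
    end = text.find(end_tag, start)
    if end == -1:
        return None, -1
    return text[start:end], end + len(end_tag)


def rows_and_cells(html):
    """Split html into rows, and each row into its cell strings."""
    rows = []
    pos = 0
    while True:
        row, pos = between(html, "<tr>", "</tr>", pos)
        if row is None:
            break
        cells = []
        cell_pos = 0
        while True:
            cell, cell_pos = between(row, "<td>", "</td>", cell_pos)
            if cell is None:
                break
            cells.append(cell)
        rows.append(cells)
    return rows


# Hypothetical sample input.
sample = "<table><tr><td>a</td><td>b</td></tr><tr><td>c</td></tr></table>"
grid = rows_and_cells(sample)
```

Because the search strings are exact, a tag such as `<td class="x">` or `<TR>` would be missed, which is fine for a known, regular page but fragile in general.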

Job done.

Best,

Mark

Joerg Jooss · Feb 28, 2004

Jay said:
Mark,

With regular expressions, you can extract text from all sorts of
different patterns including text in-between table tags.

Not really. You cannot match corresponding opening and closing tags for
example, because there's no way to express such constructs using regular
expressions (see context-free grammars).
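
A small sketch of where a plain pattern goes wrong, in Python so it can be run as-is (the patterns and sample strings are illustrative assumptions): once tables nest, a non-greedy match stops at the first closing tag, and a greedy match on sibling tables swallows everything in between.

```python
import re

NON_GREEDY = re.compile(r"<table>(.*?)</table>", re.DOTALL)
GREEDY = re.compile(r"<table>(.*)</table>", re.DOTALL)

# A table nested inside a cell of another table.
nested = "<table><tr><td><table><tr><td>inner</td></tr></table></td></tr></table>"
# The non-greedy match ends at the INNER </table>, cutting the outer table short.
first = NON_GREEDY.search(nested).group(1)

# Two sibling tables with a paragraph between them.
siblings = "<table>a</table><p>x</p><table>b</table>"
# The greedy match runs to the LAST </table>, merging both tables.
merged = GREEDY.search(siblings).group(1)
```

Neither variant can pair each opening tag with its own closing tag, because that requires counting nesting depth.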

I'd rather use a real parser such as Chris Lovett's SGML parser.

http://www.gotdotnet.com/Community/...mpleGuid=B90FDDCE-E60D-43F8-A5C4-C3BD760564BC
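
To show what the parser-based approach looks like in practice, here is a minimal sketch. It does not use the SGML parser linked above; it swaps in Python's standard-library `html.parser` so the example is runnable on its own, and the `TableWalker` class and sample markup are hypothetical. It walks tables, rows, and cells and copes with a table nested inside a cell.

```python
from html.parser import HTMLParser


class TableWalker(HTMLParser):
    """Collect cell text grouped by row and by table, including nested tables."""

    def __init__(self):
        super().__init__()
        self.tables = []  # finished tables, each a list of rows of cell text
        self._open = []   # stack of [rows, current_cell_parts] for open tables

    def handle_starttag(self, tag, attrs):
        if tag == "table":
            self._open.append([[], None])
        elif not self._open:
            return  # ignore rows and cells outside any table
        elif tag == "tr":
            self._open[-1][0].append([])
        elif tag in ("td", "th"):
            if not self._open[-1][0]:
                self._open[-1][0].append([])  # cell without an explicit <tr>
            self._open[-1][1] = []

    def handle_data(self, data):
        # Text belongs to the current cell of the innermost open table.
        if self._open and self._open[-1][1] is not None:
            self._open[-1][1].append(data)

    def handle_endtag(self, tag):
        if not self._open:
            return
        top = self._open[-1]
        if tag in ("td", "th") and top[1] is not None:
            top[0][-1].append("".join(top[1]).strip())
            top[1] = None
        elif tag == "table":
            self.tables.append(self._open.pop()[0])


# Hypothetical sample input: a table nested inside a cell.
nested = (
    "<table><tr><td>outer<table><tr><td>inner</td></tr></table></td>"
    "<td>x</td></tr></table>"
)
walker = TableWalker()
walker.feed(nested)
walker.close()
```

Tables are recorded in the order they close, so the inner table appears first. Note that `html.parser` only tokenizes the markup; it does not apply HTML's implied end-tag rules, so cells with omitted closing tags would need extra handling.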

Cheers,

Mark Rae · Feb 28, 2004

Not really. You cannot match corresponding opening and closing tags for
example, because there's no way to express such constructs using regular
expressions (see context-free grammars).

I'm having no problems thus far extracting strings between the following
tags:

<tr>...</tr>
<td>...</td>

I'd rather use a real parser such as Chris Lovett's SGML parser.

http://www.gotdotnet.com/Community/...mpleGuid=B90FDDCE-E60D-43F8-A5C4-C3BD760564BC

Very useful!

Mark

Joerg Jooss · Feb 29, 2004

Mark said:
I'm having no problems thus far extracting strings between the
following tags:

<tr>...</tr>
<td>...</td>
<p>...</p>

Sure, provided they all match nicely. What about <p><p></p> ;-)
