Parsing / processing a stream of HTML

  • Thread starter Mark Rae

Mark Rae

Hi,

I'm using HttpWebRequest and HttpWebResponse to return a stream of HTML.
I'm looking for advice on the accepted / easiest / most efficient way to
process this HTML in the background, i.e. I don't want to display it all to
the user, just pull out certain pieces of it.

Specifically, I'm looking to evaluate the tabledefs it contains - walk
through their rows and columns etc.

Any assistance gratefully received as ever.

Best regards,

Mark Rae
 
Jay Douglas said:
Mark,
I would seriously consider using regular expressions to extract the
content you are looking for out of your html string.

http://www.regular-expressions.info/dotnet.html
http://www.ondotnet.com/pub/a/dotnet/2002/03/11/regex2.html

Thanks for the reply. Will that, e.g. allow me to extract all the text
between "<table" and "</table>"?

Alternatively, is there a way to reference a stream of HTML and treat it as
if it were an HTML document from which I could evaluate the tabledefs
collection etc?

Mark
 
Mark,

With regular expressions, you can extract text from all sorts of
different patterns including text in-between table tags.
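
As a rough illustration of the idea, here is a minimal sketch in Python (the thread itself is about .NET, where System.Text.RegularExpressions.Regex plays the same role). The pattern, the extract_tables helper, and the sample markup are assumptions made up for this example, and the approach only holds for tables that are not nested inside one another.

```python
import re

# Non-greedy match from an opening <table ...> to the nearest </table>.
# DOTALL lets '.' span newlines; IGNORECASE also accepts <TABLE>.
TABLE_RE = re.compile(r"<table\b[^>]*>(.*?)</table\s*>", re.IGNORECASE | re.DOTALL)

def extract_tables(html):
    """Return the inner markup of each (non-nested) table found in html."""
    return TABLE_RE.findall(html)

# Hypothetical sample input, standing in for the downloaded HTML string.
sample = """
<html><body>
<table border="1"><tr><td>a</td><td>b</td></tr></table>
<p>ignored</p>
<TABLE><tr><td>c</td></tr></TABLE>
</body></html>
"""

tables = extract_tables(sample)
for inner in tables:
    print(inner)
```

The same pattern with "tr" or "td" in place of "table" would pull out rows or cells from each extracted table.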

Now, about changing attributes and elements of the HTML string: I've
seen some examples where the HTML is first transformed into an XML string,
the attributes of certain elements are modified, and the result is then
converted back to an HTML string.

Here's a link to start your research with:

http://www.fawcette.com/vsm/2002_03/online/online_eprods/c_wagner_03_18/
 
Jay,

Thanks for this. I looked at it, and found that it was more than I needed.

In the end, I extracted the various <tr>...</tr> lines out of the HTML
stream, and then processed them with the standard Substring() and
IndexOf() methods of the String object.
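
The poster's actual code is not shown, so the following is only a sketch of what an IndexOf/Substring style scan might look like, written in Python with str.find and slicing as stand-ins. The between and all_between helpers and the sample string are hypothetical, and the sketch assumes lowercase tags without attributes that are all properly closed.

```python
def between(text, start_tag, end_tag, pos=0):
    """Return (content, next_pos) for the first start_tag...end_tag pair at or
    after pos, or (None, -1) if no complete pair is found."""
    start = text.find(start_tag, pos)            # comparable to String.IndexOf
    if start == -1:
        return None, -1
    start += len(start_tag)
    end = text.find(end_tag, start)
    if end == -1:
        return None, -1
    return text[start:end], end + len(end_tag)   # comparable to String.Substring

def all_between(text, start_tag, end_tag):
    """Collect every start_tag...end_tag span, scanning left to right."""
    found, pos = [], 0
    while True:
        content, pos = between(text, start_tag, end_tag, pos)
        if content is None:
            return found
        found.append(content)

# Hypothetical sample input: one table with two rows of two cells.
html = "<table><tr><td>1</td><td>2</td></tr><tr><td>3</td><td>4</td></tr></table>"
rows = [all_between(row, "<td>", "</td>") for row in all_between(html, "<tr>", "</tr>")]
print(rows)
```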

Job done.

Best,

Mark
 
Jay said:
Mark,

With regular expressions, you can extract text from all sorts of
different patterns including text in-between table tags.

Not really. You cannot match corresponding opening and closing tags, for
example, because there's no way to express such constructs using regular
expressions (see context-free grammars).

I'd rather use a real parser such as Chris Lovett's SGML parser.

http://www.gotdotnet.com/Community/...mpleGuid=B90FDDCE-E60D-43F8-A5C4-C3BD760564BC
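
To give a feel for the parser-based approach, here is a small sketch using Python's standard html.parser module. It is not the SGML parser linked above, only a stand-in for the general technique. The TableCollector class and the sample markup are assumptions for illustration, and this simple version handles flat tables only, not tables nested inside cells.

```python
from html.parser import HTMLParser

class TableCollector(HTMLParser):
    """Collect cell text as tables -> rows -> cells (flat tables only)."""

    def __init__(self):
        super().__init__()
        self.tables = []     # one list of rows per <table> seen
        self._row = None     # row currently being filled, if any
        self._cell = None    # text fragments of the current cell, if any

    def handle_starttag(self, tag, attrs):
        # The parser lowercases tag names, so <TR> and <tr> look the same here.
        if tag == "table":
            self.tables.append([])
        elif tag == "tr" and self.tables:
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None and self._row is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.tables[-1].append(self._row)
            self._row = None
            self._cell = None

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

# Hypothetical sample input with mixed-case tags and an attribute.
html = ('<table class="x"><TR><TD>Name</TD><td>Qty</td></TR>'
        '<tr><td>Widget</td><td> 3 </td></tr></table>')
collector = TableCollector()
collector.feed(html)
collector.close()
print(collector.tables)
```

Unlike the string-scanning approaches, the parser copes with attributes and tag case without extra work.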

Cheers,
 
Not really. You cannot match corresponding opening and closing tags for
example, because there's no way to express such constructs using regular
expressions (see context-free grammars).

I'm having no problems thus far extracting strings between the following
tags:

<tr>...</tr>
<td>...</td>

I'd rather use a real parser such as Chris Lovett's SGML parser.

http://www.gotdotnet.com/Community/...mpleGuid=B90FDDCE-E60D-43F8-A5C4-C3BD760564BC

Very useful!

Mark
 
Mark said:
I'm having no problems thus far extracting strings between the
following tags:

<tr>...</tr>
<td>...</td>
<p>...</p>

Sure, provided they all match nicely. What about <p><p></p> ;-)
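
The objection can be made concrete with a short Python sketch. The patterns and sample strings are made up for this example; the point is only that a simple non-greedy pattern pairs each opening tag with the nearest closing tag, whether or not the two belong together.

```python
import re

PARA_RE = re.compile(r"<p>(.*?)</p>", re.DOTALL)
TABLE_RE = re.compile(r"<table>(.*?)</table>", re.DOTALL)

# An unclosed <p> followed by a closed one: the match swallows the second <p>.
unbalanced = PARA_RE.findall("<p>first<p>second</p>")
print(unbalanced)   # ['first<p>second']

# A table nested in a cell: the match stops at the inner </table>,
# so the outer table comes back truncated.
nested_html = "<table><tr><td><table><tr><td>inner</td></tr></table></td></tr></table>"
nested = TABLE_RE.findall(nested_html)
print(nested)       # ['<tr><td><table><tr><td>inner</td></tr>']
```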
 