C# and reading websites; parsing HTML

  • Thread starter: Hans Kamp

Hans Kamp

Is it possible to write a function like the following:

string ReadURL(string URL)
{
....
}

The purpose is to read the URL given by the parameter and return a
string containing the HTML code, for example:

string websiteContents;

websiteContents = ReadURL("http://www.microsoft.com");

processHTMLCode(websiteContents);
....

Are there also functions that can parse the HTML code in a given string?

Hans Kamp.
 
Justin Rogers said:
First, it is very simple to get the contents of a URL, but you have to do
some extra work...

// Requires: using System.IO; using System.Net;
string contents = null;
string url = "http://www.microsoft.com";
HttpWebRequest wreq = (HttpWebRequest) WebRequest.Create(url);
// Each using block disposes its object on exit, so no explicit Close() is needed.
using (HttpWebResponse wresp = (HttpWebResponse) wreq.GetResponse()) {
    using (StreamReader sr = new StreamReader(wresp.GetResponseStream())) {
        contents = sr.ReadToEnd();
    }
}

You could easily place that inside of a function call to make it a bit
easier. You also need to be aware of encodings, but for the most part the
StreamReader should handle that for you.
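
As a rough sketch of that suggestion, the snippet could be wrapped into the ReadURL function from the original question. The class name PageReader and the helper DecodeBody are hypothetical, and the fallback to UTF-8 when the server declares no usable charset is an assumption, not something stated in this thread. On newer .NET versions HttpClient is the usual choice; HttpWebRequest is kept here to match the code above.

```csharp
using System;
using System.IO;
using System.Net;
using System.Text;

static class PageReader
{
    // Hypothetical helper: decodes a response body using the charset name the
    // server declared, falling back to UTF-8 (an assumption) when the name is
    // missing or not recognized.
    public static string DecodeBody(Stream body, string charset)
    {
        Encoding encoding = Encoding.UTF8;
        if (!string.IsNullOrEmpty(charset))
        {
            try { encoding = Encoding.GetEncoding(charset); }
            catch (ArgumentException) { /* unknown charset name: keep UTF-8 */ }
        }
        // StreamReader still honors a byte order mark if the stream has one.
        using (StreamReader reader = new StreamReader(body, encoding))
        {
            return reader.ReadToEnd();
        }
    }

    // ReadURL as asked for in the question: fetch the page and return its HTML.
    public static string ReadURL(string url)
    {
        HttpWebRequest request = (HttpWebRequest) WebRequest.Create(url);
        using (HttpWebResponse response = (HttpWebResponse) request.GetResponse())
        {
            return DecodeBody(response.GetResponseStream(), response.CharacterSet);
        }
    }
}
```

With this in place, the call from the question becomes websiteContents = PageReader.ReadURL("http://www.microsoft.com"). Splitting out DecodeBody keeps the encoding decision in one place and lets it be tried on any stream without touching the network.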

Aha, thanks. The use of "using" (other than for importing namespaces at
the beginning of a C# source file) is a bit new to me. If I understand it
correctly, it declares and initializes the part between ( and )
immediately after "using", and tries to execute the statements between {
and }. Exceptions are suppressed, but if one occurs, the initialized
variable is disposed. Is that correct?
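
For reference, a small sketch of what the using statement does. It behaves roughly like a try/finally that calls Dispose() on the way out; it does not suppress exceptions, they still propagate to the caller. The Probe class and the UsingDemo wrapper are hypothetical names, made up only to make the disposal visible.

```csharp
using System;

// Hypothetical resource whose only job is to record that Dispose() ran.
sealed class Probe : IDisposable
{
    public bool Disposed { get; private set; }
    public void Fail() { throw new InvalidOperationException("boom"); }
    public void Dispose() { Disposed = true; }
}

static class UsingDemo
{
    public static string Run()
    {
        Probe probe = new Probe();
        string outcome = "no exception";
        try
        {
            // Roughly equivalent to:
            //   try { probe.Fail(); } finally { probe.Dispose(); }
            using (probe)
            {
                probe.Fail();
            }
        }
        catch (InvalidOperationException)
        {
            // The exception arrived here, so "using" did not swallow it.
            outcome = "exception propagated";
        }
        return outcome + ", disposed=" + probe.Disposed;
    }
}
```

UsingDemo.Run() returns "exception propagated, disposed=True": the object is disposed, and the exception still has to be caught by whoever calls the code.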

Justin Rogers said:
Now for parsing the HTML, you have two options. You can either custom parse
the HTML using regular expressions or you can try to load it into an XML DOM.

Are there URLs that explain that?

Justin Rogers said:
If you know the site is XHTML-compliant, then you won't have any problems
loading it into an XML DOM. Many sites that have converted to ASP.NET
actually return XHTML-compliant code, so good luck with whatever site you
are trying to parse.
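
The two options might look roughly like this. This is only a sketch: the class LinkExtractor, the choice of collecting href values, and the regular expression itself are illustrative assumptions, not something from this thread. The XML DOM variant only works on well-formed XHTML, and a page with a DOCTYPE may make the parser try to fetch the DTD.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;
using System.Xml;

static class LinkExtractor
{
    // Option 1: load the page into an XML DOM. Throws XmlException if the
    // markup is not well-formed XML.
    public static List<string> FromXmlDom(string xhtml)
    {
        XmlDocument doc = new XmlDocument();
        doc.LoadXml(xhtml);
        List<string> hrefs = new List<string>();
        foreach (XmlElement a in doc.GetElementsByTagName("a"))
        {
            if (a.HasAttribute("href")) hrefs.Add(a.GetAttribute("href"));
        }
        return hrefs;
    }

    // Option 2: a regular expression. Tolerates sloppy HTML, but this naive
    // pattern only sees double-quoted href attributes.
    public static List<string> FromRegex(string html)
    {
        Regex pattern = new Regex("<a\\s[^>]*href\\s*=\\s*\"([^\"]*)\"",
                                  RegexOptions.IgnoreCase);
        List<string> hrefs = new List<string>();
        foreach (Match m in pattern.Matches(html))
        {
            hrefs.Add(m.Groups[1].Value);
        }
        return hrefs;
    }
}
```

The DOM version gives a real tree to walk but is strict about the input; the regex version accepts almost anything but knows nothing about the document structure.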

I want to parse a forum site. To be more specific: there is an attempt
to count from 1 to 10,000,000, and from the rate at which messages are
posted to the forum I want to calculate the estimated finishing date.

Hans Kamp.
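
The arithmetic behind such an estimate can be sketched as a linear extrapolation from two observations of the count. The class and method names are hypothetical, and the assumption that the posting rate stays constant is a simplification.

```csharp
using System;

static class CountingEstimate
{
    // Linear extrapolation: assumes the rate measured between the two
    // observations (t1, count1) and (t2, count2) stays the same until
    // the target count is reached.
    public static DateTime EstimateFinish(DateTime t1, long count1,
                                          DateTime t2, long count2,
                                          long target)
    {
        if (t2 <= t1 || count2 <= count1)
        {
            throw new ArgumentException("The second observation must be later and higher than the first.");
        }
        double postsPerSecond = (count2 - count1) / (t2 - t1).TotalSeconds;
        double secondsLeft = (target - count2) / postsPerSecond;
        return t2.AddSeconds(secondsLeft);
    }
}
```

With made-up numbers: if the count went from 1,000 to 2,000 in ten days, the rate is 100 posts per day, so the remaining 9,998,000 posts would take about 99,980 more days. In practice the two counts would come from parsing the forum pages at two different times.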
 
BTW, I discovered that
http://www.3dmirc.com/apron/tutorials/cs/tutorial5/tutorial.htm gives
useful steps on how to parse an XML file. I think this is also useful for
parsing an HTML file, since HTML can be considered an XML application.

Hans Kamp.
 
That works if it is XHTML, anyway; some HTML is not XML-compliant and will
likely cause errors.
The mshtml DOM may be of use otherwise (reference Microsoft.mshtml.dll in
your references dialog).
 
Daniel O'Connell said:
That works if it is XHTML, anyway; some HTML is not XML-compliant and will
likely cause errors.
The mshtml DOM may be of use otherwise (reference Microsoft.mshtml.dll in
your references dialog).

How do you do that with C#Builder?

I have done some programming with TreeViews and XML Documents:

// Requires: using System.Xml; using System.Windows.Forms;
private void showXmlNodeAtTreeNode(XmlNodeList xnl, TreeNode tn)
{
    // walk over all the nodes in this list
    for (int i = 0; i < xnl.Count; i++)
    {
        XmlNode xn = xnl[i]; // take the next node
        XmlNodeType nodeType = xn.NodeType; // determine its type
        if (nodeType == XmlNodeType.Element) // is it an element?
        {
            // add its name to the tree view and keep the new tree node
            TreeNode child = tn.Nodes.Add("Element: " + xn.Name);
            // add the XML child nodes to this new node
            showXmlNodeAtTreeNode(xn.ChildNodes, child);
        }
        else if (nodeType == XmlNodeType.Text) // is it text?
        {
            tn.Nodes.Add("Text: " + xn.InnerText); // yes? then add it to the node
        }
    }
}

private void parseButton_Click(object sender, System.EventArgs e)
{
    XmlDocument xd = new XmlDocument();

    xd.LoadXml(xmlBox.Text); // load the text from a MultiLine EditBox

    xmlView.Nodes.Clear(); // clear the TreeView

    xmlView.Nodes.Add("Start"); // add "Start" at the root of the tree

    // add the XML child nodes to the first TreeView node, using recursion
    showXmlNodeAtTreeNode(xd.ChildNodes, xmlView.Nodes[0]);
}

It could work with HTML, but it is very strict. A small HTML syntax error can
crash the program, because no exceptions are caught.
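
One way to keep a syntax error from crashing the program is to catch the XmlException that LoadXml throws. The following is a minimal sketch; the class SafeXml and the method TryLoad are hypothetical names, and the format of the error message is just one possibility.

```csharp
using System;
using System.Xml;

static class SafeXml
{
    // Returns the parsed document, or null when the text is not well-formed
    // XML. In the failure case, error describes where parsing stopped.
    public static XmlDocument TryLoad(string text, out string error)
    {
        XmlDocument doc = new XmlDocument();
        try
        {
            doc.LoadXml(text);
            error = null;
            return doc;
        }
        catch (XmlException ex)
        {
            error = "Line " + ex.LineNumber + ", position " + ex.LinePosition
                    + ": " + ex.Message;
            return null;
        }
    }
}
```

In parseButton_Click, the call to xd.LoadXml(xmlBox.Text) could be replaced by a call to TryLoad, showing the error text to the user when the result is null. For example, "<p>one<br>two</p>" is rejected because the br element is never closed, while "<p>one<br/>two</p>" loads fine.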

I mention it so that other newbies get an idea of:
- how an XML document is parsed;
- how the XML nodes are read;
- how the treeview nodes are programmed.

Hans Kamp.
 
Daniel O'Connell said:
Some HTML will not work in an XML parser, because elements aren't closed or
attributes aren't handled properly, which will fail in standard XML readers.
Other bits inline.

I have noticed (possibly wrongly) that newer versions of HTML (4.0, I
believe) can have the modifier "strict" at the beginning, and then they
have to follow the XML syntax.

Daniel O'Connell said:
I don't precisely understand what you mean here.

It partly has to do with my own behaviour in newsgroups with a
teaching/learning purpose like this one.

For me there are two ways of finding the answer to a specific
question. I can start a thread and wait for the answers that others give.
Or I can lurk in the older threads, look for the questions, and read the
answers that were posted in reply.

Maybe others have the same attitude. I mean, if others want to know
how to parse XML (although not perfectly at this moment) and how to
add nodes to a TreeView, they can lurk in this thread and learn how
these things have to be programmed.

Hans Kamp.
 
Hans Kamp said:
I have noticed (possibly wrongly) that newer versions of HTML (4.0, I
believe) can have the modifier "strict" at the beginning, and then they
have to follow the XML syntax.

I am not too sure (I am not an HTML expert), but I know SOME HTML will parse
OK. XHTML surely will. The problem is that you can't rely on whatever site
you want to parse necessarily using a version of HTML that works.

Hans Kamp said:
It partly has to do with my own behaviour in newsgroups with a
teaching/learning purpose like this one.

For me there are two ways of finding the answer to a specific
question. I can start a thread and wait for the answers that others give.
Or I can lurk in the older threads, look for the questions, and read the
answers that were posted in reply.

Maybe others have the same attitude. I mean, if others want to know
how to parse XML (although not perfectly at this moment) and how to
add nodes to a TreeView, they can lurk in this thread and learn how
these things have to be programmed.

I prefer to read through the newsgroups myself. Surprisingly, I have only
posted about 3 questions to the groups in the last year, all of which I
ended up answering myself, either through reading back or by discovering my
own bug before anyone else did.

I just didn't quite understand the reasoning behind your post; I do now,
lol.
 
FYI, I recently downloaded the C# Builder personal edition and found that
Borland provides a Borland.mshtml.dll assembly. It's in the root of the C#
Builder install folder, and it should provide what you need from mshtml, if
you choose to use it.
 