Beginner's Question

Jack · Mar 12, 2009

Hi,
If I want to use something like map, pair, etc. in C#,
how do I go about doing that?

I am implementing a program called HTML_Viewer.
Its purpose is to view HTML files on a standalone basis.
I first read a line from the .html file using File.ReadText(),
get the first '<', find the '>' and extract the token.
My question arises because some tags might be nested inside or overlap
with others. How can I ensure that the info I read in is correct? I was
thinking about using indices to record the start tag and close tag
positions, but I am not very sure. Could anyone give some pointers on how
to build such a system?
Thanks
Jack

Pavel Minaev · Mar 12, 2009

Hi,
If I want to use something like map, pair, etc. in C#,
how do I go about doing that?

If you mean std::map and std::pair from the STL, then have a look at the
System.Collections.Generic namespace, and specifically
Dictionary<TKey, TValue>.
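
As a rough illustration of that suggestion, here is a minimal sketch of Dictionary<TKey, TValue> standing in for std::map, with KeyValuePair<TKey, TValue> as the nearest counterpart to std::pair. The class name MapPairDemo, the method SampleAttributes and the sample keys are made up for this example. Note that Dictionary is hash-based and unordered; SortedDictionary<TKey, TValue> is closer to std::map if key ordering matters.

```csharp
using System;
using System.Collections.Generic;

static class MapPairDemo
{
    // Hypothetical helper: builds a small map of attribute names to values.
    public static Dictionary<string, string> SampleAttributes()
    {
        var attributes = new Dictionary<string, string>();
        attributes["href"] = "index.html";   // the indexer adds or overwrites
        attributes["title"] = "Home";
        return attributes;
    }

    static void Main()
    {
        Dictionary<string, string> attributes = SampleAttributes();

        // TryGetValue avoids the KeyNotFoundException that the indexer
        // throws when reading a key that is not present.
        string value;
        if (attributes.TryGetValue("href", out value))
            Console.WriteLine("href = " + value);

        // Enumerating a dictionary yields KeyValuePair<TKey, TValue> items.
        foreach (KeyValuePair<string, string> pair in attributes)
            Console.WriteLine(pair.Key + " -> " + pair.Value);
    }
}
```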
I am implementing a program called HTML_Viewer.
Its purpose is to view HTML files on a standalone basis.
I first read a line from the .html file using File.ReadText(),
get the first '<', find the '>' and extract the token.
My question arises because some tags might be nested inside or overlap
with others. How can I ensure that the info I read in is correct? I was
thinking about using indices to record the start tag and close tag
positions, but I am not very sure. Could anyone give some pointers on how
to build such a system?

Building a proper parser using IndexOf and Substring is almost always
a futile exercise for any moderately complex input, and HTML is
extremely complex. You need to write a proper parser instead. As
usual, Wikipedia is a pretty good starting point; have a look:
http://en.wikipedia.org/wiki/Parsing
For something as complex as HTML, you might have to hand-code a
recursive descent parser (RDP).
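
To give a feel for what a hand-coded recursive descent parser looks like, here is a minimal sketch for a deliberately tiny, strictly nested markup (no attributes, comments, entities or optional end tags). It is not an HTML parser. The grammar and the names Node and TinyMarkupParser are assumptions made for this illustration only.

```csharp
using System;
using System.Collections.Generic;

// An element node has a Name and Children; a text node has only Text.
class Node
{
    public string Name;   // null for text nodes
    public string Text;   // null for element nodes
    public List<Node> Children = new List<Node>();
}

class TinyMarkupParser
{
    private readonly string _src;
    private int _pos;

    public TinyMarkupParser(string src) { _src = src; }

    // document := content, and the whole input must be consumed
    public List<Node> Parse()
    {
        List<Node> nodes = ParseContent();
        if (_pos < _src.Length)
            throw new FormatException("Unexpected close tag at position " + _pos);
        return nodes;
    }

    // content := (text | element)*, stopping at end of input or at "</"
    private List<Node> ParseContent()
    {
        var nodes = new List<Node>();
        while (_pos < _src.Length && !LookingAt("</"))
            nodes.Add(_src[_pos] == '<' ? ParseElement() : ParseText());
        return nodes;
    }

    // element := '<' name '>' content '</' name '>'
    private Node ParseElement()
    {
        Expect("<");
        var node = new Node { Name = ParseName() };
        Expect(">");
        node.Children = ParseContent();   // the recursion handles nesting
        Expect("</");
        string closing = ParseName();
        if (closing != node.Name)
            throw new FormatException("Expected </" + node.Name + "> but found </" + closing + ">");
        Expect(">");
        return node;
    }

    private Node ParseText()
    {
        int start = _pos;
        while (_pos < _src.Length && _src[_pos] != '<') _pos++;
        return new Node { Text = _src.Substring(start, _pos - start) };
    }

    private string ParseName()
    {
        int start = _pos;
        while (_pos < _src.Length && char.IsLetterOrDigit(_src[_pos])) _pos++;
        if (_pos == start) throw new FormatException("Tag name expected at position " + _pos);
        return _src.Substring(start, _pos - start);
    }

    private bool LookingAt(string s)
    {
        return string.CompareOrdinal(_src, _pos, s, 0, s.Length) == 0;
    }

    private void Expect(string s)
    {
        if (!LookingAt(s)) throw new FormatException("Expected '" + s + "' at position " + _pos);
        _pos += s.Length;
    }
}

static class Demo
{
    static void Dump(List<Node> nodes, string indent)
    {
        foreach (Node n in nodes)
        {
            if (n.Name == null) { Console.WriteLine(indent + "\"" + n.Text + "\""); continue; }
            Console.WriteLine(indent + "<" + n.Name + ">");
            Dump(n.Children, indent + "  ");
        }
    }

    static void Main()
    {
        Dump(new TinyMarkupParser("<html><body>Hi <b>there</b></body></html>").Parse(), "");
    }
}
```

Each grammar rule maps to one method, so mismatched tags such as <b><i></b></i> surface as a FormatException instead of being silently mis-read. Real-world HTML needs far more than this, which is the point of the advice that follows.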

On the whole, it is a very complicated task to parse HTML correctly,
especially the "tag soup" kind that's often seen in the wild. It is better
to use an existing parser. If your application is open source, you may want
to consider this one: http://www.codeplex.com/htmlagilitypack (it's
under CC-BY-SA, which is effectively GPL-like). Another option is to
let IE parse the HTML, and then use its DOM API to navigate the tree -
here's one of the many articles on how to do that:
http://www.codeproject.com/KB/IP/parse_html.aspx

Jack · Mar 12, 2009

Thanks both of you for the information.
I'll take your suggestions into account.
Jack

JM · Mar 13, 2009

On the whole, it is a very complicated task to parse HTML correctly

Now there's something I can agree with you about.

On the surface, it looks like it should be simple - and it is. Until you
start to test your assumptions against web page code found in the real
world. Which is astonishingly bad for the most part..

Take it from someone that's written a non-MS based HTML, XHTML,
XML parsing engine that's buried within several widely used applications
and has been pounded on relentlessly over the last 10 years..

It is not a simple task..

John McTaggart

JM · Mar 13, 2009

An alternative approach would be to use a stack data structure (e.g.
Stack<T>) to track the current element being processed (and any associated
information you feel you need to keep track of, such as character position
within the input). That's still basically recursive, just without the
actual method calls (slightly more efficient, but IMHO not worth the
trouble in most cases).

From my own experience, I find a stack of objects (each containing the
complete information for that tag) is much easier to deal with.

push <html>
push <head>
push <title>
on </title>: peek; if the top is not <title> it's a nesting error (pop anyway?), otherwise pop
don't push <meta /> because its end tag is forbidden
push <style>
on </style>: peek; if the top is not <style> it's a nesting error (pop anyway?), otherwise pop

and so on. From there, it's simply a matter of what to do about nesting
issues IF correcting for XHTML compliance.
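
The push/peek steps above could be sketched roughly as follows. This is only an illustration under simplifying assumptions: tags arrive as bare names ("title" for a start tag, "/title" for an end tag), the list of elements with forbidden end tags is partial, and optional end tags are not handled at all. The names NestingChecker and Check are invented for the example.

```csharp
using System;
using System.Collections.Generic;

static class NestingChecker
{
    // A few HTML 4 elements whose end tag is forbidden; not an exhaustive list.
    static readonly HashSet<string> VoidElements =
        new HashSet<string>(StringComparer.OrdinalIgnoreCase)
        { "meta", "br", "img", "hr", "input", "link" };

    // Returns human-readable nesting errors; the list is empty when nesting is clean.
    public static List<string> Check(IEnumerable<string> tags)
    {
        var open = new Stack<string>();
        var errors = new List<string>();

        foreach (string tag in tags)
        {
            if (!tag.StartsWith("/", StringComparison.Ordinal))
            {
                // Start tag: push it unless its end tag is forbidden.
                if (!VoidElements.Contains(tag)) open.Push(tag);
                continue;
            }

            // End tag: it must match the element on top of the stack.
            string name = tag.Substring(1);
            if (open.Count > 0 && string.Equals(open.Peek(), name, StringComparison.OrdinalIgnoreCase))
                open.Pop();
            else
                errors.Add("Unexpected </" + name + ">" +
                    (open.Count > 0 ? " while <" + open.Peek() + "> is open" : " with nothing open"));
        }

        // Anything still on the stack was never closed.
        foreach (string name in open)
            errors.Add("<" + name + "> was never closed");
        return errors;
    }

    static void Main()
    {
        string[] page = { "html", "head", "title", "/title", "meta", "/head", "b", "i", "/b", "/i", "/html" };
        foreach (string error in Check(page))
            Console.WriteLine(error);
    }
}
```

This version leaves the stack untouched on a mismatch, which is one possible answer to the "pop?" question; a repairing parser would more likely search down the stack for the matching start tag.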

Just remember, there are many tags with optional end tags. Most web
authors don't realize that the <html></html> and <head></head> tag
combos are optional, but the <title></title> tag combo is required..

http://www.w3.org/TR/REC-html40/struct/global.html

This stack technique is especially useful for XML structure checks
because each tag has to either be closed with an explicit / or have a
corresponding end tag. And unlike HTML where most browsers just
eat the bad code and do their best to keep going, most XML parsers
basically choke on illegal nesting and missing tags..

YMMV

John McTaggart

Pavel Minaev · Mar 13, 2009

Now there's something I can agree with you about.

On the surface, it looks like it should be simple - and it is. Until you
start to test your assumptions against web page code found in the real
world. Which is astonishingly bad for the most part..

Actually, all it takes is looking at the HTML spec in detail. You'd
think that the syntax is straightforward and general (even considering
that some elements are self-closing, and some opening elements
auto-close other elements). But even parsing the element tags and content
correctly can be a challenge. For example, do you know how to
correctly parse the contents of <script> and <style> elements
(probably a rhetorical question for you in particular, John, but I'll
leave it standing for the benefit of others)? Have a look:
http://www.w3.org/TR/html4/appendix/notes.html#h-B.3.2 - and note that
not every "<" indicates the start of an element in those contexts.
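
As a small illustration of that point, the sketch below locates the end of a <script> or <style> body by looking only for the matching end tag, so "<" characters inside the script are left alone. It follows common tolerant practice (search for "</script" case-insensitively) rather than the stricter rule in the HTML 4 note, which ends the content at the first "</" followed by a name character. The names RawTextScanner and FindEndTag are hypothetical.

```csharp
using System;

static class RawTextScanner
{
    // Returns the index of the "</name" that ends a raw-text element such as
    // <script> or <style>, searching from contentStart, or -1 if none is found.
    public static int FindEndTag(string html, int contentStart, string name)
    {
        string needle = "</" + name;
        int i = contentStart;
        while ((i = html.IndexOf(needle, i, StringComparison.OrdinalIgnoreCase)) >= 0)
        {
            int after = i + needle.Length;
            // Require '>', '/' or whitespace next (or end of input), so that
            // "</scripting" is not mistaken for "</script".
            if (after == html.Length || html[after] == '>' || html[after] == '/' ||
                char.IsWhiteSpace(html[after]))
                return i;
            i = after;   // false alarm, keep searching
        }
        return -1;
    }

    static void Main()
    {
        string html = "<script>if (a < b) document.write('<b>x</b>');</SCRIPT><p>";
        int start = "<script>".Length;
        int end = FindEndTag(html, start, "script");
        Console.WriteLine(html.Substring(start, end - start));
    }
}
```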

Take it from someone that's written a non-MS based HTML, XHTML,
XML parsing engine that's buried within several widely used applications
and has been pounded on relentlessly over the last 10 years..

Speaking of which - is that engine you've worked on available for
licensing as a separate library? I have no need for that now, but my
former employer might still be interested - HTML parsing was pretty
much the biggest issue we had in the product back then, and using IE
to parse, while an option, is very slow.

JM · Mar 16, 2009

On the surface, it looks like it should be simple - and it is. Until you
start to test your assumptions against web page code found in the real
world. Which is astonishingly bad for the most part..

Boy, do I know this one! :-)

I remember tackling it many years ago and got to the point where I simply
checked for the entire tag just to be sure. It was a matter of checking
against overstepping the buffer length and then using a little pointer math
to look forward and check each of the characters.

If they added up to script or /script (in any case combination) we had a
winner!

Take it from someone that's written a non-MS based HTML, XHTML,
XML parsing engine that's buried within several widely used applications
and has been pounded on relentlessly over the last 10 years..

Unfortunately, there are 2 problems. One, I took it off the market a couple
of years ago and two, it's in the form of a VCL component and not a DLL.
The web site, with slightly outdated help files, is still online..

http://www.compnet101.com/atagparser

Although, one of the ways I wanted to learn the intricacies of the C# way of
objects, events and string handling was to rewrite the logic, which I know
from experience (a posteriori if you will ;-)) works well. It's basically a
collection of state machines tuned to the different pieces of a web page..
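
For readers wondering what one such state machine might look like, here is a minimal sketch of a three-state tokenizer that splits markup into text runs and tags. It is not John's engine and makes no claim about how that engine works. It only shows the general shape: the Quoted state exists so that a '>' inside an attribute value does not end the tag. Comments, <script> bodies and entities are ignored, and TagTokenizer and Tokenize are invented names.

```csharp
using System;
using System.Collections.Generic;
using System.Text;

static class TagTokenizer
{
    enum State { Text, Tag, Quoted }

    // Splits markup into tokens: text runs as-is, tags including their angle brackets.
    public static List<string> Tokenize(string input)
    {
        var tokens = new List<string>();
        var current = new StringBuilder();
        State state = State.Text;
        char quote = '\0';

        foreach (char c in input)
        {
            switch (state)
            {
                case State.Text:
                    if (c == '<')
                    {
                        Flush(tokens, current);   // emit any pending text
                        state = State.Tag;
                    }
                    current.Append(c);
                    break;

                case State.Tag:
                    current.Append(c);
                    if (c == '"' || c == '\'') { quote = c; state = State.Quoted; }
                    else if (c == '>') { Flush(tokens, current); state = State.Text; }
                    break;

                case State.Quoted:
                    current.Append(c);
                    if (c == quote) state = State.Tag;   // closing quote returns to the tag
                    break;
            }
        }
        Flush(tokens, current);   // trailing text or an unterminated tag
        return tokens;
    }

    static void Flush(List<string> tokens, StringBuilder current)
    {
        if (current.Length > 0) tokens.Add(current.ToString());
        current.Length = 0;
    }

    static void Main()
    {
        foreach (string token in Tokenize("<a title=\"x > y\">link</a> tail"))
            Console.WriteLine("[" + token + "]");
    }
}
```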

Which was one of the inspirations for me writing my parser. I just didn't
want to have to rely on MS libraries existing on a machine in order to parse
a page.

John McTaggart
