HTML Parser

Guest · Jan 11, 2005

Does Microsoft provide an HTML parser that can find the img/src
attribute and other attributes like it?

If not, are there any third party tools available?

Cor Ligthert · Jan 11, 2005

mshtml
http://msdn.microsoft.com/library/default.asp?url=/workshop/browser/hosting/hosting.asp
I hope this helps a little.

Cor

Guest · Jan 11, 2005

Thanks Cor. That was of great help.

Is a similar facility available in the .NET libraries? Or can we convert
the HTML to XML and then use XmlReader for the same purpose?

-Ocean

Cor Ligthert · Jan 12, 2005

Silent,

The big difference between HTML and XML is that the former has W3C-defined
tags while the latter has user-defined tags (directly or via a schema).

MSHTML can be used directly from .NET when you add a reference to it as
Microsoft.Mshtml.

Use it without a using/Imports statement: because of the endless number of
interfaces, your IDE will probably almost freeze if you import the namespace.

I hope this was the information you were looking for.

Cor

UAError · Jan 12, 2005

Cor Ligthert said:
Silent,

The big difference between HTML and XML is that the former has W3C-defined
tags while the latter has user-defined tags (directly or via a schema).

It's not that big, it's just big enough, but XHTML is trying
to bridge the gap:

XHTML™ 1.0 The Extensible HyperText Markup Language (Second
Edition)
http://www.w3.org/TR/xhtml1/

Under section 4 you can find the main obstacles to treating
HTML 4.0 as an XML document:
- XML documents must be well-formed
- Attribute values must be quoted
etc.

It should be possible to load an XHTML document into an
XmlDocument and then use XPath to select all the image
nodes.
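
As a rough illustration of that approach, here is a minimal sketch in Python rather than .NET, with the standard-library ElementTree standing in for XmlDocument. The sample document and the image_sources helper are made up for the example. One detail worth noting: XHTML elements live in the http://www.w3.org/1999/xhtml namespace, so the path has to use a namespace prefix. In .NET the corresponding step would be passing an XmlNamespaceManager to SelectNodes.

```python
import xml.etree.ElementTree as ET

# XHTML elements are in this namespace; "x" is an arbitrary prefix chosen here.
XHTML_NS = {"x": "http://www.w3.org/1999/xhtml"}

# Hypothetical well-formed XHTML input, used only for illustration.
SAMPLE_XHTML = """<html xmlns="http://www.w3.org/1999/xhtml">
  <head><title>Sample</title></head>
  <body>
    <p><img src="logo.png" alt="Logo" /></p>
    <div><img src="photos/cat.jpg" alt="Cat" /></div>
  </body>
</html>"""


def image_sources(xhtml_text):
    """Return the src attribute of every img element, in document order."""
    root = ET.fromstring(xhtml_text)
    images = root.findall(".//x:img", XHTML_NS)
    return [img.get("src") for img in images if img.get("src")]


print(image_sources(SAMPLE_XHTML))  # ['logo.png', 'photos/cat.jpg']
```

This only works when the input really is well-formed. Ordinary HTML with unclosed tags or unquoted attributes, or named entities such as &nbsp; that the parser does not know, will make the XML parser raise an error.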

'Any fool can write code that a computer can understand.
Good programmers write code that humans can understand.'
Martin Fowler,
'Refactoring: improving the design of existing code', p.15

UAError · Jan 16, 2005

Addendum to my previous post.

There is an Open Source (W3C license) utility "HTML Tidy"
http://www.w3.org/People/Raggett/tidy/

http://tidy.sourceforge.net/

which can generate XHTML from HTML. So it should be possible
to "pre-process" (reasonable) HTML input and then work with
the resulting output as an XML document (and benefit from
all the other XML related functionality).
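
A sketch of that pre-processing step, again in Python for brevity, assuming the tidy command-line program is installed and on the PATH. The helper names html_to_xhtml and image_sources_from_html are invented for this example. The -asxhtml, -numeric and -quiet switches are standard Tidy options; -numeric is used so that entities come out as numeric references, which an XML parser can read without the XHTML DTD.

```python
import shutil
import subprocess
import xml.etree.ElementTree as ET

# -asxhtml: write XHTML; -numeric: numeric instead of named entities;
# -quiet: suppress the summary text Tidy normally prints.
TIDY_ARGS = ["-asxhtml", "-numeric", "-quiet"]
XHTML_NS = {"x": "http://www.w3.org/1999/xhtml"}


def html_to_xhtml(html_text, tidy_exe="tidy"):
    """Pipe html_text through HTML Tidy and return the XHTML from stdout."""
    if shutil.which(tidy_exe) is None:
        raise FileNotFoundError(f"{tidy_exe!r} was not found on PATH")
    result = subprocess.run(
        [tidy_exe, *TIDY_ARGS],
        input=html_text,
        capture_output=True,
        text=True,
    )
    # Tidy exits with 0 when clean, 1 on warnings, 2 on errors.
    if result.returncode >= 2:
        raise ValueError("tidy could not repair the input: " + result.stderr)
    return result.stdout


def image_sources_from_html(html_text):
    """Clean up HTML with Tidy, then read the img/src values as XML."""
    root = ET.fromstring(html_to_xhtml(html_text))
    images = root.findall(".//x:img", XHTML_NS)
    return [img.get("src") for img in images if img.get("src")]
```

Calling image_sources_from_html on a page with unclosed tags and unquoted attributes should then give the same kind of list as working on real XHTML, as long as Tidy is able to repair the markup.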

