HTML Parser

Guest

Does Microsoft provide an HTML parser that can find the img/src
attribute and other attributes like it?

If not, are there any third party tools available?
 
"Cor Ligthert" wrote:
Thanks Cor. That was of great help.

Is a similar facility available in the .NET libraries? Or can we convert
the HTML to XML and then use XmlReader for the same purpose?

-Ocean
 
Silent,

The big difference between HTML and XML is that the former has W3C-defined
tags while the latter has user-defined tags (declared directly or through a
schema).

MSHTML can be used directly from .NET once you add a reference to it in your
project as Microsoft.Mshtml.

Use it without a Using/Imports statement. The library exposes so many
interfaces that your IDE will probably come close to freezing otherwise.

I hope this was the information you were looking for.

Cor
 
Cor Ligthert said:
Silent,

The big difference between HTML and XML is that the former has W3C-defined
tags while the latter has user-defined tags (declared directly or through a
schema).

It's not that big, it's just big enough, but XHTML is trying
to bridge the gap:

XHTML™ 1.0 The Extensible HyperText Markup Language (Second
Edition)
http://www.w3.org/TR/xhtml1/

Under section 4 you can find the main obstacles for treating
HTML 4.0 as an XML document:
- XML documents must be well formed
- Attribute values must be quoted.
etc.
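
As a small illustration of those two obstacles, here is a sketch in
Python (used only as a stand-in, since the thread itself is about .NET).
It feeds an HTML 4 style fragment and its XHTML style equivalent to a
strict XML parser. The sample markup and the helper name is_well_formed
are made up for this example.

```python
import xml.etree.ElementTree as ET


def is_well_formed(markup: str) -> bool:
    """Return True if a strict XML parser accepts the markup."""
    try:
        ET.fromstring(markup)
        return True
    except ET.ParseError:
        return False


# HTML 4 style: unquoted attribute values, <img> and <br> never closed.
html4 = "<p><img src=logo.gif alt=Logo><br>Hello</p>"

# XHTML style: quoted attribute values, empty elements explicitly closed.
xhtml = '<p><img src="logo.gif" alt="Logo" /><br />Hello</p>'

print(is_well_formed(html4))  # False
print(is_well_formed(xhtml))  # True
```

Any XML parser, including the .NET ones, is required to reject the first
fragment, which is why ordinary HTML usually needs cleaning up first.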

It should be possible to load an XHTML document into an
XmlDocument and then use XPath to select all the image
nodes.
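
A rough sketch of that idea, again in Python rather than .NET: the
standard library's ElementTree stands in for XmlDocument, and its limited
XPath support is enough to pick out the img elements. The sample document
and the function name image_sources are assumptions for illustration. In
.NET the analogous calls would be XmlDocument.LoadXml and SelectNodes with
an XmlNamespaceManager, because XHTML elements live in the XHTML namespace.

```python
import xml.etree.ElementTree as ET

# XHTML elements belong to this namespace, so the XPath needs a prefix for it.
XHTML_NS = {"x": "http://www.w3.org/1999/xhtml"}

# Hypothetical, already well-formed XHTML input.
sample = """<html xmlns="http://www.w3.org/1999/xhtml">
  <head><title>Demo</title></head>
  <body>
    <p><img src="a.png" alt="A" /></p>
    <div><img src="img/b.jpg" alt="B" /></div>
  </body>
</html>"""


def image_sources(xhtml_text):
    """Return the src attribute of every img element, in document order."""
    root = ET.fromstring(xhtml_text)
    return [img.get("src") for img in root.iterfind(".//x:img", XHTML_NS)]


print(image_sources(sample))  # ['a.png', 'img/b.jpg']
```

Note that a document using named entities such as &nbsp; will not parse
this way unless the parser can resolve them, so real pages may need more
preparation than this sample does.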


'Any fool can write code that a computer can understand.
Good programmers write code that humans can understand.'
Martin Fowler,
'Refactoring: improving the design of existing code', p.15
 
Addendum to my previous post.

There is an Open Source (W3C license) utility "HTML Tidy"
http://www.w3.org/People/Raggett/tidy/

http://tidy.sourceforge.net/

which can generate XHTML from HTML. So it should be possible
to "pre-process" (reasonable) HTML input and then work with
the resulting output as an XML document (and benefit from
all the other XML related functionality).
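
A minimal sketch of such a pre-processing step, assuming the tidy
command-line program is installed and on the PATH. It is written in Python
only as an illustration; the function names are hypothetical, and the same
approach (run tidy, read its standard output) should carry over to
System.Diagnostics.Process in .NET.

```python
import shutil
import subprocess


def tidy_command():
    """Build the tidy command line used for the conversion."""
    # -asxhtml: write the output as XHTML
    # -numeric: emit numeric rather than named entities, which suits XML parsers
    # -quiet:   suppress non-essential output
    # -utf8:    treat input and output as UTF-8
    return ["tidy", "-asxhtml", "-numeric", "-quiet", "-utf8"]


def html_to_xhtml(html_text):
    """Pipe HTML through tidy and return the XHTML it writes to stdout."""
    if shutil.which("tidy") is None:
        raise RuntimeError("tidy executable not found on PATH")
    result = subprocess.run(
        tidy_command(),
        input=html_text,
        capture_output=True,
        text=True,
        encoding="utf-8",
    )
    # Tidy exits with 1 when it only issued warnings and 2 when it hit errors,
    # so a non-zero status is not automatically a failure.
    if result.returncode >= 2:
        raise ValueError(result.stderr)
    return result.stdout
```

The string returned by html_to_xhtml("<p><img src=a.gif>") could then be
handed to an XML parser and queried with XPath like any other XML document.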


 