Regular expressions and HTML

  • Thread starter: Kelsang Wangchuk

Kelsang Wangchuk

Hi

I'm working on a method that will transform an HTML fragment into an
XHTML fragment. I thought I'd have a go at using regular expressions
in .NET, because I don't really understand the numerous regular
expression classes in .NET and I thought it might be a good
opportunity to learn.

I want to break down the HTML into element names, attribute names,
attribute values and element content. The regular expression I came up
with, that successfully matches fragments such as <div contenteditable
id="test">, looks like this:

^(?<precedingText>[^<>]*)((((<(?<elementName>[a-zA-Z_][a-zA-Z0-9_]*)\s*((?<=\s)(?<attributeName>[a-zA-Z_][a-zA-Z0-9_]*)((\s*=\s*""(?<attributeValue>[^""]*)""\s*)|(\s*=(?<attributeValue>[^\s"">]+)\s*)|(\s*(?!\s*=))))*\s*>))|(</(?<closingElementName>[a-zA-Z_][a-zA-Z0-9_]*)\s*>))(?<text>[^<>]*))*$

A bit complicated, I know, but it works. However, now I don't know how
to extract the various elementName, attributeName etc. group values in
order to match them up. It all seems a bit wild to me.
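
For reference, a minimal sketch of how repeated named groups can be read
back in .NET: each Group keeps every capture it made in its Captures
collection, and each Capture has an Index into the input, so names and
values can be paired up by position. The pattern here is a simplified,
hypothetical one-tag version rather than the full expression above, and
CaptureSketch and ParseTag are illustrative names only.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class CaptureSketch
{
    // Simplified pattern for ONE opening tag (an assumption for this sketch,
    // not the full expression from the post). "" is a quote in a verbatim string.
    private static readonly Regex TagPattern = new Regex(
        @"^<(?<elementName>[a-zA-Z_][a-zA-Z0-9_]*)" +
        @"(?:\s+(?<attributeName>[a-zA-Z_][a-zA-Z0-9_]*)" +
        @"(?:\s*=\s*(?:""(?<attributeValue>[^""]*)""|(?<attributeValue>[^\s"">]+)))?)*" +
        @"\s*>$");

    // Returns the attributes in document order; Value is null when the
    // attribute has no value (e.g. contenteditable).
    public static List<KeyValuePair<string, string>> ParseTag(string tag, out string elementName)
    {
        var attributes = new List<KeyValuePair<string, string>>();
        elementName = null;

        Match m = TagPattern.Match(tag);
        if (!m.Success) return attributes;

        elementName = m.Groups["elementName"].Value;

        // A repeated group exposes only its LAST capture through Group.Value;
        // the full history is in Group.Captures.
        CaptureCollection names = m.Groups["attributeName"].Captures;
        CaptureCollection values = m.Groups["attributeValue"].Captures;

        int v = 0; // index of the next value capture not yet assigned to a name
        for (int i = 0; i < names.Count; i++)
        {
            // A value belongs to this name if it starts before the next name does.
            int nextNameStart = (i + 1 < names.Count) ? names[i + 1].Index : tag.Length;
            string value = null;
            if (v < values.Count && values[v].Index < nextNameStart)
            {
                value = values[v].Value;
                v++;
            }
            attributes.Add(new KeyValuePair<string, string>(names[i].Value, value));
        }
        return attributes;
    }

    public static void Main()
    {
        string name;
        var attributes = ParseTag("<div contenteditable id=\"test\">", out name);
        Console.WriteLine("element: " + name);
        foreach (var a in attributes)
        {
            Console.WriteLine(a.Value == null ? a.Key + " (no value)" : a.Key + " = " + a.Value);
        }
    }
}
```

Pairing by Index is needed because an attribute without a value adds a
capture to attributeName but none to attributeValue, so the two
collections cannot simply be zipped together.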

Can anyone tell me the best way to construct a regular expression in
order to produce the desired result? I suspect using constructs such
as (?<elementName>....)* doesn't work so well because of the repeat,
but how else could it be done?

Cheers, Wangchuk
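
One possible answer to the "how else" question, sketched under some
assumptions: instead of one anchored expression with nested repeats,
match one token (opening tag, closing tag or text run) per Match with
Regex.Matches, then run a second small expression over each tag's
attribute text. Every match then holds at most one name and one value,
so nothing needs matching up afterwards. The class, method and group
names are hypothetical, and the patterns are deliberately simplified
(for example, a ">" inside a quoted value is not handled).

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class HtmlTokenizerSketch
{
    // One alternative per token kind; each match is exactly one token.
    private static readonly Regex Token = new Regex(
        @"</(?<close>[a-zA-Z_][a-zA-Z0-9_]*)\s*>" +
        @"|<(?<open>[a-zA-Z_][a-zA-Z0-9_]*)(?<attributes>[^<>]*)>" +
        @"|(?<text>[^<>]+)");

    // One attribute per match: a name, optionally followed by a quoted
    // or unquoted value.
    private static readonly Regex Attribute = new Regex(
        @"(?<name>[a-zA-Z_][a-zA-Z0-9_]*)" +
        @"(?:\s*=\s*(?:""(?<value>[^""]*)""|(?<value>[^\s"">]+)))?");

    // Returns a flat, human-readable token list in document order.
    public static List<string> Tokenize(string html)
    {
        var tokens = new List<string>();
        foreach (Match m in Token.Matches(html))
        {
            if (m.Groups["close"].Success)
            {
                tokens.Add("close " + m.Groups["close"].Value);
            }
            else if (m.Groups["open"].Success)
            {
                tokens.Add("open " + m.Groups["open"].Value);
                foreach (Match a in Attribute.Matches(m.Groups["attributes"].Value))
                {
                    Group value = a.Groups["value"];
                    tokens.Add(value.Success
                        ? "attr " + a.Groups["name"].Value + "=" + value.Value
                        : "attr " + a.Groups["name"].Value);
                }
            }
            else
            {
                tokens.Add("text " + m.Groups["text"].Value);
            }
        }
        return tokens;
    }

    public static void Main()
    {
        foreach (string token in Tokenize("<div contenteditable id=\"test\">Hello</div>"))
        {
            Console.WriteLine(token);
        }
    }
}
```

Note that Regex.Matches silently skips characters that match none of
the alternatives (a stray "<", say), so this sketch does not validate
the input the way the anchored ^...$ expression does.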
 
There was an SGML parser on MSDN, but I can't find the link anymore... But
that's what you could use; it's a much easier and cleaner solution to your
problem...

Jerry

 
Ok, great, thanks very much. I'll have a look.

KW

Jerry III said:
Found the link... The whole article is at
http://msdn.microsoft.com/library/en-us/dnxmlnet/html/XMLToolsUpdate.asp and
the SGML reader can be downloaded at
http://www.gotdotnet.com/Community/...mpleGuid=B90FDDCE-E60D-43F8-A5C4-C3BD760564BC.
It will allow you to read an SGML document (such as HTML) into an
XmlDocument which is pretty much what you want, isn't it?

Jerry
