Kelsang Wangchuk
Hi
I'm working on a method that will transform an HTML fragment into an
XHTML fragment. I thought I'd have a go at using regular expressions
in .NET, because I don't really understand the numerous regular
expression classes in .NET and I thought it might be a good
opportunity to learn.
I want to break down the HTML into element names, attribute names,
attribute values and element content. The regular expression I came up
with, that successfully matches fragments such as <div contenteditable
id="test">, looks like this:
^(?<precedingText>[^<>]*)((((<(?<elementName>[a-zA-Z_][a-zA-Z0-9_]*)\s*((?<=\s)(?<attributeName>[a-zA-Z_][a-zA-Z0-9_]*)((\s*=\s*""(?<attributeValue>[^""]*)""\s*)|(\s*=(?<attributeValue>[^\s"">]+)\s*)|(\s*(?!\s*=))))*\s*>))|(</(?<closingElementName>[a-zA-Z_][a-zA-Z0-9_]*)\s*>))(?<text>[^<>]*))*$
A bit complicated, I know, but it works. However, now I don't know how
to extract the various elementName, attributeName etc. group values in
order to match them up. It all seems a bit wild to me.
Can anyone tell me the best way to construct a regular expression in
order to produce the desired result? I suspect using constructs such
as (?<elementName>....)* doesn't work so well because of the repeat,
but how else could it be done?
Cheers, Wangchuk
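A possible direction, offered only as a rough sketch: in .NET every repetition of a named group is kept in match.Groups["name"].Captures, so the repeated groups are not lost. Pairing names with values by position in that collection is awkward, though, because an attribute with no value (such as contenteditable) adds a name capture but no value capture. The sketch below sidesteps this by matching one opening tag at a time and then running a second, smaller expression over that tag's attribute text. The names TagTokenizer, Describe, OpenTag, Attribute and the "attributes" group are made up for illustration, and the patterns are simplified versions of the one in the post (for example, they assume no ">" appears inside a quoted value).

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

// Hypothetical helper, not part of .NET: a two-pass tokenizer sketch.
public static class TagTokenizer
{
    // Pass 1: matches ONE opening tag and captures its whole attribute list
    // as a single string. Assumption: no '<' or '>' inside attribute values.
    private static readonly Regex OpenTag = new Regex(
        @"<(?<elementName>[a-zA-Z_][a-zA-Z0-9_]*)(?<attributes>[^<>]*)>");

    // Pass 2: matches ONE attribute: a name, optionally followed by
    // ="quoted value" or =unquotedValue. Both alternatives reuse the
    // same group name, as the original expression does.
    private static readonly Regex Attribute = new Regex(
        @"(?<attributeName>[a-zA-Z_][a-zA-Z0-9_]*)" +
        @"(\s*=\s*(""(?<attributeValue>[^""]*)""|(?<attributeValue>[^\s"">]+)))?");

    // Returns one line per element and one indented line per attribute,
    // so each value stays attached to the name it was matched with.
    public static List<string> Describe(string html)
    {
        var lines = new List<string>();
        foreach (Match tag in OpenTag.Matches(html))
        {
            lines.Add("element: " + tag.Groups["elementName"].Value);

            string attributeText = tag.Groups["attributes"].Value;
            foreach (Match attr in Attribute.Matches(attributeText))
            {
                // Success is false when the optional value part did not match.
                Group value = attr.Groups["attributeValue"];
                string shown = value.Success ? value.Value : "(no value)";
                lines.Add("  " + attr.Groups["attributeName"].Value + " = " + shown);
            }
        }
        return lines;
    }
}
```

Calling TagTokenizer.Describe on the fragment <div contenteditable id="test"> should give three lines: the element name div, then contenteditable with no value, then id with the value test. Because each attribute is its own Match, there is nothing to line up afterwards, which is the main reason for splitting the work into two expressions rather than one large repeated one.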