Passiday
Hi,
I am trying here to make sense of two regex constructs:
1. Non-backtracking subexpression: (?> subexpression)
The official description ("The subexpression is fully matched once, and
then does not participate piecemeal in backtracking.") does not make
sense to me. It's hard to understand without an example that shows
an expression that does match with a regular group but does not match
with a non-backtracking group. For example, I hoped to see the regex
"^.*?(?>x+)g.*$" fail to match the string "abcxxxdefxxxxghi" (i.e., the
"(?>x+)" would eat up the first "xxx", and when it could not match the
following "g", the regex engine would not give up the already consumed
"xxx" for backtracking purposes). I would be happy to understand where
I am mistaken.
2. Balancing group: (?<name1-name2> subexpression)
I am looking for a stable way to extract outer html from text that is
retrieved from a web page. I was planning to load the text into an XML
parser and then select the needed nodes using XPath, but I hate to
think about the need to tidy up the text so that it really is valid
for XML parsing. Also, although I would have a matching XML
element, reading the raw original html under that matching XML
node would be impossible, because the XML property is built up from the
parsed object. So I hoped that it is possible to use regexes, and that
balancing groups could help select the full outer html
of the matching element (it does have a lot of inner elements). So
I'd be grateful for an example of how to do it.
The need: extract the full outer html of elements like "<elementName1|
elementName2 class="className1|className2" .. some other
attributes ..>". The matching elements would not contain other
matching elements in their body, but they do contain other html markup
that makes the regex nontrivial. I speculate that the correct regex
would capture the elementName in a group that is backreferenced at the
end of the regex (in order to match the correct </elementName>), and
in between there would be some construction of balancing groups that
makes sure that there is valid html between the outer tags,
including the loose use of <br>.
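
As a rough illustration of the counting that a balancing group would do
(push on an opening tag, pop on a closing tag), here is a sketch in
Python. Python's re module has no balancing groups, so the depth is
tracked in an ordinary variable instead; this is a swapped-in technique,
not the (?<name1-name2> ...) construct itself. The function outer_html
and the tag and class names in the usage lines are hypothetical. It
assumes the class attribute holds exactly one of the wanted names in
double quotes, and it only counts tags with the same name as the outer
element, so unclosed tags such as <br> do not disturb it.

```python
import re


def outer_html(html, tag_names, class_names):
    """Return raw outer-html slices of elements whose tag and class match."""
    tags = "|".join(map(re.escape, tag_names))
    classes = "|".join(map(re.escape, class_names))
    opener = re.compile(
        rf'<({tags})\b[^>]*\bclass="(?:{classes})"[^>]*>', re.IGNORECASE)
    results = []
    pos = 0
    while True:
        start = opener.search(html, pos)
        if start is None:
            break
        # Match opening and closing tags with the same name as the opener.
        name = re.escape(start.group(1))
        same_tag = re.compile(rf"<(/?){name}\b[^>]*>", re.IGNORECASE)
        depth = 1  # the opener itself is the first "push"
        end = None
        for m in same_tag.finditer(html, start.end()):
            depth += -1 if m.group(1) else 1  # "/" present means a close
            if depth == 0:
                end = m.end()
                break
        if end is None:
            pos = start.end()  # never balanced: skip this opening tag
            continue
        results.append(html[start.start():end])
        pos = end
    return results


sample = ('<p>x</p><div class="c1" id="a">one<br><div>inner</div>'
          '<span>s</span></div><span class="c2">two<br></span>'
          '<div class="other">no</div>')
for chunk in outer_html(sample, ["div", "span"], ["c1", "c2"]):
    print(chunk)
```

The slices come straight from the original text, so the raw html is
preserved. Self-closing tags written as <div/>, tags inside comments or
scripts, and differently quoted attributes are not handled.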
Thanks,
Passiday