String Manipulation Question - Can RegEx Do This?

  • Thread starter: Franklin

Franklin

Using .NET 3.5...

My understanding is that RegEx is powerful enough to solve most of the
world's problems...so I'm optimistic about this scenario:

I need to automate a not-so-straightforward search-and-replace operation on
strings that contain HTML markup fragments.

I need to take a string like this:
<td align="center"><asp:PlaceHolder ID="PlaceHolder75"
runat="server"></asp:PlaceHolder></td>

and make it into this:
<td align="center" ID="PlaceHolder75"></td>

Important points are these:

1. The <asp:PlaceHolder... /> is removed entirely, with nothing
inserted in its place.

2. The <td> located immediately before the <asp:PlaceHolder /> gets the
"ID=" value of the [removed] <asp:PlaceHolder />.
None of the <td> tags already has an ID attribute (this fact should
simplify the operation).

3. A given input string may have multiple <asp:PlaceHolder /> controls - all
of which need to be removed, with the ID attribute of each being inserted
into the <TD> immediately preceding the [removed] <asp:PlaceHolder />

So, from those of you with significant regex experience, can regex do this?
Any pointers are greatly appreciated. Sample code would be awesome, as
learning regex is a huge task that I've started, but I still have a long way
to go.

- F
 
Peter Duniho said:
Using .NET 3.5...

My understanding is that RegEx is powerful enough to solve most of the
world's problems...so I'm optimistic about this scenario:

As powerful as RegEx is, the world's problems are almost uniformly so
difficult as to preclude any programming technique from being able to
solve them.

3. A given input string may have multiple <asp:PlaceHolder /> controls - all
of which need to be removed, with the ID attribute of each being inserted
into the <TD> immediately preceding the [removed] <asp:PlaceHolder />

You should be more specific about how you intend for multiple
"PlaceHolder" IDs to be added to the <td> element. Are these to be
combined into a single string for one ID attribute? If so, how is the
string formatted? If not, how?
So, from those of you with significant regex experience, can regex do this?
Any pointers are greatly appreciated. Sample code would be awesome, as
learning regex is a huge task that I've started, but yet have a long way to
go.

I'm not a RegEx expert, so don't have an answer off the top of my head. I
do know that RegEx supports grouping, repetitive patterns, and retrieving
match groups and using them in the replacement pattern, so I'd agree that
what you're trying to do could probably be done with RegEx.
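
As a rough illustration of the grouping point above, a single Regex.Replace call with one capture group might look like the sketch below. The class and method names are made up for the example, and the pattern assumes the input looks exactly like the sample in the original post (ID is the first attribute, double quotes, empty PlaceHolder body).

```csharp
using System;
using System.Text.RegularExpressions;

static class GroupReplaceSketch
{
    // Hypothetical helper, not from the thread: copies each PlaceHolder ID
    // onto the tag that closes immediately before it, then drops the PlaceHolder.
    public static string MoveIds(string html)
    {
        // Group 1 captures the ID value. [^>]* skips the remaining attributes
        // (including line breaks) without running past the end of the tag.
        const string pattern =
            @"><asp:PlaceHolder\s+ID=""([^""]*)""[^>]*></asp:PlaceHolder>";

        // $1 in the replacement pattern inserts whatever group 1 captured.
        return Regex.Replace(html, pattern, " ID=\"$1\">", RegexOptions.IgnoreCase);
    }

    static void Main()
    {
        string input = "<td align=\"center\"><asp:PlaceHolder ID=\"PlaceHolder75\"\n" +
                       "runat=\"server\"></asp:PlaceHolder></td>";
        Console.WriteLine(MoveIds(input));
        // Prints: <td align="center" ID="PlaceHolder75"></td>
    }
}
```

Because the attribute part is matched with [^>]* rather than .*, each match stays inside a single PlaceHolder tag, so an input with several PlaceHolders is handled one tag at a time.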

But are you sure that's the best way? You are dealing with XML structure
here, and it seems like it might be better to represent the solution as
something that deals with XML structure. For example, just use the
classes in System.Xml.Linq to manipulate your document tree.
Alternatively, you could create an XSLT transform and transform the
document that way (System.Xml.Xsl).
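
A minimal sketch of the System.Xml.Linq route, under some loud assumptions: the fragment is well-formed XML apart from the undeclared asp: prefix, it contains no HTML-only entities such as &nbsp;, and the <td> to receive the ID is the PlaceHolder's parent element. The wrapper element, the namespace URI, and all names here are invented for the example.

```csharp
using System;
using System.Linq;
using System.Xml.Linq;

static class XmlFragmentSketch
{
    // Assumed namespace URI; any URI works as long as the asp prefix is declared.
    static readonly XNamespace Asp = "urn:example:asp";

    public static string MoveIds(string fragment)
    {
        // Wrap the fragment in a throwaway root so it parses as one document
        // and the asp: prefix resolves to a namespace.
        XElement root = XElement.Parse(
            "<root xmlns:asp=\"" + Asp.NamespaceName + "\">" + fragment + "</root>");

        // ToList() snapshots the matches so elements can be removed while looping.
        foreach (XElement placeHolder in root.Descendants(Asp + "PlaceHolder").ToList())
        {
            XAttribute id = placeHolder.Attribute("ID");
            if (id != null && placeHolder.Parent != null)
            {
                placeHolder.Parent.SetAttributeValue("ID", id.Value);
            }
            placeHolder.Remove();
        }

        // Serialize only the children, leaving the throwaway root behind.
        return string.Concat(
            root.Nodes().Select(n => n.ToString(SaveOptions.DisableFormatting)).ToArray());
    }

    static void Main()
    {
        string input = "<td align=\"center\"><asp:PlaceHolder ID=\"PlaceHolder75\" " +
                       "runat=\"server\"></asp:PlaceHolder></td>";
        Console.WriteLine(MoveIds(input));
    }
}
```

The serialized output may not be character-for-character identical to the hand-written target (for example, an emptied element may be written as a self-closing tag), so treat this as a starting point rather than a drop-in replacement.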

I'm dealing with xhtml fragments, so it might be difficult to do this with
techniques that require an entire or well-formed xml document.

Meanwhile, I'm cobbling something together with RegEx... I'll post it when
completed (then hopefully get some good feedback on improving it).

- F
 
For the specific task that you've outlined, it probably can be done.
However, the result will most likely be a hack anyway, and here's why.

Given that it's HTML/ASP you're essentially parsing, to do it
_properly_, you have to handle all valid cases, unless you can somehow
guarantee that your input is _precisely_ as you've described, and not
just its semantic equivalent - and usually it's pretty damn hard to do,
esp. if it is external input! For example, you'd probably need to
support single quotes alongside double ones, case-insensitivity,
arbitrary whitespace, possibility of additional attributes alongside
"align", possibility of character entities in attribute values (e.g.
<td align="&#67;enter">) - and hey, while we're at it, consider also
custom named entities and external DTDs!

If you are parsing HTML that you did not yourself produce, then most
likely you cannot truly guarantee any of the above (at best, you can
convince yourself that "no-one would do things in such a weird way").
If you are producing it yourself, then you still get a very non-
obvious and brittle coupling - later on you add class="foo" to those
TDs, forgetting about your regex code, and it all breaks - worse yet,
it breaks silently, because Regex.Replace won't complain if it doesn't
find anything to replace.
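
One way to make that failure loud instead of silent is to check for a match before replacing. The sketch below is only an illustration: ReplaceOrThrow is a hypothetical helper, not a framework method, and throwing may or may not be the right reaction in a given application.

```csharp
using System;
using System.Text.RegularExpressions;

static class GuardedReplaceSketch
{
    // Hypothetical helper: behaves like Regex.Replace, but throws when the
    // pattern matches nothing instead of quietly returning the input unchanged.
    public static string ReplaceOrThrow(string input, string pattern, string replacement)
    {
        Regex regex = new Regex(pattern, RegexOptions.IgnoreCase);
        if (!regex.IsMatch(input))
        {
            throw new InvalidOperationException("Pattern matched nothing: " + pattern);
        }
        return regex.Replace(input, replacement);
    }

    static void Main()
    {
        // class="foo" was added after the pattern was written, so it no longer matches.
        string html = "<td align=\"center\" class=\"foo\"><b>x</b></td>";
        try
        {
            ReplaceOrThrow(html, "<td align=\"center\">", "<td>");
        }
        catch (InvalidOperationException ex)
        {
            Console.WriteLine(ex.Message);
        }
    }
}
```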

I've done quite a bit of regex hacking on my own in the past, and some
of it was specifically for HTML parsing, where HTML was internal
input. It was the area of the product which generated the most bugs
for us post-release, and, after some struggling, and regexes growing
more and more messy and complicated and unreadable (and, as we
inevitably kept finding out, still incorrect in some corner cases!),
we scrapped the whole thing entirely and just wrote a proper parser.
 
<snip>

I do have complete control over the inputs.

The only meaningful variation between what I stated and posted in the OP and
what I'll have to deal with in the application is that the ID value of the
PlaceHolder controls will change/be unique. Note that there will possibly be
multiple PlaceHolders declared within any given input.

All I really need to do is exactly what's stated in the OP and presented in
the sample strings in the OP. I only need to search for
"<asp:PlaceHolder..." tags, take the ID of each PlaceHolder found, and
stick it into the preceding <TD>.

I'm close to what I need using .NET's RegEx Replace:
string result = Regex.Replace(text1,
    "><asp:PlaceHolder.*</asp:PlaceHolder>",
    new MatchEvaluator(GetRevisedTdTag), RegexOptions.IgnoreCase);

The only problem I'm having is that in the above regex match, the .*
part is causing it to match the start of the first ><asp:PlaceHolder
instance and the close of the very last </asp:PlaceHolder> found in an input
that has multiple PlaceHolders defined within it.

I suspect that for somebody with substantial regex knowledge, it would be
trivial to cause it to match each ><asp:PlaceHolder... individually. Can you
help with that part?

Thanks.
 
Franklin said:
<snip>

I do have complete control over the inputs.

The only meaningful variation between what I stated and posted in the OP and
what I'll have to deal with in the application is that the ID value of the
PlaceHolder controls will change/be unique. Note that there will possibly be
multiple PlaceHolders declared within any given input. -snip-
The only problem I'm having is that in the above regex match, the .*
part is causing it to match the start of the first ><asp:PlaceHolder
instance and the close of the very last </asp:PlaceHolder> found in an input
that has multiple PlaceHolders defined within it.

I suspect that for somebody with substantial regex knowledge, it would be
trivial to cause it to match each ><asp:PlaceHolder... individually. Can you
help with that part?

Don't have it handy to test the full expression at the moment, but what
you're wanting is the non-greedy version of .* which is .*?

Normally * takes as many characters as it can while still letting the whole
expression match; adding the ? makes it take as few as possible, so it stops
as soon as the rest of the expression can match.

For example, given the string "small world, a really, really tiny world", the
expression "small.*world" will match the entire string, but
"small.*?world" will match just the first two words.
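
Untested against the real inputs, but here is a small sketch of both points: the greedy and non-greedy forms on the sample sentence, and the pattern from the earlier post with .* changed to .*?. The body of GetRevisedTdTag is a guess at what that evaluator does (the thread never shows it), and RegexOptions.Singleline is an added assumption so that . can cross the line break inside the tag.

```csharp
using System;
using System.Text.RegularExpressions;

static class LazyQuantifierSketch
{
    // Hypothetical body for the GetRevisedTdTag evaluator named in the thread:
    // pulls the ID out of one matched PlaceHolder and re-closes the <td> tag.
    static string GetRevisedTdTag(Match m)
    {
        Match id = Regex.Match(m.Value, "ID=\"([^\"]*)\"", RegexOptions.IgnoreCase);
        return id.Success ? " ID=\"" + id.Groups[1].Value + "\">" : ">";
    }

    public static string Rewrite(string text)
    {
        // .*? stops at the first closing tag instead of the last one.
        return Regex.Replace(text,
            "><asp:PlaceHolder.*?</asp:PlaceHolder>",
            new MatchEvaluator(GetRevisedTdTag),
            RegexOptions.IgnoreCase | RegexOptions.Singleline);
    }

    static void Main()
    {
        string s = "small world, a really, really tiny world";
        Console.WriteLine(Regex.Match(s, "small.*world").Value);  // the whole string
        Console.WriteLine(Regex.Match(s, "small.*?world").Value); // small world

        string html =
            "<td><asp:PlaceHolder ID=\"A\" runat=\"server\"></asp:PlaceHolder></td>" +
            "<td><asp:PlaceHolder ID=\"B\" runat=\"server\"></asp:PlaceHolder></td>";
        Console.WriteLine(Rewrite(html));
        // Prints: <td ID="A"></td><td ID="B"></td>
    }
}
```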
 
Don't have it handy to test the full expression at the moment, but what
you're wanting is the non-greedy version of .* which is .*?


That's exactly what I needed. Thanks!!!!


-F
 