Regex greedy/lazy problem


Guest

I have a scenario where a string is sent in chunks to my app. I need to be
able to identify certain tags in this partial string as it arrives.
For example:
<DALFile>xxxxxxxxx</DALFile>

I need to be able to have a regex that will capture the start, middle and
end of this file based on the tags. The problem is that the end tag may not
always be present, and the content (xxxxxxx in this case) may contain
carriage returns.

My attempt so far would be along the lines:
(?<startTag><DALFile>)(?<content>.*)(?<endTag></DALFile>)?

This will work but will return the </DALFile> in the <content> group.
Making the .* lazy (i.e. .*?) will work but only if the end tag is present,
which may not always be the case as the string is chunked.

The following also works but if there's a carriage return in xxxxxx it does
not return all the content:
(?<startTag><DALFile>)(?<content>[^</DALFile>]*)(?<endTag></DALFile>)?
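
A note on the pattern above: [^</DALFile>] is a negated character class, not a negated string. It matches any single character other than <, /, D, A, L, F, i, l, e and >, so the capture stops at the first of those characters rather than at a line break. Below is a minimal sketch of that behaviour. It uses Python's re module as a stand-in for the .NET engine (an assumption; Python spells named groups (?P<name>...)), and the sample payloads are made up for illustration.

```python
import re

# Python spelling of the character-class attempt; .NET would use (?<name>...).
pattern = re.compile(
    r"(?P<startTag><DALFile>)(?P<content>[^</DALFile>]*)(?P<endTag></DALFile>)?"
)

# A negated class matches newlines, so a line break alone does not stop it.
m1 = pattern.search("<DALFile>xxx\nxxx</DALFile>")
print(repr(m1.group("content")), m1.group("endTag"))

# Any character listed in the class does stop it: the 'F' ends the content early,
# and the optional end tag is then skipped.
m2 = pattern.search("<DALFile>my File</DALFile>")
print(repr(m2.group("content")), m2.group("endTag"))
```

The second case suggests that content goes missing whenever the payload contains one of the listed letters, whether or not it contains line breaks.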

Would someone be able to point out a way that would suit all scenarios?
Thanks in advance.
 
You really haven't clarified your rules. Several things are not clear.

Will the "chunks" ever split the tags themselves?

What sort of characters may be in the "content" between the tags?

Making a couple of assumptions, I came up with the following:

(?:(?<startTag><DALFile>))?(?<content>[^<]*)(?:(?<endTag></DALFile>))?

This can be broken up into 3 sections:

(?:(?<startTag><DALFile>))?

0 or 1 sequence of "<DALFile>" - assumption that it is never broken.

(?<content>[^<]*)

0 or more characters that are NOT '<' - assumption that the '<' character
may not appear between the tags.

(?:(?<endTag></DALFile>))?

0 or 1 sequences of "</DALFile>" - assumption that it is never broken.
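
As a rough check of the three sections described above, here is a minimal sketch in Python's re module (an assumption; the thread concerns .NET, and Python spells named groups (?P<name>...)). The helper name split_chunk and the sample chunks are hypothetical. Note that a negated class such as [^<] also matches line breaks.

```python
import re

# Python spelling of the pattern above; .NET writes named groups as (?<name>...).
PATTERN = re.compile(
    r"(?:(?P<startTag><DALFile>))?(?P<content>[^<]*)(?:(?P<endTag></DALFile>))?"
)

def split_chunk(chunk):
    """Return (startTag, content, endTag); a tag that did not match is None."""
    m = PATTERN.match(chunk)  # every part is optional, so this always matches
    return m.group("startTag"), m.group("content"), m.group("endTag")

whole = split_chunk("<DALFile>abc\ndef</DALFile>")   # both tags in one chunk
no_end = split_chunk("<DALFile>abc")                 # end tag not yet received
no_start = split_chunk("abc</DALFile>")              # start tag was in an earlier chunk
stray = split_chunk("<DALFile>a<b</DALFile>")        # '<' inside the content

for result in (whole, no_end, no_start, stray):
    print(result)
```

The last case shows the cost of the "no '<' in the content" assumption: the capture stops at the stray '<' and the end tag is not reported.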

Why are you not waiting until you get all of the string to parse it, rather
than attempting to parse "chunks?"

--
HTH,

Kevin Spencer
Microsoft MVP

Printing Components, Email Components,
FTP Client Classes, Enhanced Data Controls, much more.
DSI PrintManager, Miradyne Component Libraries:
http://www.miradyne.net
 
Hi Kevin, and thanks for the response.

Yes - I was concerned about overcomplicating the message so I omitted a few
rules.
Basically I have a series of files transferred through sockets and the
receiving socket is parsing the data as it arrives - and is not waiting for
the whole stream to arrive.

The files may be either ASCII or binary but all are transferred as binary. The
receiving socket uses the GetString method on the byte array and parses that
when it determines that the start/end of a file is in the current chunk. So
there may well be angle brackets inside the string in addition to those
introduced by the sending socket.

Yes, the tags may be split but I can handle the case when there is no match
easily enough.

I've taken a look at your solution and it doesn't appear to handle newline
characters for the content. From my reading it appears that the dot can be
made to treat carriage returns as characters, but I am unsure what other
constructs are available for this.

Thanks again for the reply.

Sean



 
You can make the dot match newlines by prefixing the expression with
"(?s)", the inline option for single-line mode (RegexOptions.Singleline),
as in the following:

(?s)(?:(?<startTag><DALFile>))?(?<content>.*)(?<endTag></DALFile>)?

The problem here is that the "content" group will now absorb the entire
remaining part of the string.
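
A minimal sketch of that effect, using Python's re module as a stand-in for the .NET engine (an assumption; Python spells named groups (?P<name>...) and accepts the same (?s) prefix). The sample text is made up.

```python
import re

TEXT = "<DALFile>line1\r\nline2</DALFile>"

# Without (?s) the dot stops at the first "\n".
plain = re.compile(
    r"(?:(?P<startTag><DALFile>))?(?P<content>.*)(?P<endTag></DALFile>)?"
)
# With (?s) the dot matches "\n" as well.
dotall = re.compile(
    r"(?s)(?:(?P<startTag><DALFile>))?(?P<content>.*)(?P<endTag></DALFile>)?"
)

m_plain = plain.match(TEXT)
m_dotall = dotall.match(TEXT)

print(repr(m_plain.group("content")))    # content is cut off at the line break
print(repr(m_dotall.group("content")))   # content now includes the end tag text
print(m_dotall.group("endTag"))          # the optional end tag group stays unmatched
```

Because the greedy .* succeeds all the way to the end of the input and the end tag group is optional, the engine has no reason to backtrack and hand the closing tag to the endTag group.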

In addition, one of your conditions makes the situation highly problematic:
there may well be angle brackets inside the string in addition to those
introduced by the sending socket.

I suspected that you might simply be trying to parse each bit that comes
through, and I think the solution is a compromise on your original
requirement. Parse the text in chunks that begin and end with the beginning
and ending tags. That is, don't attempt to use a regular expression until
you have a string ending with the end tag. This can be done by using a
second string buffer and putting each chunk received into it. When the end
tag is in a chunk, you put only the part of the chunk that ends in the end
tag, then parse the resulting string and continue receiving.
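
The buffering idea above could be sketched roughly as follows. This is an illustration under assumptions, not a definitive implementation: it is written in Python rather than .NET, the class name ChunkBuffer and its feed method are hypothetical, the chunks are assumed to be already decoded to text, and the content is matched lazily so that each match stops at the first end tag in the buffer.

```python
import re

# Lazy content so that each match stops at the first end tag in the buffer.
ELEMENT = re.compile(
    r"(?s)(?P<startTag><DALFile>)(?P<content>.*?)(?P<endTag></DALFile>)"
)

class ChunkBuffer:
    """Hypothetical helper: collects chunks and extracts complete elements."""

    def __init__(self):
        self._pending = ""  # text received so far that is not yet a full element

    def feed(self, chunk):
        """Append a chunk; return the content of every element now complete."""
        self._pending += chunk
        contents = []
        while True:
            m = ELEMENT.search(self._pending)
            if m is None:
                break  # no complete element yet; wait for more data
            contents.append(m.group("content"))
            # Keep whatever follows the end tag for the next element.
            self._pending = self._pending[m.end():]
        return contents

buf = ChunkBuffer()
results = []
# Tags are split across chunks, and the second element contains a stray '<'.
for chunk in ["<DALFi", "le>abc\nd", "ef</DAL", "File><DALFile>x<y</DALFile>"]:
    results.extend(buf.feed(chunk))
print(results)
```

Split tags need no special handling here, because the expression is only ever applied to the accumulated text. A payload that happens to contain the literal end tag would still be cut short; no pattern can tell that case apart from a real end tag.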

Here's why. Imagine a section that comes through as follows:

<DALFile>xxxxxxxxx</DAL

How do you identify the content?

According to your requirements, the following would be a legitimate element,
as you've said that angle brackets may appear prior to the end tag:

<DALFile>xxxxxxxxx</DAL</DALFile>

Again, what if a chunk comes through as follows:

LFile>xxx

The only way to ensure that you have a complete element is to get a complete
element to parse. In that case, you can use:

(?s)(?:(?<startTag><DALFile>))(?<content>.*)(?<endTag></DALFile>)

This requires that both the start and end tags are present, and will match
correctly.

If you have more than one tag, you can use a more generic approach:

(?s)(?:(?<startTag><([^>]+)>))(?<content>.*)(?<endTag></\1>)

This identifies the tag name of the start tag with a numbered capturing
group, and uses a reference to that tag name in the end capturing group.
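
A hedged sketch of the generic form in Python's re module (an assumption; the thread concerns .NET). Group numbering rules differ between engines, and Python numbers groups strictly left to right, so this version names the tag group and refers back to it with (?P=tag) instead of relying on \1. The content is also made lazy here, so that several elements in one string are matched separately; that is a choice made for the example, not part of the pattern above. The tag name Header is made up.

```python
import re

# "tag" captures the start tag's name; (?P=tag) requires the same name in the end tag.
GENERIC = re.compile(
    r"(?s)(?P<startTag><(?P<tag>[^>]+)>)(?P<content>.*?)(?P<endTag></(?P=tag)>)"
)

text = "<Header>v1</Header><DALFile>a\nb</DALFile>"
pairs = [(m.group("tag"), m.group("content")) for m in GENERIC.finditer(text)]
print(pairs)

# An end tag with a different name does not complete the element.
mismatch = GENERIC.search("<A>x</B>")
print(mismatch)
```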

--
HTH,

Kevin Spencer
Microsoft MVP


 