Regex puzzle


Alan Pretre

Can anyone help me figure out a regex pattern for the following input
example:

xxx:a=b,c=d,yyy:e=f,zzz:www:g=h,i=j,l=m

I would want four matches from this:
1. xxx a=b,c=d
2. yyy e=f
3. zzz (empty)
4. www g=h,i=j,l=m

None of the letters here are single letters, but rather placeholders for
arbitrary words. For example,

LTG:LTG=2-41-53-57,JOB:JN=113&&116&125&&127,CPT:CODE=09789,TRATYP=AMBINC-7-AMBINC/CPTGRP-0-CPTGRP

Would result in:
1. LTG LTG=2-41-53-57
2. JOB JN=113&&116&125&&127
3. CPT CODE=09789,TRATYP=AMBINC-7-AMBINC/CPTGRP-0-CPTGRP

Everything I've come up with so far would require me to iterate over
substrings. It'd be nice to have just a single matching operation. TIA.

-- Alan
 
Give this one a try:

(?n)((?<item>[A-Za-z]+):(?<value>[A-Za-z]+=.*?)?(?=((,[A-Za-z]+:)|$)))+

For your input, it gives 3 matches, each with "item"
and "value" groups for what comes before and after the
colon.
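
A minimal sketch of how the pattern above might be driven from C#, assuming it is written on a single line. The class name `ItemValueDemo` and the method `Split` are hypothetical names chosen for illustration; only `Regex`, `Match` and `Groups` come from System.Text.RegularExpressions.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

// Hypothetical helper, not part of the original post.
public static class ItemValueDemo
{
    // The pattern from the post, on one line.
    // (?n) turns on explicit capture, so only the named groups are kept.
    private static readonly Regex Pattern = new Regex(
        @"(?n)((?<item>[A-Za-z]+):(?<value>[A-Za-z]+=.*?)?(?=((,[A-Za-z]+:)|$)))+");

    // Returns one (item, value) pair per match, in input order.
    public static List<KeyValuePair<string, string>> Split(string input)
    {
        var result = new List<KeyValuePair<string, string>>();
        foreach (Match m in Pattern.Matches(input))
        {
            result.Add(new KeyValuePair<string, string>(
                m.Groups["item"].Value, m.Groups["value"].Value));
        }
        return result;
    }
}
```

Calling `ItemValueDemo.Split` on the LTG/JOB/CPT input should yield three pairs. On the xxx/yyy/zzz/www example, the empty `zzz:` item does not appear to be matched, because the lookahead expects a comma (or the end of the string) after each item and `zzz:` is followed directly by `www:`.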


Brian Davis
www.knowdotnet.com
 
Brian Davis said:
Give this one a try:

(?n)((?<item>[A-Za-z]+):(?<value>[A-Za-z]+=.*?)?(?=((,[A-Za-z]+:)|$)))+

For your input, it gives 3 matches, each with "item"
and "value" groups for what comes before and after the
colon.

Brian, *very* impressive. It works beautifully. I changed the last term to
(?=(,[A-Za-z]+:)|$))+
since it looked like there were extraneous parentheses. You gave me much to
study. Thanks again.

-- Alan
 
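Something like this:
using System;
using System.Text.RegularExpressions;

class Program
{
    static void Main(string[] args)
    {
        string constant = "LTG:LTG=2-41-53-57,JOB:JN=113&&116&125&&127,CPT:CODE=09789,TRATYP=AMBINC-7-AMBINC/CPTGRP-0-CPTGRP";
        // A three-character code, a colon, then zero or more key=value pairs,
        // each optionally followed by a comma.
        Regex reg = new Regex(@"(?'code'\w\w\w):(?'value'[a-zA-Z0-9\-&/]+=[a-zA-Z0-9\-&/]+,?)*");
        MatchCollection coll = reg.Matches(constant);
        int i = 0;
        foreach (Match match in coll)
        {
            // match.Value is the whole section, including any trailing comma.
            Console.WriteLine(i++ + ". " + match.Groups["code"] + " -- " + match.Value);
        }
    }
}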
 
How about?
(\w+):([^:]+)?,(\w+):([^:]+)?,(\w+):([^:]+)?

Go to http://www.organicbit.com/regex/fog0000000019.html and get the regex
tool; it's handy for building these things.

The tool helps while you are writing the regex, but it is cumbersome for
verifying that the regex matches correctly across a large set of inputs.
For that you would be better off with a unit test app that stores an array
of (input, output) pairs, runs the regex on each input, and compares the
result to the expected output. (Example below.)

-Dino


//
// emailValidation.cs
//
// uses a regexp to validate emails.
// This test program uses xml serialization to get the test input,
// including the regexp string and the various emails to test.
//
// references:
// http://homepage.stts.edu/~agushen/script/emailvalidation.html
//
// Fri, 15 Aug 2003 11:28
//

using Ionic.Test.EmailValidation;

namespace Ionic.Test.EmailValidation {

    /// <remarks>
    /// Represents all the input for the test, including the regex to test,
    /// and an array of test cases.
    /// </remarks>
    [System.Xml.Serialization.XmlRootAttribute("Email.Validation.Input",
        Namespace="", IsNullable=false)]
    public class TestInput {

        /// <remarks/>
        [System.Xml.Serialization.XmlElementAttribute(
            Form=System.Xml.Schema.XmlSchemaForm.Unqualified)]
        public string Regexp;

        /// <remarks/>
        [System.Xml.Serialization.XmlArrayAttribute(
            Form=System.Xml.Schema.XmlSchemaForm.Unqualified)]
        [System.Xml.Serialization.XmlArrayItemAttribute("Case",
            Form=System.Xml.Schema.XmlSchemaForm.Unqualified, IsNullable=false)]
        public TestCase[] TestList;
    }


    /// <remarks>
    /// This is the type that stores a single test case.
    /// We need a bunch of these to verify that the regex works as
    /// expected. Each test case has an input and an output. In our
    /// case, the input is a string, and the output is a bool value,
    /// which indicates whether the Regex should match or not.
    /// Other tests will have different input and output.
    /// </remarks>
    public class TestCase {

        /// <remarks/>
        [System.Xml.Serialization.XmlElementAttribute(
            Form=System.Xml.Schema.XmlSchemaForm.Unqualified)]
        public string Input;

        /// <remarks/>
        [System.Xml.Serialization.XmlElementAttribute(
            Form=System.Xml.Schema.XmlSchemaForm.Unqualified)]
        public bool ExpectedOutput;
    }


    /// <remarks>
    /// This is the test app. The main routine de-serializes from
    /// an XML file, then runs the tests, comparing the expected
    /// (or desired) output with the actual result.
    /// </remarks>
    public class Tester {

        public static void Main() {
            string InputPath= "EmailValidationInput.xml";

            System.IO.FileStream fs = new System.IO.FileStream(InputPath,
                System.IO.FileMode.Open);
            System.Xml.Serialization.XmlSerializer s= new
                System.Xml.Serialization.XmlSerializer(typeof(TestInput));
            TestInput Input= (TestInput) s.Deserialize(fs);
            fs.Close();

            System.Text.RegularExpressions.Regex regex= new
                System.Text.RegularExpressions.Regex (Input.Regexp);

            foreach (TestCase tc in Input.TestList) {
                System.Console.WriteLine(tc.Input +"\n " + tc.ExpectedOutput + " \\ " +
                    regex.IsMatch(tc.Input));
            }
        }
    }
}


This is input data. Store this in the XML file that is de-serialized for
this test.

<Email.Validation.Input>
<TestList>
<!-- ================================================================== -->
<!-- =================== True test cases ============================== -->
<!-- ================================================================== -->

<Case>
<Input>[email protected]</Input>
<ExpectedOutput>true</ExpectedOutput>
</Case>
<Case>
<Input>[email protected]</Input>
<ExpectedOutput>true</ExpectedOutput>
</Case>
<Case>
<Input>[email protected]</Input>
<ExpectedOutput>true</ExpectedOutput>
</Case>
<Case>
<Input>[email protected]</Input>
<ExpectedOutput>true</ExpectedOutput>
</Case>
<Case>
<Input>[email protected]</Input>
<ExpectedOutput>true</ExpectedOutput>
</Case>
<Case>
<Input>[email protected]</Input>
<ExpectedOutput>true</ExpectedOutput>
</Case>
<Case>
<Input>[email protected]</Input>
<ExpectedOutput>true</ExpectedOutput>
</Case>
<Case>
<Input>[email protected]</Input>
<ExpectedOutput>true</ExpectedOutput>
</Case>
<Case>
<Input>[email protected]</Input>
<ExpectedOutput>true</ExpectedOutput>
</Case>
<Case>
<Input>[email protected]</Input>
<ExpectedOutput>true</ExpectedOutput>
</Case>
<Case>
<Input>[email protected]</Input>
<ExpectedOutput>true</ExpectedOutput>
</Case>
<Case>
<Input>[email protected]</Input>
<ExpectedOutput>true</ExpectedOutput>
</Case>
<Case>
<Input>[email protected]</Input>
<ExpectedOutput>true</ExpectedOutput>
</Case>
<Case>
<Input>[email protected]</Input>
<ExpectedOutput>true</ExpectedOutput>
</Case>
<Case>
<Input>[email protected]</Input>
<ExpectedOutput>true</ExpectedOutput>
</Case>

<!-- ================================================================== -->
<!-- =================== False test cases ============================= -->
<!-- ================================================================== -->

<Case>
<Input>[email protected]</Input>
<ExpectedOutput>false</ExpectedOutput>
</Case>
<Case>
<Input>[email protected].</Input>
<ExpectedOutput>false</ExpectedOutput>
</Case>
<Case>
<Input>[email protected].</Input>
<ExpectedOutput>false</ExpectedOutput>
</Case>
<Case>
<Input>[email protected].</Input>
<ExpectedOutput>false</ExpectedOutput>
</Case>
<Case>
<Input>[email protected]</Input>
<ExpectedOutput>false</ExpectedOutput>
</Case>
<Case>
<Input>[email protected]</Input>
<ExpectedOutput>false</ExpectedOutput>
</Case>
<Case>
<Input>elmo@cloud9</Input>
<ExpectedOutput>false</ExpectedOutput>
</Case>
<Case>
<Input>[email protected]</Input>
<ExpectedOutput>false</ExpectedOutput>
</Case>
<Case>
<Input>[email protected]</Input>
<ExpectedOutput>false</ExpectedOutput>
</Case>
<Case>
<Input>9Lives.club.org</Input>
<ExpectedOutput>false</ExpectedOutput>
</Case>
<Case>
<Input>@club.org</Input>
<ExpectedOutput>false</ExpectedOutput>
</Case>
<Case>
<Input>[email protected]</Input>
<ExpectedOutput>false</ExpectedOutput>
</Case>

</TestList>
<Regexp>^(\w([\.\-\w]*\w)?)@(\w([\.\-\w]*\w)*\.\w([\.\-\w]*\w)?)$</Regexp>
</Email.Validation.Input>
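
The harness above checks a yes/no regex. For the splitting regex discussed in this thread, the expected output for each input would be a list of tags rather than a bool. Here is a rough sketch of the same table-driven idea under that assumption. `SplitCase` and `SplitRegexChecker` are hypothetical names, and the code assumes the pattern under test yields one match per section and exposes a named group called `item` (as Brian Davis's pattern does).

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

// Hypothetical test case type: one input string and the tags we expect back.
public sealed class SplitCase
{
    public string Input;
    public string[] ExpectedItems;

    public SplitCase(string input, params string[] expectedItems)
    {
        Input = input;
        ExpectedItems = expectedItems;
    }
}

// Hypothetical checker: runs the regex over every case and reports mismatches.
public static class SplitRegexChecker
{
    // Returns one description per failing case; an empty list means all passed.
    public static List<string> Check(Regex regex, IEnumerable<SplitCase> cases)
    {
        var failures = new List<string>();
        foreach (SplitCase c in cases)
        {
            // Collect the "item" group of every match, in order.
            string[] actual = regex.Matches(c.Input)
                .Cast<Match>()
                .Select(m => m.Groups["item"].Value)
                .ToArray();

            if (!actual.SequenceEqual(c.ExpectedItems))
            {
                failures.Add(c.Input + ": expected [" + string.Join(",", c.ExpectedItems)
                    + "] but got [" + string.Join(",", actual) + "]");
            }
        }
        return failures;
    }
}
```

The cases could equally be loaded from an XML file as in the email example; an in-memory list just keeps the sketch short.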
 
Dino Chiesa said:
How about?
(\w+):([^:]+)?,(\w+):([^:]+)?,(\w+):([^:]+)?

Dino,

Your regex fails (no match) with a simple test, CMD:PARM=X, and I didn't
have much luck with others I tried. For example, my OP had this example,

LTG:LTG=2-41-53-57,JOB:JN=113&&116&125&&127,CPT:CODE=09789,TRATYP=AMBINC-7-AMBINC/CPTGRP-0-CPTGRP

Your regex gives this result:
1 matches.
Match 1 has 7 groups.
Group 1 = "LTG:LTG=2-41-53-57,JOB:JN=113&&116&125&&127,CPT:CODE=09789,TRATYP=AMBINC-7-AMBINC/CPTGRP-0-CPTGRP"
Group 2 = "LTG"
Group 3 = "LTG=2-41-53-57"
Group 4 = "JOB"
Group 5 = "JN=113&&116&125&&127"
Group 6 = "CPT"
Group 7 = "CODE=09789,TRATYP=AMBINC-7-AMBINC/CPTGRP-0-CPTGRP"

But I was looking for something more along the lines of (Group 2 & 3 in each
match are the desired values):
3 matches.
Match 1 has 3 groups.
Group 1 = "LTG:LTG=2-41-53-57"
Group 2 = "LTG"
Group 3 = "LTG=2-41-53-57"
Match 2 has 3 groups.
Group 1 = "JOB:JN=113&&116&125&&127"
Group 2 = "JOB"
Group 3 = "JN=113&&116&125&&127"
Match 3 has 3 groups.
Group 1 = "CPT:CODE=09789,TRATYP=AMBINC-7-AMBINC/CPTGRP-0-CPTGRP"
Group 2 = "CPT"
Group 3 = "CODE=09789,TRATYP=AMBINC-7-AMBINC/CPTGRP-0-CPTGRP"

But thanks for your advice. I will study what you supplied to try to
understand it as well. Thanks!
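
Both results described in this post follow from the pattern spelling out exactly three `tag:value` items separated by two literal commas. A small sketch that reproduces them, assuming the pattern is used unchanged; `FixedCountPatternDemo` and its methods are hypothetical names used only for illustration.

```csharp
using System.Text.RegularExpressions;

// Hypothetical helper for illustration.
public static class FixedCountPatternDemo
{
    // The pattern quoted above: three items, two mandatory commas.
    private static readonly Regex Pattern = new Regex(
        @"(\w+):([^:]+)?,(\w+):([^:]+)?,(\w+):([^:]+)?");

    // Number of matches found in the input.
    public static int CountMatches(string input)
    {
        return Pattern.Matches(input).Count;
    }

    // Number of groups in the first match, counting group 0 (the whole
    // match) as .NET does; returns 0 when there is no match.
    public static int GroupCountOfFirstMatch(string input)
    {
        Match m = Pattern.Match(input);
        return m.Success ? m.Groups.Count : 0;
    }
}
```

With a single item such as CMD:PARM=X there is no comma for the pattern to consume, so `CountMatches` should return 0. With the three-item LTG/JOB/CPT input the whole string is taken by one match, so there is nothing left for a second or third match.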

-- Alan
 
Try the following:

Regex regex = new Regex(@"
    (                # overall repetition
      (?<Item>       # capture to Item
        (?<Tag>.*?)  # any character, zero or more times, non-greedy
        :            # literal :
        .*?          # any character, zero or more times, non-greedy
      )              # end of capture
      ,?             # optional "","". This eats the comma between the Items
      (?=            # zero-width lookahead. This must match at this spot
        (\w+:        # one or more word characters, followed by a literal :
        |            # or
        $            # end of string
        )
      )
    )+               # one or more times",
    RegexOptions.ExplicitCapture |
    RegexOptions.Compiled |
    RegexOptions.Singleline |
    RegexOptions.IgnorePatternWhitespace);

The key to this is the zero-width lookahead. It ensures that the part after
the match is either <xxx>:, or the end of the string, without eating any of
the characters. As you've probably found, without this there's no way to
know whether you should include a comma or break on it.



Here's the output I get from my regex workbench:

Matching:
LTG:LTG=2-41-53-57,JOB:JN=113&&116&125&&127,CPT:CODE=09789,TRATYP=AMBINC-7-AMBINC/CPTGRP-0-CPTGRP
Item => LTG:LTG=2-41-53-57
Item => JOB:JN=113&&116&125&&127
Item => CPT:CODE=09789,TRATYP=AMBINC-7-AMBINC/CPTGRP-0-CPTGRP
Tag => LTG
Tag => JOB
Tag => CPT
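
Because the whole pattern sits inside one repeated group, .NET reports a single match, and each Item and Tag is available through the group's `Captures` collection rather than as a separate match. A minimal sketch of reading them, with the pattern condensed onto one line; `CaptureDemo` and `Captures` are hypothetical names for illustration.

```csharp
using System.Collections.Generic;
using System.Text.RegularExpressions;

// Hypothetical helper for illustration.
public static class CaptureDemo
{
    // Same pattern as above, without the whitespace and comments.
    private static readonly Regex Pattern = new Regex(
        @"((?<Item>(?<Tag>.*?):.*?),?(?=(\w+:|$)))+",
        RegexOptions.ExplicitCapture | RegexOptions.Singleline);

    // Returns every capture recorded for the named group ("Item" or "Tag")
    // within the first match, in the order they were captured.
    public static List<string> Captures(string input, string groupName)
    {
        var values = new List<string>();
        Match m = Pattern.Match(input);
        if (!m.Success)
        {
            return values;
        }
        foreach (Capture c in m.Groups[groupName].Captures)
        {
            values.Add(c.Value);
        }
        return values;
    }
}
```

For the original xxx/yyy/zzz/www example, `CaptureDemo.Captures(input, "Tag")` should return all four tags, including `zzz`, since the lookahead accepts `www:` directly after `zzz:`.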

--
Eric Gunnerson

Visit the C# product team at http://www.csharp.net
Eric's blog is at http://blogs.gotdotnet.com/ericgu/

This posting is provided "AS IS" with no warranties, and confers no rights.
 