Regular Expressions in C#

LordHog

Hello all,

I am attempting to create a small scripting application to be used
during testing. After extracting the commands from the script file, I was
going to tokenize each line, since one of the requirements is that there
is one command per line. I have always wanted to learn Regular
Expressions, so I was hoping I might do this using Regular Expressions.
A fair number of the commands will have syntax like

Write( 0x123, 0x12, 25, 100 )  <- Write three bytes to address 0x123
Write(varName1, 0x12)          <- Write one byte to the address expressed
                                  by the value of varName1
Read( 0x55, 5 )                <- Read five bytes from address 0x55
Read(0x3456, 0x12)             <- Read eighteen bytes from address 0x3456
varName2 = Read( varName1 )    <- Read one byte from the address expressed
                                  by the value of varName1 and store the
                                  value read in varName2


I know that the regular expression (^[a-zA-Z]*) will find the initial
keyword or variable name, which I can then check to make sure the keyword
is valid or the variable has already been declared, but the hard part is
creating a regular expression to match the various forms of the syntax.
How would I create a regular expression for the first and last script
commands? I think with those I can attempt to work out the others. The
spaces between the arguments are optional and may be omitted if the user
so desires.

For the first script command I was attempting to craft one that looks
like..

(^[a-zA-Z]*)('\(')(['0x',0-9][a-zA-Z]*)(',')(['0x',0-9][a-zA-Z]*)

but this obviously doesn't work. Any help is greatly appreciated.

Mark
 
Hi Mark,

For parsing script commands you might consider using a lexical analyser
like CsLex or C# Lex, maybe with a grammar parser such as GPPG.

To match your first command, try something like:
\w+\((\s*0x\d+\s*,\s*){2}\d+\s*,\s*\d+\s*\)
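
A quick way to try this pattern from C# might look like the sketch below. It assumes the group is meant to close just before the {2} quantifier, and the class and method names are made up for illustration. Because 0x\d+ only accepts decimal digits after 0x, an address such as 0xAB would need [\dA-Fa-f] instead.

```csharp
using System;
using System.Text.RegularExpressions;

static class FirstCommandDemo
{
    // Two "0x<digits>," arguments followed by two plain decimal arguments.
    // Assumption: the capturing group closes right before {2}.
    public static readonly Regex FourArgCall = new Regex(
        @"\w+\((\s*0x\d+\s*,\s*){2}\d+\s*,\s*\d+\s*\)");

    public static bool IsFourArgCall(string line)
    {
        return FourArgCall.IsMatch(line);
    }

    static void Main()
    {
        Console.WriteLine(IsFourArgCall("Write( 0x123, 0x12, 25, 100 )")); // True
        Console.WriteLine(IsFourArgCall("Write(0x123,0x12,25,100)"));      // True
        Console.WriteLine(IsFourArgCall("Write(varName1, 0x12)"));         // False
    }
}
```

The pattern is fixed to exactly four arguments, so the shorter forms of Write and Read need their own patterns or a more general one.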

There's a great regexp reference here:
http://www.regular-expressions.info/reference.html

HTH,
Chris
 
I couldn't help but bite on this one. It is a very challenging problem. Here
is your solution:

(?i)(?:(?<function>Write|Read)\s*\()\s*|(?<=(?:(?:Write|Read)\s*\(\s*)|(?:(?:[\d\w]+\s*,\s*)))(?<parameter>[\d\w]+)(?=,\s*|\s*\))

Let me break it down a bit. First, I used (?i) to make the match
case-insensitive.
Next, I had the problem of identifying *both* function names and parameters
in the same Regular Expression.

The function name Regular Expression is:

(?:(?<function>Write|Read)\s*\(\s*)

"function" is the name of the capturing group, which captures only the
function name. The rest of the match is to identify it as a function.

It will match only if the function name is "Read" or "Write" and is followed
by an opening parenthesis. I assumed that any token may have any number of
white-space characters before and after it. This was not too tricky.

The second one is a bit trickier:

(?<=(?:(?:Write|Read)\s*\(\s*)|(?:(?:[\d\w]+\s*,\s*)))(?<parameter>[\d\w]+)(?=,\s*|\s*\))

The trick here is to identify a parameter from inside a set of function
parameters.

The rules break down as:

1. The first parameter is preceded by a function name followed by an open
parenthesis, as in:

Write (

2. Any later parameter is preceded by another parameter followed by a comma.

Write(param1,

- or -

Write(.......param3,

3. It is always followed by either a comma or an end-parenthesis.

param1,
- or -
param2 )

So, starting with the third rule, we get:

(?<parameter>[\d\w]+)(?=,\s*|\s*\))

"parameter" is the name of the capturing group, which according to these
rules is an alphanumeric token. The rest of it is how the parameter is
matched. It is a positive look-ahead, which means that it *must* be followed
by either a comma or an end parenthesis.

However, the problem here is that *any* word in the string that is not a
function and is followed by a comma or an end parenthesis will match this,
as in:

Read( 0x55, 5 ) <- Write one byte, to (address 0x55)

In this line, "byte" and the "0x55" inside "(address 0x55)" will match as
well as the two real parameters.
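
To see the over-matching for yourself, a small C# sketch such as the following could be used. It runs only the look-ahead half of the expression against the sample line; the class and method names are hypothetical.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

static class LookaheadOnlyDemo
{
    // Collects every token that is followed by a comma or a closing
    // parenthesis, with no check on what comes before it.
    public static List<string> FindCandidates(string line)
    {
        var found = new List<string>();
        foreach (Match m in Regex.Matches(line, @"(?<parameter>[\d\w]+)(?=,\s*|\s*\))"))
        {
            found.Add(m.Groups["parameter"].Value);
        }
        return found;
    }

    static void Main()
    {
        string line = "Read( 0x55, 5 ) <- Write one byte, to (address 0x55)";
        // Prints: 0x55 | 5 | byte | 0x55
        Console.WriteLine(string.Join(" | ", FindCandidates(line)));
    }
}
```

The last two entries come from the comment text, which is why the look-behind condition is needed.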

So, how do we eliminate non-parameters? Well, obviously, a parameter is
defined as being inside the parentheses of a function call. So, first, use a
positive look-behind to see if it is preceded by a function call. We need to
identify the function, using the same syntax as before:

(?:(?:Write|Read)\s*\(\s*)

However, it may have a parameter before it, instead of the function call. So
we use an OR "|" operator to indicate that it may be preceded by:

(?:(?:[\d\w]+\s*,\s*))

Note that we have changed the rule slightly. Any parameter which precedes
another parameter will *not* be followed by an end-parenthesis. It will
*always* be followed by a comma.

So, we use the Positive Lookbehind syntax (?<=) coupled with an OR operator
("|"), and get:

(?<=(?:(?:Write|Read)\s*\(\s*)|(?:(?:[\d\w]+\s*,\s*)))(?<parameter>[\d\w]+)(?=,\s*|\s*\))

Translated: Match any alphanumeric set of tokens which is followed by either
a comma or an end parenthesis, and is preceded either by a function call or
by another parameter.

Now to put them together, we use the OR operator:

(?i)(?:(?<function>Write|Read)\s*\()\s*|(?<=(?:(?:Write|Read)\s*\(\s*)|(?:(?:[\d\w]+\s*,\s*)))(?<parameter>[\d\w]+)(?=,\s*|\s*\))

The function name will be captured into the "function" group, and all of the
parameters will be captured into the "parameter" group. This could be stated
as:

Match any token that is either "Read" or "Write" followed by an open
parenthesis, and call it "function," OR Match any alphanumeric set of tokens
which is followed by either a comma or an end parenthesis, and is preceded
either by a function call or by another parameter, and call it "parameter."
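
For anyone who wants to try the combined expression, here is a minimal C# sketch. The class name, method name, and the "function:" / "parameter:" labels are invented for the example; only the pattern itself comes from the explanation above. It relies on .NET allowing variable-length look-behind.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

static class CombinedPatternDemo
{
    const string Pattern =
        @"(?i)(?:(?<function>Write|Read)\s*\()\s*|(?<=(?:(?:Write|Read)\s*\(\s*)|(?:(?:[\d\w]+\s*,\s*)))(?<parameter>[\d\w]+)(?=,\s*|\s*\))";

    // Each match fills either the "function" group or the "parameter" group,
    // so the matches are labelled by whichever group succeeded.
    public static List<string> Tokenize(string line)
    {
        var tokens = new List<string>();
        foreach (Match m in Regex.Matches(line, Pattern))
        {
            if (m.Groups["function"].Success)
                tokens.Add("function:" + m.Groups["function"].Value);
            else if (m.Groups["parameter"].Success)
                tokens.Add("parameter:" + m.Groups["parameter"].Value);
        }
        return tokens;
    }

    static void Main()
    {
        foreach (string token in Tokenize("Write( 0x123, 0x12, 25, 100 )"))
            Console.WriteLine(token);
    }
}
```

Note that for a line such as "varName2 = Read( varName1 )" this only yields the function and its parameter; the assignment target on the left is not captured by this expression.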

You sure picked a doozy to start out with!

--
HTH,

Kevin Spencer
Microsoft MVP
Professional Numbskull

Hard work is a medication for which
there is no placebo.

 
Kevin,

Thanks for providing a response, and I am sorry for such a long delay
in my follow-up. I found help in the RegEx group, which helped out a
great deal. I wanted to share the RegExes that I have thus far. They are
not fully tested, but they are functional for the most part. I used
unnamed groups for just about everything, since that is just how I
decided to parse everything out. I might change that in the
future if I find this approach problematic.

So here we go...


Syntax format: Write( address, data [, 44] )

\s*Write\s*\((?:\s*(\d+|0x[\dA-Fa-f]+|[a-zA-Z][\da-zA-Z]*){1,1}(?:\s*,\s*(\d+|0x[\dA-Fa-f]+|[a-zA-Z][\da-zA-Z]*))*\s*\))\s*$
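
A rough sketch of reading the results in C# follows. Since the second group sits inside a repeated (...)*, its individual values have to be read from Group.Captures rather than Group.Value, which only holds the last one. The class and method names are hypothetical.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

static class WriteCommandDemo
{
    static readonly Regex WritePattern = new Regex(
        @"\s*Write\s*\((?:\s*(\d+|0x[\dA-Fa-f]+|[a-zA-Z][\da-zA-Z]*){1,1}(?:\s*,\s*(\d+|0x[\dA-Fa-f]+|[a-zA-Z][\da-zA-Z]*))*\s*\))\s*$");

    // Returns the address followed by every data value,
    // or an empty list when the line is not a Write command.
    public static List<string> ParseWrite(string line)
    {
        var args = new List<string>();
        Match m = WritePattern.Match(line);
        if (!m.Success) return args;

        args.Add(m.Groups[1].Value);               // group 1: address
        foreach (Capture c in m.Groups[2].Captures) // group 2: one capture per data value
            args.Add(c.Value);
        return args;
    }

    static void Main()
    {
        // Prints: 0x123, 0x12, 25, 100
        Console.WriteLine(string.Join(", ", ParseWrite("Write( 0x123, 0x12, 25, 100 )")));
    }
}
```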



Syntax format: [variable3 =] Read( 0x44 [, 44] )

Group 1 : Optional: variable name with equal sign
(e.g. "variable2 =")
Group 2 : Required: Read keyword
Group 3 : Required: Address
Group 4 : Optional: Number of bytes to read starting at 'Address'

^\s*(?:([a-zA-Z][\da-zA-Z]*)\s*=\s*){0,1}(?:\s*(Read){1,1}\s*)\((?:\s*(\d+|0x[\dA-Fa-f]+|[a-zA-Z][\da-zA-Z]*)(?:\s*,\s*(\d+|0x[\dA-Fa-f]+|[a-zA-Z][\da-zA-Z]*))*\s*\))\s*$
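
As an illustration of how the optional groups might be checked in C#, here is a small sketch. It assumes the variable-name part of group 1 is meant to be [a-zA-Z][\da-zA-Z]*, the same identifier form used for the arguments; the class name, method name, and the "-" placeholder are made up for the example.

```csharp
using System;
using System.Text.RegularExpressions;

static class ReadCommandDemo
{
    static readonly Regex ReadPattern = new Regex(
        @"^\s*(?:([a-zA-Z][\da-zA-Z]*)\s*=\s*){0,1}(?:\s*(Read){1,1}\s*)\((?:\s*(\d+|0x[\dA-Fa-f]+|[a-zA-Z][\da-zA-Z]*)(?:\s*,\s*(\d+|0x[\dA-Fa-f]+|[a-zA-Z][\da-zA-Z]*))*\s*\))\s*$");

    // Formats a match as "target|address|count", with "-" for an absent group.
    public static string Describe(string line)
    {
        Match m = ReadPattern.Match(line);
        if (!m.Success) return "no match";

        // Group.Success tells an absent optional group apart from an empty one.
        string target = m.Groups[1].Success ? m.Groups[1].Value : "-";
        string count = m.Groups[4].Success ? m.Groups[4].Value : "-";
        return target + "|" + m.Groups[3].Value + "|" + count;
    }

    static void Main()
    {
        Console.WriteLine(Describe("varName2 = Read( varName1 )")); // varName2|varName1|-
        Console.WriteLine(Describe("Read(0x3456, 0x12)"));          // -|0x3456|0x12
    }
}
```

Because group 4 is inside a (...)* repetition, the pattern also accepts more than one value after the address, so the number of captures in group 4 may be worth checking separately.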


This one is rather long, but there are multiple cases that I need to
account for. I could have created a RegEx for each individual case,
but I would rather have one all-encompassing RegEx and then check each of
the parameters, instead of running each RegEx in turn, which I think would
be slower. You can change byte to short, int, or float, which are the
other types used in my application.


Syntax format: byte var1

Group 1 : Required: var1
Group 2 : Optional: Not Present
Group 3 : Optional: Not Present
Group 4 : Optional: Not Present
Group 5 : Optional: Not Present
Group 6 : Optional: Not Present
Group 7 : Optional: Not Present

----------------------------------------------

Syntax format: byte var2 = variableNew

Group 1 : Required: var2
Group 2 : Optional: Not Present
Group 3 : Optional: Not Present
Group 4 : Optional: Not Present
Group 5 : Optional: variableNew
Group 6 : Optional: Not Present
Group 7 : Optional: Not Present

----------------------------------------------

Syntax format: byte var3[3] = { 0x11, 0xAA, 0x33 }

Group 1 : Required: var3
Group 2 : Optional: 3
Group 3 : Optional: 0x11
Group 4 : Optional:
Capture 1: 0xAA
Capture 2: 0x33
Group 5 : Optional: Not Present
Group 6 : Optional: Not Present
Group 7 : Optional: Not Present

----------------------------------------------

Syntax format: byte var4[] = { 0x33, 0x444 }

Group 1 : Required: var4
Group 2 : Optional: Not Present
Group 3 : Optional: 0x33
Group 4 : Optional: 0x444
Group 5 : Optional: Not Present
Group 6 : Optional: Not Present
Group 7 : Optional: Not Present

----------------------------------------------

Syntax format: byte var5[5] = 5555

Group 1 : Required: var5
Group 2 : Optional: Not Present
Group 3 : Optional: Not Present
Group 4 : Optional: Not Present
Group 5 : Optional: Not Present
Group 6 : Optional: 5
Group 7 : Optional: 5555


^\s*byte
(?:\s*([a-zA-Z][\da-zA-Z]*))(?:\[(?:\s*(\d+)\s*)?\]\s*=\s*(?:\s*\{\s*(\d+|0x[\dA-Fa-f]*)(?:\s*,\s*(\d+|0x[\dA-Fa-f]*))*\s*\})|\s*=\s*(?:(\d+|0x[\dA-Fa-f]+|[a-zA-Z][\da-zA-Z]*))|(?:\[(?:\s*(\d+)\s*)?\]\s*=\s*(\d+|0x[\dA-Fa-f]+|[a-zA-Z][\da-zA-Z]*)))?\s*$
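
One way to check the group tables above is to print every group that took part in the match. The C# sketch below does that; it assumes the two wrapped lines of the expression are joined with nothing between them, and the class and method names are hypothetical.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

static class ByteDeclarationDemo
{
    static readonly Regex ByteDecl = new Regex(
        @"^\s*byte(?:\s*([a-zA-Z][\da-zA-Z]*))(?:\[(?:\s*(\d+)\s*)?\]\s*=\s*(?:\s*\{\s*(\d+|0x[\dA-Fa-f]*)(?:\s*,\s*(\d+|0x[\dA-Fa-f]*))*\s*\})|\s*=\s*(?:(\d+|0x[\dA-Fa-f]+|[a-zA-Z][\da-zA-Z]*))|(?:\[(?:\s*(\d+)\s*)?\]\s*=\s*(\d+|0x[\dA-Fa-f]+|[a-zA-Z][\da-zA-Z]*)))?\s*$");

    // Lists each group that matched as "number=value", joining repeated
    // captures of the same group with commas.
    public static string Describe(string line)
    {
        Match m = ByteDecl.Match(line);
        if (!m.Success) return "no match";

        var parts = new List<string>();
        for (int i = 1; i < m.Groups.Count; i++)
        {
            Group g = m.Groups[i];
            if (!g.Success) continue;          // optional group not present
            var values = new List<string>();
            foreach (Capture c in g.Captures)
                values.Add(c.Value);
            parts.Add(i + "=" + string.Join(",", values));
        }
        return string.Join(" ", parts);
    }

    static void Main()
    {
        Console.WriteLine(Describe("byte var1"));                           // 1=var1
        Console.WriteLine(Describe("byte var2 = variableNew"));             // 1=var2 5=variableNew
        Console.WriteLine(Describe("byte var3[3] = { 0x11, 0xAA, 0x33 }")); // 1=var3 2=3 3=0x11 4=0xAA,0x33
        Console.WriteLine(Describe("byte var4[] = { 0x33, 0x444 }"));       // 1=var4 3=0x33 4=0x444
        Console.WriteLine(Describe("byte var5[5] = 5555"));                 // 1=var5 6=5 7=5555
    }
}
```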



Syntax format: SetCommParam(COMn, BaudRate, DataBits, StopBits,
Parity)

Note: The Comm Port and Parity strings are case sensitive

Group 1 : Required: Port Number { COMn }
Group 2 : Required: Baud Rate
Group 3 : Required: DataBits
Group 4 : Required: StopBits
Group 5 : Required: Parity { None, Odd, Even, Mark, Space }

^\s*SetCommParam\s*\(\s*(?:(COM\d+))\s*,\s*(?:(\d+))\s*,\s*(?:([5-8])){1,1}\s*,\s*(?:(1|1\.5|2))\s*,\s*(?:(None|Odd|Even|Mark|Space))\s*\)\s*$
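
A short C# sketch of pulling the five values out is shown below. It assumes the dot in the 1.5 stop-bits alternative is meant literally and so escapes it; the class and method names are invented for the example.

```csharp
using System;
using System.Text.RegularExpressions;

static class CommParamDemo
{
    static readonly Regex CommPattern = new Regex(
        @"^\s*SetCommParam\s*\(\s*(?:(COM\d+))\s*,\s*(?:(\d+))\s*,\s*(?:([5-8])){1,1}\s*,\s*(?:(1|1\.5|2))\s*,\s*(?:(None|Odd|Even|Mark|Space))\s*\)\s*$");

    // Returns { port, baud rate, data bits, stop bits, parity },
    // or an empty array when the line does not match.
    public static string[] Parse(string line)
    {
        Match m = CommPattern.Match(line);
        if (!m.Success) return new string[0];
        return new[]
        {
            m.Groups[1].Value, m.Groups[2].Value, m.Groups[3].Value,
            m.Groups[4].Value, m.Groups[5].Value
        };
    }

    static void Main()
    {
        // Prints: COM1 9600 8 1.5 None
        Console.WriteLine(string.Join(" ", Parse("SetCommParam(COM1, 9600, 8, 1.5, None)")));
        // Prints: 0 (the port and parity strings are case sensitive)
        Console.WriteLine(Parse("SetCommParam(com1, 9600, 8, 1, none)").Length);
    }
}
```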


I hope this might help someone else in the future. Thanks to all of
the great people on the newsgroups and forums.

Mark
 