parsing problem

ocengiz0 · May 11, 2009

Hi 2 everyone,
i want to implement a parser that will always divide a string
according to the requirements.
i want to explain this by giving examples.
The main string formed of specific characters. these are '(' and 'a'
and 'o' and ')' . between these chars there are some id's like C1 C2
C3.. so on

what i need is i want to divide the coming string to 2(always)
Like if the string comes : (C1aC2) i want -> string1: C1 string2: a
string3:C2
or i have more complex: (C1aC2)o(C3oC4) i want -> string1: (C1aC2)
string2: o string3

C3oC4)
more complex: (C1a(C3oC4)) i want -> string1: C1 string2: a string3:
(C3oC4)
or ((C1aC3)oC4) i want -> string1: (C1aC3)string2: o string3:C4)

i wish i could explain myself. i do not have a lot time. please help
me more the algoritm

amdrit · May 11, 2009

Looks like you have three orders of splits:

First on all the "(...)"'s
then on all the "o"'s
finally on "a"'s.

Perhaps the best way to go would be to parse it with a BNF parser. Have a
look at Gold Parser. As long as your rules are solid and static, this
option is very fast and perhaps most complete.

I reckon, that regex would be the next best choice to go with (Create a
generic pattern match function that would return a list<string> of matches
for each level you are looking for).

Finally, brute force string parsing.

amdrit · May 11, 2009

Looks like you have three orders of splits:

First on all the "(...)"'s
then on all the "o"'s
finally on "a"'s.

Perhaps the best way to go would be to parse it with a BNF parser. Have a
look at Gold Parser. As long as your rules are solid and static, this
option is very fast and perhaps most complete.

I reckon, that regex would be the next best choice to go with (Create a
generic pattern match function that would return a list<string> of matches
for each level you are looking for).

Finally, brute force string parsing.

Paul · May 11, 2009

Use XML. But no seriously.

Looks to me like a syntax parser.

1. It appears you are parsing from left to right
2. A statement consists of () or ids seperated by chars a and o
3. () can contain statements

Use recursion to work through on the above rules or whatever rules you have.
Its impossible to help more without the BLL.

Paul · May 11, 2009

Use XML. But no seriously.

Looks to me like a syntax parser.

1. It appears you are parsing from left to right
2. A statement consists of () or ids seperated by chars a and o
3. () can contain statements

Use recursion to work through on the above rules or whatever rules you have.
Its impossible to help more without the BLL.

Ignacio Machin ( .NET/ C# MVP ) · May 11, 2009

Hi,

You could try to use a Regex for the several options, if not you will
have to construct your own parser you might want to see if there is a
parser implementation in C# before you begin, search for LEX C#

Ignacio Machin ( .NET/ C# MVP ) · May 11, 2009

Hi,

You could try to use a Regex for the several options, if not you will
have to construct your own parser you might want to see if there is a
parser implementation in C# before you begin, search for LEX C#

Jeff Johnson · May 11, 2009

Until you can define all the rules explicitly, there's no point in even
worrying about what parsing algorithm you're going to use.

Jeff Johnson · May 11, 2009

Until you can define all the rules explicitly, there's no point in even
worrying about what parsing algorithm you're going to use.

Pavel Minaev · May 11, 2009

Please do. From the examples you give, it is not at all clear how
strings are split. For example, sometimes you split on "a" and "o",
and sometimes you do not, and there's no clear pattern.

Aside from that, the usual way to parse something more complicated
that what regex can handle is to use a parser generator. ANTLR (http://
antlr.org) is powerful but may take time to figure out; COCO/R (http://
www.ssw.uni-linz.ac.at/coco/) is lightweight and easy to use, so long
as your grammar is parseable with its limitations (it probably is in
your case).

Pavel Minaev · May 11, 2009

Please do. From the examples you give, it is not at all clear how
strings are split. For example, sometimes you split on "a" and "o",
and sometimes you do not, and there's no clear pattern.

Aside from that, the usual way to parse something more complicated
that what regex can handle is to use a parser generator. ANTLR (http://
antlr.org) is powerful but may take time to figure out; COCO/R (http://
www.ssw.uni-linz.ac.at/coco/) is lightweight and easy to use, so long
as your grammar is parseable with its limitations (it probably is in
your case).

ocengiz0 · May 11, 2009

Thanks for all the answers. I looked at to your advises but could not
find a solution. You are rigth that i have to give more info about
parsing.
My rules:
coming string always starts with '(' and ends with ')'.
and the 'a' and 'o' chars between id's C1,C2,C3.. are my operands.
after i can parse and get the divided strings i will do an bool
operation like, (C1aC2) string1:C1 operand:a string2:C2 ,do '&'
operation - > C1 & C2 -> 1 & 1 = 1 . This point is not so important.
Lets say that all the operands are 'a'. Mainly the last & last string
will be (C1aC2). because coming string could not be simple. Most
complex could be ((C1aC2)aC3)a(C6a(C4aC5)).
What i want is, as you see if we compare the paranthesis,i have to
find the most middle operand and the strings between it -> string1:
((C1aC2)aC3) operand:a string2

C6a(C4aC5))
then recursively, i will divide new strings until they form the
simplest one like,
((C1aC2)aC3)
string1

C1aC2) operand:a string2:C3
(C6a(C4aC5))
string1:C6 operand:a string2

C4aC5)
((C1aC2)aC3)
then;
(C1aC2)
string1:C1 operand:a string2:C2
(C4aC5)
string1:C4 operand:a string2:C5
this is last point of my parser, i only need to divide string phase by
phase. thanks for reading

ocengiz0 · May 11, 2009

Thanks for all the answers. I looked at to your advises but could not
find a solution. You are rigth that i have to give more info about
parsing.
My rules:
coming string always starts with '(' and ends with ')'.
and the 'a' and 'o' chars between id's C1,C2,C3.. are my operands.
after i can parse and get the divided strings i will do an bool
operation like, (C1aC2) string1:C1 operand:a string2:C2 ,do '&'
operation - > C1 & C2 -> 1 & 1 = 1 . This point is not so important.
Lets say that all the operands are 'a'. Mainly the last & last string
will be (C1aC2). because coming string could not be simple. Most
complex could be ((C1aC2)aC3)a(C6a(C4aC5)).
What i want is, as you see if we compare the paranthesis,i have to
find the most middle operand and the strings between it -> string1:
((C1aC2)aC3) operand:a string2

C6a(C4aC5))
then recursively, i will divide new strings until they form the
simplest one like,
((C1aC2)aC3)
string1

C1aC2) operand:a string2:C3
(C6a(C4aC5))
string1:C6 operand:a string2

C4aC5)
((C1aC2)aC3)
then;
(C1aC2)
string1:C1 operand:a string2:C2
(C4aC5)
string1:C4 operand:a string2:C5
this is last point of my parser, i only need to divide string phase by
phase. thanks for reading

Pavel Minaev · May 12, 2009

Okay, so you do end up dividing them to the very end. And no spaces in
the input either, right?

I think that what you really want to get in the end here is a parse
tree, not just a bunch of strings (it will be easier to process it
later on). In your case, the only three kinds of nodes in the tree
would be "a", "o", and "C". Let's define them:

// Quick & dirty... in production code you'll probably want to make
those immutable, use properties over fields, and add constructors
abstract class Node { }
class ANode : Node { public Node Left; public Node Right; }
class ONode: Node { public Node Left; public Node Right; }
class CNode: Node { public int N; }

Now, your original example:

((C1aC2)aC3)o(C6a(C4aC5))

would be described with the following tree:

new ONode {
Left = new ANode {
Right = new ANode {
Left = new CNode { N = 1 },
Right = new CNode { N = 2 }
},
Left = new CNode { N = 3 }
},
Right = new ANode {
Left = new CNode { N = 6 },
Right = new ANode {
Left = new CNode { N = 4 },
Right = new CNode { N = 5 },
}
}
}

It probably easiest to get the tree with a simple hand-written
recursive descent parser operating directly on strings. Something
along these lines:

static Node Parse(string input)
{
// add a sentinel value for easy parsing.
input += '\0';
int index = 0;
var tree = ParseAOExpr(input, ref index);
// make sure input was fully parsed
if (input[index] != '\0')
{
throw new Exception("Expected end of input at " + index);
}
return tree;
}

// AOExpr is an expression of form "X", "XaY" or "XoY", where X
and Y are CExprs or groups.
static Node ParseAOExpr(string input, ref int index)
{
// parse X
var x = ParseCExprOrGroup(input, ref index);
// see if we have an operator there or not
char op = input[index];
switch (op)
{
case 'a':
case 'o':
{
// consume the operator
++index;
// parse Y;
var y = ParseCExprOrGroup(input, ref index);
if (op == 'a')
{
return new ANode { Left = x, Right = y };
}
else
{
return new ONode { Left = x, Right = y };
}
}

default:
return x;
}
}

// CExpr is an expression of form "Cddd", where ddd is a sequence
of digits of arbitrary length.
// Group is an expression of form "(X)", where X is an AExpr
static Node ParseCExprOrGroup(string input, ref int index)
{
switch (input[index])
{
case '(':
{
// consume opening brace
++index;
// recursively parse what follows the opening
brace
var x = ParseAOExpr(input, ref index);
// index should point at the closing brace now
if (input[index] != ')')
{
throw new Exception("Closing brace expected at
" + index);
}
// consume closing brace
++index;
return x;
}

case 'C':
{
// consume 'C'
++index;
// consume all digits following 'C'
string sn = "";
while (char.IsDigit(input[index]))
{
sn += input[index];
++index;
}
if (sn.Length == 0)
{
throw new Exception("At least one digit
expected at " + index);
}
return new CNode { N = int.Parse(sn) };
}
default:
throw new Exception("Expected 'C' or '(' at " +
index);
}
}

Again, this is quick & dirty - you'd probably want to use a distinct
exception type for error reporting, not hardcode the messages in, etc.

Pavel Minaev · May 12, 2009

Okay, so you do end up dividing them to the very end. And no spaces in
the input either, right?

I think that what you really want to get in the end here is a parse
tree, not just a bunch of strings (it will be easier to process it
later on). In your case, the only three kinds of nodes in the tree
would be "a", "o", and "C". Let's define them:

// Quick & dirty... in production code you'll probably want to make
those immutable, use properties over fields, and add constructors
abstract class Node { }
class ANode : Node { public Node Left; public Node Right; }
class ONode: Node { public Node Left; public Node Right; }
class CNode: Node { public int N; }

Now, your original example:

((C1aC2)aC3)o(C6a(C4aC5))

would be described with the following tree:

new ONode {
Left = new ANode {
Right = new ANode {
Left = new CNode { N = 1 },
Right = new CNode { N = 2 }
},
Left = new CNode { N = 3 }
},
Right = new ANode {
Left = new CNode { N = 6 },
Right = new ANode {
Left = new CNode { N = 4 },
Right = new CNode { N = 5 },
}
}
}

It probably easiest to get the tree with a simple hand-written
recursive descent parser operating directly on strings. Something
along these lines:

static Node Parse(string input)
{
// add a sentinel value for easy parsing.
input += '\0';
int index = 0;
var tree = ParseAOExpr(input, ref index);
// make sure input was fully parsed
if (input[index] != '\0')
{
throw new Exception("Expected end of input at " + index);
}
return tree;
}

// AOExpr is an expression of form "X", "XaY" or "XoY", where X
and Y are CExprs or groups.
static Node ParseAOExpr(string input, ref int index)
{
// parse X
var x = ParseCExprOrGroup(input, ref index);
// see if we have an operator there or not
char op = input[index];
switch (op)
{
case 'a':
case 'o':
{
// consume the operator
++index;
// parse Y;
var y = ParseCExprOrGroup(input, ref index);
if (op == 'a')
{
return new ANode { Left = x, Right = y };
}
else
{
return new ONode { Left = x, Right = y };
}
}

default:
return x;
}
}

// CExpr is an expression of form "Cddd", where ddd is a sequence
of digits of arbitrary length.
// Group is an expression of form "(X)", where X is an AExpr
static Node ParseCExprOrGroup(string input, ref int index)
{
switch (input[index])
{
case '(':
{
// consume opening brace
++index;
// recursively parse what follows the opening
brace
var x = ParseAOExpr(input, ref index);
// index should point at the closing brace now
if (input[index] != ')')
{
throw new Exception("Closing brace expected at
" + index);
}
// consume closing brace
++index;
return x;
}

case 'C':
{
// consume 'C'
++index;
// consume all digits following 'C'
string sn = "";
while (char.IsDigit(input[index]))
{
sn += input[index];
++index;
}
if (sn.Length == 0)
{
throw new Exception("At least one digit
expected at " + index);
}
return new CNode { N = int.Parse(sn) };
}
default:
throw new Exception("Expected 'C' or '(' at " +
index);
}
}

Again, this is quick & dirty - you'd probably want to use a distinct
exception type for error reporting, not hardcode the messages in, etc.

parsing problem

ocengiz0

amdrit

amdrit

Paul

Paul

Ignacio Machin ( .NET/ C# MVP )

Ignacio Machin ( .NET/ C# MVP )

Jeff Johnson

Jeff Johnson

Pavel Minaev

Pavel Minaev

ocengiz0

ocengiz0

Pavel Minaev

Pavel Minaev