Parsing C#

colin · Oct 17, 2009

Hi,

I'm trying to parse C# source files, and I find the grammar is ambiguous for the following case:

(a) - 1

Depending on what a is, i.e. whether it is a type or a value, this could be either of:

(type) unary_expression
(expression) - expression

In particular, if a is a static field or a member type, then a might not be defined at that point in the file, or may be in another file.
Is this what distinguishes context-free grammars from other grammars?
Presumably the file(s) might have to be parsed in two passes in this case,
somehow skipping this code on the first pass. Is this how it's normally done, or am I missing something?

There are also a few other cases, but I've managed to fudge those OK so far.

Colin =^.^=

Tom Dacon · Oct 17, 2009

colin said:
Hi,

I'm trying to parse C# source files, and I find the grammar is ambiguous
for the following case:

(a) - 1

Depending on what a is, i.e. whether it is a type or a value, this could be
either of:

(type) unary_expression
(expression) - expression

This is not actually ambiguous. If you intend for it to be interpreted as a
cast, the negative value needs to be enclosed in parentheses:
(System.Int32) (-1)
so (a) - 1 unambiguously matches (expression) - expression

Tom Dacon

colin · Oct 17, 2009

Tom Dacon said:
This is not actually ambiguous. If you intend for it to be interpreted as a cast, the negative value needs to be enclosed in
parentheses:
(System.Int32) (-1)
so (a) - 1 unambiguously matches (expression) - expression

Thanks, but I think it still is ambiguous, as
(int) -1
is valid because -1 is a unary-expression and so does not need parentheses.
It also compiles OK.

I think the same confusion also arises with the & and * operators (address_of and contents_of).

Of course it's simply obvious to us humans, but...

Colin =^.^=

Tom Spink · Oct 17, 2009

colin said:
Hi,

I'm trying to parse C# source files, and I find the grammar is ambiguous for the following case:

(a) - 1

Depending on what a is, i.e. whether it is a type or a value, this could be either of:

(type) unary_expression
(expression) - expression

Please explain why you think that is ambiguous.

In particular, if a is a static field or a member type, then a might not be defined at that point in the file, or may be in another file.
Is this what distinguishes context-free grammars from other grammars?
Presumably the file(s) might have to be parsed in two passes in this case,
somehow skipping this code on the first pass. Is this how it's normally done, or am I missing something?

This isn't anything to do with parsing. What you're describing here is
semantic analysis, i.e. the validation that the parsed input (whilst it is
grammatically correct) makes sense in the context of the overall
compile. Semantic analysis is just a further step in the process. It
doesn't have to re-visit the code, it works on the parse tree generated
by the parsing phase.
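
A minimal sketch of that separation, in Python for brevity. It assumes the parser has already produced a tree for (a) - 1 as a subtraction; the node classes, the `check` function and the symbol table layout are all hypothetical names made up for this illustration, not anything from a real compiler. The semantic pass only walks the finished tree and consults a symbol table, which is where an error like "'a' is a 'type' but is used like a 'variable'" would come from.

```python
from dataclasses import dataclass


@dataclass
class Name:
    ident: str


@dataclass
class Num:
    value: int


@dataclass
class Sub:
    left: object
    right: object


def check(node, symbols):
    """Walk an already-built tree and return a list of semantic errors.

    `symbols` is a hypothetical table mapping identifiers to "type" or
    "variable"; it could be filled from other files before this pass runs.
    """
    if isinstance(node, Name):
        kind = symbols.get(node.ident)
        if kind is None:
            return [f"'{node.ident}' is not defined"]
        if kind == "type":
            return [f"'{node.ident}' is a 'type' but is used like a 'variable'"]
        return []
    if isinstance(node, Sub):
        return check(node.left, symbols) + check(node.right, symbols)
    return []  # literals need no checking in this sketch


# The tree a parser might emit for (a) - 1, built by hand here.
tree = Sub(Name("a"), Num(1))
errors_when_type = check(tree, {"a": "type"})
errors_when_var = check(tree, {"a": "variable"})
```

The source text is never revisited: the same tree is checked against whatever symbols are known once all files have been parsed.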

colin · Oct 17, 2009

Tom Spink said:
Please explain why you think that is ambiguous.

Well, it doesn't reduce to one unique set of tokens: there are two possible sets of tokens that describe it,
as above, one for if a is a type, another for if a is data.

This isn't anything to do with parsing. What you're describing here is
semantic analysis, i.e. the validation that the parsed input (whilst it is
grammatically correct) makes sense in the context of the overall
compile. Semantic analysis is just a further step in the process. It
doesn't have to re-visit the code, it works on the parse tree generated
by the parsing phase.

Thanks. I don't know in detail the actual distinction between the steps;
it's just a parser based on a decision-tree generator that I knocked up quickly many years ago.

However, I am familiar with the output of the parser being in the form of a tree that resembles the
tokens in the grammar definition.

In this case, though, the output tree would consist of different tokens depending on whether a is a type or not.

1: ( type ) unary_expression =reduces to=> cast_expression
2: ( expression ) - multiplicative_expression =reduces to=>
multiplicative_expression additive_operators multiplicative_expression =reduces to=>
additive_expression

One alternative would be to output less detailed information, which does not need to know if a is a type,
and fill in more when the symbols are known. That is, output data from the parser which does not distinguish between

cast_expression: ( type ) unary_expression
and expressions involving parenthesised expressions.

However, this makes it tricky when the expression continues further.

The other alternative is to include two alternative paths in the output tree.
I had considered this might be necessary because of something like this.

Colin =^.^=

Tom Dacon · Oct 17, 2009

colin said:
Thanks, but I think it still is ambiguous, as
(int) -1
is valid because -1 is a unary-expression and so does not need parentheses.
It also compiles OK.

Well, this is unusual:

this doesn't compile for me on VS2008 SP1:

Int32 i = (Int32) - 1;

but produced the following two errors:

Error 1 To cast a negative value, you must enclose the value in parentheses
Error 2 'int' is a 'type' but is used like a 'variable'

but curiously,

int i = (int)-1;

compiled without error.

Two things that seem odd about that: error 2 complains about type 'int'
instead of Int32, and 'int' as I understand it is really just a synonym for
(on a 32-bit machine, at least) Int32.

Tom Dacon

colin · Oct 17, 2009

Tom Dacon said:
Well, this is unusual:

this doesn't compile for me on VS2008 SP1:

Int32 i = (Int32) - 1;

but produced the following two errors:

Error 1 To cast a negative value, you must enclose the value in parentheses
Error 2 'int' is a 'type' but is used like a 'variable'

but curiously,

int i = (int)-1;

compiled without error.

Two things that seem odd about that: error 2 complains about type 'int' instead of Int32, and 'int' as I understand it is really
just a synonym for (on a 32-bit machine, at least) Int32.

oh my, yes that is odd indeed ! :s

int i=(int)-1;
int x=(i)-1;
compiled OK with Visual C# 2008 Express (Win32), but trying with Int32 instead I also get the same difference in results!

That behaviour is not in the grammar specification I am looking at.

I have managed to fudge things enough so that at least my parser
now accepts the code (a)-1, defaulting to a cast if it's unsure,
as one is unlikely to put parentheses around a single variable.
However, there are a number of other places where I've had to make assumptions about the most
likely path.

Aha, a little bit of fiddling around reveals that a predefined type such as int works,
but Int32 does not: it is otherwise the same type, it's just not a keyword as such.

I tried defining a struct called a, and the compiler then insists I put brackets around the same part as above:

struct a
{
}
unsafe void func(a* p)
{
    a x = (a)*p;   // <- complains if no brackets.
    a y = (a)(*p); // is OK
}

Error 'a' is a 'type' but is used like a 'variable'

thanks
Colin =^.^=

Tom Spink · Oct 17, 2009

colin said:
Well, it doesn't reduce to one unique set of tokens: there are two possible sets of tokens that describe it,
as above, one for if a is a type, another for if a is data.

A reduction isn't a rule reducing to a set of tokens, it is a set of
tokens reducing to a production.

Thanks. I don't know in detail the actual distinction between the steps;
it's just a parser based on a decision-tree generator that I knocked up quickly many years ago.
However, I am familiar with the output of the parser being in the form of a tree that resembles the
tokens in the grammar definition.

In this case, though, the output tree would consist of different tokens depending on whether a is a type or not.
1: ( type ) unary_expression =reduces to=> cast_expression
Yes.

2: ( expression ) - multiplicative_expression =reduces to=>
multiplicative_expression additive_operators multiplicative_expression =reduces to=>
additive_expression

No. A reduction takes a set of input tokens and matches it to a
production (i.e. a rule). It does not generate more rules.

Without seeing more of your grammar definition, I can't describe to you
how the parser should match the tokens.

One alternative would be to output less detailed information, which does not need to know if a is a type,
and fill in more when the symbols are known. That is, output data from the parser which does not distinguish between

cast_expression: ( type ) unary_expression
and expressions involving parenthesised expressions.

However, this makes it tricky when the expression continues further.

The other alternative is to include two alternative paths in the output tree.
I had considered this might be necessary because of something like this.

The C# grammar is not ambiguous. I fear your grammar definitions may be
incorrect, or your parser implementation may be flawed slightly.

If you could provide some more examples and more grammar definitions,
perhaps we can work this out.

Tom Spink · Oct 17, 2009

Tom said:
Well, this is unusual:

this doesn't compile for me on VS2008 SP1:

Int32 i = (Int32) - 1;

That's not unusual - that's clearly incorrect.

"- 1" is != "-1"

Having a space after the negative sign is wrong if you are trying to
represent a negative number.

but produced the following two errors:

Error 1 To cast a negative value, you must enclose the value in parentheses
Error 2 'int' is a 'type' but is used like a 'variable'

Right, because the parser has matched that statement to an expression,
and the semantic analyser is reporting that Int32 isn't a variable, it's
a type.

but curiously,

int i = (int)-1;

Why's that curious, it seems perfectly valid to me.

compiled without error.

Two things that seem odd about that: error 2 complains about type 'int'
instead of Int32, and 'int' as I understand it is really just a synonym for
(on a 32-bit machine, at least) Int32.

'int' is a synonym for Int32, on any machine.

(IntPtr is 32-bit on a 32-bit machine and 64-bit on a 64-bit machine)

Family Tree Mike · Oct 17, 2009

Tom said:
That's not unusual - that's clearly incorrect.

"- 1" is != "-1"

Having a space after the negative sign is wrong if you are trying to
represent a negative number.

But the following compiles and displays -1, as expected.

static void Main(string[] args)
{
int i = - 1;

Console.WriteLine(i);
Console.ReadKey();
}

colin · Oct 17, 2009

Tom Spink said:
A reduction isn't a rule reducing to a set of tokens, it is a set of
tokens reducing to a production.

No. A reduction takes a set of input tokens and matches it to a
production (i.e. a rule). It does not generate more rules.

Without seeing more of your grammar definition, I can't describe to you
how the parser should match the tokens.

The C# grammar is not ambiguous. I fear your grammar definitions may be
incorrect, or your parser implementation may be flawed slightly.

If you could provide some more examples and more grammar definitions,
perhaps we can work this out.

Hi,
I've taken the grammar definitions straight from the ECMA-334 C# language specification,
as closely as possible. Even the 'opt' at the end of the tokens is being interpreted.
I assume this is a correct definition?

Let's say the grammar is expressed as rules, which are expressed as lines of tokens, which
are either terminal tokens or tokens that are defined further in other rules.

My code produces a (rather large) decision tree which, when applied to the input file, reduces the
file tokens to a rule, and then if that rule is a token of another rule it reduces again.
I'm not sure if a production, as you call it, is the whole collection of tokens,
but it ends up with a tree as you said. Sorry if my depiction of a tree above was not very clear.
The lexical stage seems quite OK.

The following rules are straight from the spec I mentioned.
~~~~~~~~
1)
unary-expression:
primary-expression
+ unary-expression
- unary-expression
! unary-expression
~ unary-expression
pre-increment-expression
pre-decrement-expression
cast-expression

2)
primary-expression:
array-creation-expression
primary-no-array-creation-expression

3)
primary-no-array-creation-expression:
literal
simple-name
parenthesized-expression
......

4)
parenthesized-expression:
( expression )

5)
expression:
conditional-expression

6)
conditional-expression:
.... several rules nested below this we get to 7 below...

7)
additive-expression:
multiplicative-expression
additive-expression + multiplicative-expression
additive-expression - multiplicative-expression

multiplicative-expression:
unary-expression

8)
cast-expression:
( type ) unary-expression

~~~~~~~~

In the following, we can see that if a is a type then it can be matched to rule 8 quite exactly:
(a) -1

as - 1 is a valid unary-expression (with or without spaces or parentheses).

If a is not a type, then we can see that (a) matches rule 4, which is eventually reduced to a unary-expression;
this then matches the subtraction alternative of the additive-expression in rule 7.

Some of the rules are very hard to follow by hand indeed. I haven't included the rule for type because,
although it seems quite simple for the case of (a), it is in fact quite long.
I can't find any clues in the text of the spec, although I'll be having another read.

So far it does accept about 100k lines of code, i.e. its own code.

The behaviour of VS2008 seems to be that -1 needs to be in parentheses if a is a type,
unless the type is a predefined type such as int or sbyte etc., but Int32 is not considered a predefined type.

In some other places I've fudged it to do what VS2008 does, which is the main criterion for now.
However, there is not an easy fudge here.

The uses it will be put to will vary from expanding some very heavily used generic code,
to analysing code for side effects and potential run-time errors,
and hopefully some interesting stuff such as conversion to shader code/MMX etc.,
ultimately with some non-procedural aspect (hence the side-effect analysis).
thanks
Colin =^.^=

Peter Duniho · Oct 18, 2009

Family said:
Tom said:

That's not unusual - that's clearly incorrect.

"- 1" is != "-1"

Having a space after the negative sign is wrong if you are trying to
represent a negative number.

But the following compiles and displays -1, as expected.

static void Main(string[] args)
{
int i = - 1;

Console.WriteLine(i);
Console.ReadKey();
}

Right. It's important to note here that the whitespace isn't relevant.
You can write "int i = (int) - 1;" and that will still successfully
compile, and you can also write "Int32 i = (Int32)-1;" and that will
still fail to compile.

Don't get distracted by the whitespace. It's not important to the question.

I agree that at first blush, this _looks_ like a compiler bug. However,
IMHO the C# specification is clear about what's supposed to happen here
and why. In particular, from "7.6.6 Cast expressions":

The grammar for a cast-expression leads to certain
syntactic ambiguities. For example, the expression
(x)-y could either be interpreted as a cast-expression
(a cast of -y to type x) or as an additive-expression
combined with a parenthesized-expression (which computes
the value x - y).

To resolve cast-expression ambiguities, the following
rule exists: A sequence of one or more tokens (§2.3.3)
enclosed in parentheses is considered the start of a
cast-expression only if at least one of the following
are true:

• The sequence of tokens is correct grammar for a type,
but not for an expression.

• The sequence of tokens is correct grammar for a type,
and the token immediately following the closing parentheses
is the token “~”, the token “!”, the token “(”, an identifier
(§2.4.1), a literal (§2.4.4), or any keyword (§2.4.3) except
as and is.

The term “correct grammar” above means only that the sequence
of tokens must conform to the particular grammatical production.
It specifically does not consider the actual meaning of any
constituent identifiers. For example, if x and y are
identifiers, then x.y is correct grammar for a type, even
if x.y doesn’t actually denote a type.

From the disambiguation rule it follows that, if x and y are
identifiers, (x)y, (x)(y), and (x)(-y) are cast expressions,
but (x)-y is not, even if x identifies a type. However, if x
is a keyword that identifies a predefined type (such as int),
then all four forms are cast-expressions (because such a
keyword could not possibly be an expression by itself).

Note in particular the last two paragraphs. The first clarifies that
when applying this rule, _what_ an identifier represents isn't relevant;
only the syntactical construction matters.

The second clarifies that the grammar analysis considers built-in
keywords differently from identifiers. In particular, while "int" is an
alias for "System.Int32", it is _also_ a language-defined keyword, while
System.Int32 is not.

Hence the different treatment: when you write "Int32", because the rule
doesn't care what that identifier is but rather simply that it is an
identifier and not a reserved word, that _could_ be "correct grammar for
an expression", causing neither of the two conditions in the rule to be
true. But when you write "int", since that's a reserved keyword in the
language, it's impossible for it to be correct grammar for an
expression, and thus the first condition applies, allowing for the cast
of the unary expression.

IMHO, the lesson here is that if one is going to try to parse C#, one
needs to read the entire specification and be very careful to apply the
specification exactly as it's written. It's not sufficient to just look
at the grammar as described in the specification, and one definitely
can't rely on a "common-sense" description of the grammar.

For what it's worth, .NET already provides classes for interpreting C#,
for the purpose of compiling it. I don't recall if the expression tree
support includes feeding it plain C#, but there might even be a way to
accomplish that, depending on your specific goals. It might be that it
makes more sense to use the built-in support in .NET, rather than trying
to reinvent that particular wheel.

Hope that helps.

Pete
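
For anyone implementing this, here is a minimal sketch of the quoted disambiguation rule, in Python for brevity. Everything in it is an assumption made for illustration: tokens are hypothetical (kind, text) pairs, and "correct grammar for a type" and "correct grammar for an expression" are reduced to crude stand-ins (a predefined-type keyword or a dotted name for types; a dotted name or a single literal for expressions). A real parser would test the full productions instead.

```python
# Keywords that name a predefined type in C#.
PREDEFINED_TYPES = {
    "bool", "byte", "char", "decimal", "double", "float", "int", "long",
    "object", "sbyte", "short", "string", "uint", "ulong", "ushort",
}


def is_dotted_name(tokens):
    """True for identifier ('.' identifier)*, e.g. x or System.Int32."""
    if len(tokens) % 2 == 0:
        return False
    for i, (kind, text) in enumerate(tokens):
        if i % 2 == 0 and kind != "identifier":
            return False
        if i % 2 == 1 and text != ".":
            return False
    return True


def is_type_grammar(tokens):
    """Simplified stand-in for 'correct grammar for a type'."""
    if len(tokens) == 1:
        kind, text = tokens[0]
        if kind == "keyword" and text in PREDEFINED_TYPES:
            return True
    return is_dotted_name(tokens)


def is_expression_grammar(tokens):
    """Simplified stand-in for 'correct grammar for an expression'."""
    if len(tokens) == 1 and tokens[0][0] == "literal":
        return True
    return is_dotted_name(tokens)


def starts_cast(inner, next_token):
    """Apply the two bullet points to the tokens inside the parentheses."""
    if not is_type_grammar(inner):
        return False
    if not is_expression_grammar(inner):
        return True  # first bullet: a type, but not an expression
    kind, text = next_token
    if text in ("~", "!", "("):
        return True  # second bullet: specific following tokens
    if kind in ("identifier", "literal"):
        return True
    return kind == "keyword" and text not in ("as", "is")


MINUS = ("punct", "-")
int_cast = starts_cast([("keyword", "int")], MINUS)          # (int)-1
int32_cast = starts_cast([("identifier", "Int32")], MINUS)   # (Int32)-1
```

Note that no symbol table is consulted anywhere: the decision depends only on the shape of the tokens, which is why `int` and `Int32` come out differently here.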

Peter Duniho · Oct 18, 2009

colin said:
I've taken the grammar definitions straight from the ECMA-334 C# language specification,
as closely as possible. Even the 'opt' at the end of the tokens is being interpreted.
I assume this is a correct definition?

Please see my other reply. The short version: the grammar definitions
aren't the sole piece of information you need.

[...]
The behaviour of VS2008 seems to be that -1 needs to be in parentheses if a is a type,
unless the type is a predefined type such as int or sbyte etc., but Int32 is not considered a predefined type.

More specifically, it's that "int" is a _reserved word_, while "Int32"
is not. Both are predefined, built-in types. But for the purposes of
the lexical analysis, that's not important. _What_ an identifier refers
to is ignored; only its characteristics lexically are considered.

In some other places I've fudged it to do what VS2008 does, which is the main criterion for now.
However, there is not an easy fudge here. [...]

IMHO, if you feel you really must duplicate the C# compiler's features,
you need to implement them the same way the C# compiler team did: follow
the C# specification exactly. It's not a "fudge" to implement the rules
as described in the specification. It just means you need to implement
_all_ the rules.

Note also that if you do that, you will successfully deal with _all_ of
the possible scenarios the rules are intended to address. That way,
rather than special-casing some particular operator or syntax, you can
implement the rules such that they work no matter what the exact text
you're parsing is.

Pete

colin · Oct 18, 2009

Peter Duniho said:
[...]

Yay, thanks very much for that, Pete.

I never considered it a bug, as I know how complicated this is
and that there are bound to be fine details.
I read the spec several times looking for such info, but somehow I missed the
part you quoted :s

It seems to make sense and fits with what I've found out;
I just have to figure out the best way to specify it.

I also have to search for other such disambiguation rules, if there are any.
A quick search didn't show any others.

The other fudges are more to do with minimising lookahead problems.

In particular, fixed pointer initialization, where for
void * p = &x+1;
it looks like there are two paths through the grammar rules,
which defeats my lookahead. From the rules it's unclear whether the address is taken
before or after the addition; obviously taking it after is meaningless.

fixed-pointer-initializer:
& variable-reference
expression

variable-reference:
expression

addressof-expression:
& unary-expression

This is because variable-reference is also an expression, and expression itself also contains an address_of rule.
The spec states that a variable-reference is an expression that is classified as a variable.
I probably have some more research to do on that one.

A parenthesised expression being used as a function also seems to be accepted by the rules,
but is not by the compiler, so I assume there is another disambiguation here,
as this situation too defeats my lookahead.

But those are the only other two so far that defeat the lookahead.

Obviously the semantic rules would be quite another matter.
I'm not sure which is the hardest part of doing all this,
but I hope I only need to do a small part of the latter.

It's interesting how you get to know the real nitty-gritty of a language when you
try to parse it.

I have used the .NET runtime compilation already, although not the parser.
My parser is a general-purpose parser rather than just for C#;
I hope some of the tools I write can be useful for more than one language.
I initially started this project before C++ was well known, but dropped it once
I moved to C++.

Colin =^.^=

Peter Duniho · Oct 18, 2009

colin said:
[...]
The other fudges are more to do with minimising lookahead problems.

In particular, fixed pointer initialization, where for
void * p = &x+1;
it looks like there are two paths through the grammar rules,
which defeats my lookahead. From the rules it's unclear whether the address is taken
before or after the addition; obviously taking it after is meaningless.

fixed-pointer-initializer:
& variable-reference
expression

variable-reference:
expression

addressof-expression:
& unary-expression

This is because variable-reference is also an expression, and expression itself also contains an address_of rule.
The spec states that a variable-reference is an expression that is classified as a variable.

Right. Specifically, it's not that any "expression" can be used as a
"variable-reference". Rather, the grammar element "expression" is
classified into a number of possible categories, one of which is "a
variable", which has very specific rules that exclude something like "x+1".

I would expect that by the time you get to parsing "&x+1", since "x+1"
is not a viable candidate as a "variable-reference" (it's an expression,
but not one classified as a variable), that would exclude that as a
possible parse result for the text. Rather, you'd see the "&" followed
by an expression classified as a variable (i.e. "x"), and that
automatically would fulfill the "&"-followed-by-variable-reference rule.

A similar approach applies to the "addressof-expression". While the
grammar says "unary-expression", the specification clearly states that
the "unary-expression" E must be an expression classified as a variable.

So, in spite of the apparent ambiguity in the grammar description
itself, the specification does resolve that in an unambiguous way.
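
As a rough illustration of why "&x+1" ends up with a single reading, here is a minimal recursive-descent sketch in Python. It is not the C# grammar: it assumes a tiny hypothetical grammar with only "&", "+", "-", names and numbers, given as a pre-split list of token strings. Because the "&" operand is parsed at the unary level, "x+1" is never a candidate operand, and the address-of node always wraps just "x".

```python
def parse(tokens):
    """Parse a flat list of token strings into a nested tuple tree."""
    pos = 0

    def peek():
        return tokens[pos] if pos < len(tokens) else None

    def take():
        nonlocal pos
        tok = tokens[pos]
        pos += 1
        return tok

    def unary():
        # '&' applies to the next unary-level operand only.
        if peek() == "&":
            take()
            return ("addressof", unary())
        return take()  # a name or a number

    def additive():
        left = unary()
        while peek() in ("+", "-"):
            op = take()
            left = (op, left, unary())
        return left

    tree = additive()
    if peek() is not None:
        raise SyntaxError(f"unexpected token {peek()!r}")
    return tree


tree = parse(["&", "x", "+", "1"])
```

A later semantic check could then verify that the operand of each "addressof" node is classified as a variable, which is a separate step from building the tree.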

I probably have some more research to do on that one.

I think it's just another example of how the structured grammar
description isn't sufficient. It's important to read and implement the
other rules described in the specification.

Whether that avoids a "lookahead problem" for your particular parsing
code, I don't know. To the extent that "&" can only ever be followed by
an expression classified as a variable, I don't see that there's any
need to look ahead. But maybe I just don't understand what you mean by
"lookahead problem".

There's nothing that could follow the "expression classified as a
variable" that would change how you're allowed to interpret the "&"
followed by an "expression classified as a variable", and if it's not
followed by an "expression classified as a variable", it's not valid.

A parenthesised expression being used as a function also seems to be accepted by the rules,
but is not by the compiler, so I assume there is another disambiguation here,
as this situation too defeats my lookahead.

Example of "parenthesized expression being used as a function" being?

I suspect that, with the expression classification, the fact that the
expression needs to be classified as a method group for trailing parens
to be valid would allow the parsing to exclude other possibilities. But
I'm a little unclear as to what exactly you're describing, so that might
be inapplicable or I might have the specifics wrong.

But those are the only other two so far that defeat the lookahead.

Obviously the semantic rules would be quite another matter.
I'm not sure which is the hardest part of doing all this,
but I hope I only need to do a small part of the latter.

It's interesting how you get to know the real nitty-gritty of a language when you
try to parse it.

Well, yes.

There's not really any room for error or simplification
if your parsing needs to result in something that actually represents
the structure of the program. (I've seen syntax coloring implementations
that don't need that kind of detail, but then they have corner cases
where they color the text incorrectly, so there you go.) Your
understanding of the structure of C# probably already exceeds that of
99.9% of the people who use it, and by the time you're all done, you'll
be parsing C# in your sleep.

Pete

colin · Oct 18, 2009

Peter Duniho said:
colin said:

[...]

I would expect that by the time you get to parsing "&x+1", since "x+1" is not a viable candidate as a "variable-reference" (it's
an expression, but not one classified as a variable), that would exclude that as a possible parse result for the text. Rather,
you'd see the "&" followed by an expression classified as a variable (i.e. "x"), and that automatically would fulfill the
"&"-followed-by-variable-reference rule.

A similar approach applies to the "addressof-expression". While the grammar says "unary-expression", the specification clearly
states that the "unary-expression" E must be an expression classified as a variable.

Ah yes, I think that makes it a bit clearer,
thanks. So &x+1 would then have only one path,
although &x might still appear to have two valid paths, as x is a
variable and &x is also a unary-expression.
To me the & variable-reference alternative appears superfluous, so removing it is my fudge for now.
I don't expect to need to interpret unsafe code anyway,
as I intend to implement an alternative.

I was originally expecting to find some rule stating whether the rule on the left or the one on the right
gets expanded as far as possible, or perhaps the rule closest to the terminal tokens.

I think it's just another example of how the structured grammar description isn't sufficient. It's important to read and
implement the other rules described in the specification.

Yeah, I knew this was just a limitation of the rules as they can be defined in italics.
There is a section on such limitations called grammar ambiguities,
which mentions the
F(G<A, B>(7));

problem, but I'm familiar with this from C++.
Some of the more obscure cases are a bit harder to spot, though.
It's not easy to scan 500 pages looking for something when you don't know
quite what it is. I've read some parts of it so often
that after a while my brain goes numb.

I've managed to do the lexical part with no fudging, thankfully.

The nature of these problems is interesting.

Whether that avoids a "lookahead problem" for your particular parsing code, I don't know. To the extent that "&" can only ever be
followed by an expression classified as a variable, I don't see that there's any need to look ahead. But maybe I just don't
understand what you mean by "lookahead problem".

lookahead in my case is computer generated in the decision tree:
where two possible paths exist in the concise grammar rules for the first token or two,
usually only one path will still be valid after further tokens are read.
for example "private" exists in many rules, and you have to look a considerable way
ahead if, for example, the return type of a method is a complicated generic expression.
until a certain point you don't know if it is, say, a method-modifier or a property-modifier.

so once the ends of two possible rules are found, more tokens are read
before they can be reduced to the correct one.
if the two paths always remain valid the lookahead would
try to keep going forever. obviously I have to detect when this happens,
and this is what points me to these cases.
it's only recently that it's not been my code that's the problem.
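
The idea of reading on until only one candidate rule survives, and flagging the cases where none or several survive, can be sketched roughly as follows. This is a heavily simplified illustration, not the generated decision tree described above: rules are flattened to plain token sequences, and LookaheadSketch, Resolve and maxLookahead are hypothetical names.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical sketch: each rule is a flat list of expected token kinds.
static class LookaheadSketch
{
    // Returns the name of the only rule still matching, or null if no rule
    // matches or several rules are still alive when the tokens (or the
    // lookahead budget) run out.
    public static string Resolve(
        IDictionary<string, string[]> rules, IList<string> tokens, int maxLookahead)
    {
        var alive = new List<string>(rules.Keys);
        for (int i = 0; i < tokens.Count && i < maxLookahead; i++)
        {
            int pos = i; // local copy for the lambda below
            // Drop every candidate whose expected token at this position differs.
            alive = alive
                .Where(name => pos < rules[name].Length && rules[name][pos] == tokens[pos])
                .ToList();
            if (alive.Count == 1) return alive[0];
            if (alive.Count == 0) return null;
        }
        return null; // still ambiguous: the case that points at a grammar problem
    }
}
```

With a method rule of private type identifier ( and a property rule of private type identifier {, the tokens private type identifier { resolve to the property rule on the fourth token. Capping maxLookahead at 3 leaves both rules alive, so the call returns null, which is the situation that has to be detected rather than looped on.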

Example of "parenthesized expression being used as a function" being?

I suspect that, with the expression classification, the fact that the expression needs to be classified as a method group for
trailing parens to be valid would allow the parsing to exclude other possibilities. But I'm a little unclear as to what exactly
you're describing, so that might be inapplicable or I might have the specifics wrong.

well, this causes me problems too,,

(f)(1);

where f could resolve to a function (or again our friend the typecast).
however vs2008 disallows parentheses around the f.
I've yet to find this in the text, though I suspect it's there somewhere;
I've just scanned through it again. however I have disallowed this too,
as I can't imagine anyone would want to do this, so it now all seems to accept
the same as vs2008 does, which for now is a reasonable target.
ofc this is separate from the case of

((IList<int>)l).Add(0);

where the parentheses only cover the instance, not the member name as well.
my parser accepts this ok.
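
For what it's worth, the cast-expressions section of the spec, as I read it, has a purely syntactic rule for when a parenthesized token sequence starts a cast: either the contents are valid grammar for a type but not for an expression, or they are valid grammar for a type and the next token is ~, !, (, an identifier, a literal, or a keyword other than as and is. A minimal sketch of that rule follows. The enum and names are hypothetical, and the two booleans stand in for trial parses that a real parser would have to perform.

```csharp
using System;

// Hypothetical classification of the token that follows the closing ')'.
enum NextKind { Tilde, Bang, OpenParen, Identifier, Literal, KeywordAsOrIs, OtherKeyword, Other }

static class CastDisambiguation
{
    // parsesAsType / parsesAsExpression: whether the tokens inside the
    // parentheses are valid grammar for a type / for an expression,
    // ignoring what the identifiers actually mean.
    public static bool IsCastStart(bool parsesAsType, bool parsesAsExpression, NextKind next)
    {
        if (!parsesAsType) return false;
        if (!parsesAsExpression) return true; // e.g. (int): a keyword type is never an expression
        switch (next)
        {
            case NextKind.Tilde:
            case NextKind.Bang:
            case NextKind.OpenParen:
            case NextKind.Identifier:
            case NextKind.Literal:
            case NextKind.OtherKeyword:
                return true;
            default:
                return false; // includes '-', 'as' and 'is'
        }
    }
}
```

Under this reading, (a) - 1 is not a cast because - is not in the list, (int) -1 is a cast because int cannot be an expression, and (f)(1) is treated as a cast because ( follows, which would fit vs2008 rejecting it when f is not a type.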

... Your understanding of the structure of C# probably already exceeds that of 99.9% of the people who use it, and by the time
you're all done, you'll be parsing C# in your sleep.

Pete

I would say your understanding is closer yet to 100%

yeah, I sometimes wonder if I'm the only one who has recurring nightmares
about being stuck inside an endless for loop,,,
this is my nightmare atm :-

int index = (int)(unchecked((uint)destinationIndex[j].hashCode % (uint)prime));

the line of code is fine, and the grammar rules are fine in the decision tree
too, but my code which traverses it seems to drop a parenthesis somehow.
the debug info runs to so many pages.

Colin =^.^=

Peter Duniho · Oct 18, 2009

colin said:
[...]
well, this causes me problems too,,

(f)(1);

where f could resolve to a function (or again our friend the typecast).
however vs2008 disallows parentheses around the f.
I've yet to find this in the text, though I suspect it's there somewhere

I believe what you're looking for is here:

7.5.3 Parenthesized expressions

A parenthesized-expression consists of an expression
enclosed in parentheses.

parenthesized-expression:
( expression )

A parenthesized-expression is evaluated by evaluating
the expression within the parentheses. If the expression
within the parentheses denotes a namespace, type, or
method group, a compile-time error occurs.

Note the last sentence. So, an invocation-expression, which is
"primary-expression ( argument-list[opt] )", will specifically exclude
the possibility of the "primary-expression" taking the form of "(
expression )", because putting a method group name inside parentheses is
not allowed.
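
A minimal sketch of how that rule can be applied as a check after parsing, once the inner expression has been classified. ExprClass and ParenthesizedCheck are hypothetical names for illustration, and the exception simply stands in for reporting a compile-time error.

```csharp
using System;

// Classifications named in the quoted spec text, plus the two ordinary cases.
// Hypothetical types for illustration only.
enum ExprClass { Value, Variable, Namespace, Type, MethodGroup }

static class ParenthesizedCheck
{
    // Returns the classification of "( inner )", or throws where the quoted
    // text calls for a compile-time error.
    public static ExprClass Classify(ExprClass inner)
    {
        switch (inner)
        {
            case ExprClass.Namespace:
            case ExprClass.Type:
            case ExprClass.MethodGroup:
                throw new InvalidOperationException(
                    "compile-time error: parenthesized " + inner);
            default:
                return inner; // parentheses leave a value or variable as it was
        }
    }
}
```

The point of doing it this way is that the grammar can stay permissive about ( expression ), with the rejection happening in a later pass that knows what the identifier denotes.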

[...]
this is my nightmare atm :-

int index = (int)(unchecked((uint)destinationIndex[j].hashCode % (uint)prime));

the line of code is fine, and the grammar rules are fine in the decision tree
too, but my code which traverses it seems to drop a parenthesis somehow.

Sounds like an off-by-one error somewhere. Not that that necessarily
makes it easier to find the bug.

Pete

colin · Oct 19, 2009

Peter Duniho said:
colin said:

[...]
well, this causes me problems too,,

(f)(1);

where f could resolve to a function (or again our friend the typecast).
however vs2008 disallows parentheses around the f.
I've yet to find this in the text, though I suspect it's there somewhere

I believe what you're looking for is here:

7.5.3 Parenthesized expressions

A parenthesized-expression consists of an expression
enclosed in parentheses.

parenthesized-expression:
( expression )

A parenthesized-expression is evaluated by evaluating
the expression within the parentheses. If the expression
within the parentheses denotes a namespace, type, or
method group, a compile-time error occurs.

Note the last sentence. So, an invocation-expression, which is "primary-expression ( argument-list[opt] )", will specifically
exclude the possibility of the "primary-expression" taking the form of "( expression )", because putting a method group name
inside parentheses is not allowed.

ah, that's cool, that is what I was expecting to find.
it seems the 3.0 spec has a bit more in it than the one I was using
in this respect.

I seem to be reorganising the grammar rules quite a bit to avoid
multiple paths which, although benign, do complicate things a bit.
I guess it wasn't meant to be fed into a parser as is.
reorganising also avoids the disallowed situations.

I did notice quite a few hits on google from people asking for
C# parser rules, but no one seemed to find any that were ideal.

[...]
this is my nightmare atm :-

int index = (int)(unchecked((uint)destinationIndex[j].hashCode % (uint)prime));

the line of code is fine, and the grammar rules are fine in the decision tree
too, but my code which traverses it seems to drop a parenthesis somehow.

Sounds like an off-by-one error somewhere. Not that that necessarily makes it easier to find the bug.

yeah, it just happens that the lookahead logic kicks in many times here,
and it is complicated and hard to ensure it is free from errors.
I also have a few more table generator warnings to look into, especially as
they all seem to be related to cast-expression. reorganising the rules
seems to fix the problem, but it's worrying that I'm not sure why.
it might be good enough for now though; hopefully I can at least
create a tool which will make it a bit easier.

thanks
Colin =^.^=

pandiyaraj · Nov 7, 2009

Anybody have good C# tutorials?

Family Tree Mike · Nov 7, 2009

pandiyaraj said:
Anybody have good C# tutorials?

Why did you post to this thread?

There are some here:
http://msdn.microsoft.com/en-us/library/aa288436(VS.71).aspx
