Writing a parser the right way in C#


siddharthkhare

Hi All,
I need to parse certain text from a paragraph (around 20 lines).

I know the exact tags that I am looking for.

My approach is to define an XML (config) file that defines which tags I
am looking for and the corresponding regular expressions to search for
the patterns.

The XML file will also have a way to say what the previous tag and the
next tag should be. Again, some of it through regular expressions and
some of it through logic.

At runtime, just read the XML, find each tag and execute the
corresponding regular expression.
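
Roughly, I am thinking of something like this (a rough sketch; the
config file name, element names and attribute names are just
placeholders):

using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;
using System.Xml;

class ConfigDrivenExtractor
{
    // patterns.xml (hypothetical layout):
    //   <patterns>
    //     <pattern tag="Name" regex="Name:\s*(.+)" />
    //     <pattern tag="Date" regex="Date:\s*(\d{4}-\d{2}-\d{2})" />
    //   </patterns>
    static Dictionary<string, Regex> LoadPatterns(string configPath)
    {
        var patterns = new Dictionary<string, Regex>();
        var doc = new XmlDocument();
        doc.Load(configPath);
        foreach (XmlNode node in doc.SelectNodes("//pattern"))
        {
            string tag = node.Attributes["tag"].Value;
            string regex = node.Attributes["regex"].Value;
            patterns[tag] = new Regex(regex, RegexOptions.Compiled);
        }
        return patterns;
    }

    static void Main()
    {
        var patterns = LoadPatterns("patterns.xml");
        string paragraph = "Name: John Smith\nDate: 2004-06-15";
        foreach (KeyValuePair<string, Regex> entry in patterns)
        {
            Match m = entry.Value.Match(paragraph);
            if (m.Success)
                Console.WriteLine("{0} => {1}", entry.Key, m.Value);
        }
    }
}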

Assuming there may be more patterns added and more rules coming up, is
this the best approach?

Are there other ways to make it more flexible and generic?

I don't want to end up with stringent rules, but rather develop some
sort of extendable grammar.

Any ideas?
-KS
 

You'll always end up with code that's tied to the grammar of your
'language', unless you're using an LR(n) parser core with action/goto
tables.

Normally, you'd use a lexical analyzer to convert text to tokens, then
interpret the tokens with a parser and 'handle' them by converting
streams of terminals (tokens) into non-terminals and executing actions
based on the determined non-terminals. Terminals and non-terminals are
terms used in (E)BNF, the notation for grammars.

What you should focus on is writing something that works, rather than
something that can parse every language in the world, because that
won't work; there's always a part of the code that's tied to the
grammar. For example, even if you're using an LR(n) parser generator,
which in theory produces an action/goto table and uses a generic
parser core, it still has to have rule handlers which handle the
action to be executed when a non-terminal is found. For example, say
you have the following syntax:
http://www.microsoft.com
This can then be written in EBNF as:
URL -> UrlStartToken urltext UrlEndToken
UrlStartToken -> ...
UrlEndToken -> ...
urltext -> ...

Now, if the non-terminal 'URL' is found, it has to be handled, so the
rule handler for that non-terminal has to be written in code and is
therefore tied to the grammar, and therefore not generic. But that's
OK, as you simply want to parse something to get something done, not
to have something completely generic which doesn't do anything.
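
For illustration, a rough C# sketch of such a grammar-tied rule handler
(the token names follow the EBNF above; the lexing is crudely faked
with a regular expression):

using System;
using System.Text.RegularExpressions;

class UrlRuleHandler
{
    // When the parser reduces UrlStartToken urltext UrlEndToken to the
    // non-terminal URL, it calls a handler like this one. The handler
    // body is inherently tied to the grammar: it 'knows' what a URL
    // means to this particular application.
    static void HandleUrl(string startToken, string urlText)
    {
        Console.WriteLine("Found URL: {0}{1}", startToken, urlText);
    }

    static void Main()
    {
        string input = "see http://www.microsoft.com for details";
        // Crude stand-in for the lexer: group 1 is the start token,
        // group 2 is the url text.
        Match m = Regex.Match(input, @"(http://)(\S+)");
        if (m.Success)
            HandleUrl(m.Groups[1].Value, m.Groups[2].Value);
    }
}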

Frans

--
------------------------------------------------------------------------
Lead developer of LLBLGen Pro, the productive O/R mapper for .NET
LLBLGen Pro website: http://www.llblgen.com
My .NET blog: http://weblogs.asp.net/fbouma
Microsoft MVP (C#)
------------------------------------------------------------------------
 
Thanks for the reply.
Are there any lexical analyzers available that I can use from .NET?

Also, another question:
1) My understanding is that taking a lexer approach makes more sense if
you are writing a compiler for a language like C#, because you have to
write a handler/action for each non-terminal. You have to know each
terminal/non-terminal when you are writing your parser (design time).

If you anticipate more patterns being added after your parser is
deployed, it should just be a configuration file change: you just add
the search string and the regular expression in the config file and
your parser can handle it.

Is a lexer the right approach in that case as well? Or am I better off
with regular expressions?


2) Secondly, when you write a lexer you care about every word in the
line that you are parsing.

For example:

object o = new object ();

You would go with the lexer approach if you want to parse through each
token to make sure it is syntactically correct.
But if you just want to search for, let's say, the second occurrence of
the string "object", which is, let's say, 15 characters away from the
first occurrence, then you are better off just using a regular
expression.
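
i.e., something like this (rough sketch):

using System;
using System.Text.RegularExpressions;

class SecondOccurrence
{
    static void Main()
    {
        string line = "object o = new object ();";
        // All occurrences of the literal word "object".
        MatchCollection matches = Regex.Matches(line, @"\bobject\b");
        if (matches.Count >= 2)
        {
            Match second = matches[1];
            Console.WriteLine(
                "Second occurrence at index {0}, {1} chars after the first.",
                second.Index, second.Index - matches[0].Index);
        }
    }
}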

Is my assumption correct?
Thanks
KS
 
 
> Thanks for the reply.
> Are there any lexical analyzers available that I can use from .NET?

Not that I'm aware of, but they're not hard to write.
> Also, another question:
> 1) My understanding is that taking a lexer approach makes more sense
> if you are writing a compiler for a language like C#, because you
> have to write a handler/action for each non-terminal. You have to
> know each terminal/non-terminal when you are writing your parser
> (design time).
>
> If you anticipate more patterns being added after your parser is
> deployed, it should just be a configuration file change: you just
> add the search string and the regular expression in the config file
> and your parser can handle it.
>
> Is a lexer the right approach in that case as well? Or am I better
> off with regular expressions?

A lexical analyzer is a routine which uses regular expressions :).
The best way is to define your tokens as regular expressions and use
these expressions to 'tokenize' your input stream, especially if you
have start/end tokens for statements.
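
A minimal sketch of that idea in C# (the token definitions here are
just examples):

using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

class RegexLexer
{
    // Token definitions as (name, regex) pairs, tried in order at the
    // current position. \G anchors each match at that position.
    static readonly (string Name, Regex Pattern)[] TokenDefs =
    {
        ("NUMBER", new Regex(@"\G\d+")),
        ("IDENT",  new Regex(@"\G[A-Za-z_]\w*")),
        ("SYMBOL", new Regex(@"\G[=;()]")),
        ("SPACE",  new Regex(@"\G\s+")),
    };

    static IEnumerable<(string Name, string Text)> Tokenize(string input)
    {
        int pos = 0;
        while (pos < input.Length)
        {
            bool matched = false;
            foreach (var def in TokenDefs)
            {
                Match m = def.Pattern.Match(input, pos);
                if (!m.Success)
                    continue;
                if (def.Name != "SPACE")        // skip whitespace tokens
                    yield return (def.Name, m.Value);
                pos += m.Length;
                matched = true;
                break;
            }
            if (!matched)
                throw new Exception("Unexpected character at position " + pos);
        }
    }

    static void Main()
    {
        foreach (var tok in Tokenize("object o = new object();"))
            Console.WriteLine("{0}: {1}", tok.Name, tok.Text);
    }
}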
> 2) Secondly, when you write a lexer you care about every word in the
> line that you are parsing.
>
> For example:
>
> object o = new object ();
>
> You would go with the lexer approach if you want to parse through
> each token to make sure it is syntactically correct.
> But if you just want to search for, let's say, the second occurrence
> of the string "object", which is, let's say, 15 characters away from
> the first occurrence, then you are better off just using a regular
> expression.
>
> Is my assumption correct?

You need two parts: a lexical analyzer, which converts the input stream
into a stream of tokens, and a parser, which converts the stream of
tokens into a stream of actions.

The parser is the place where the token stream is checked for
correctness.
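
For example, a tiny recursive-descent check over such a token stream,
enforcing one hypothetical rule (roughly matching the declaration from
your example):

using System;
using System.Collections.Generic;

class TokenParser
{
    // Enforces: decl -> IDENT IDENT "=" IDENT IDENT "(" ")" ";"
    // A real parser would dispatch on the full grammar and fire an
    // action per recognized non-terminal.
    static void Expect(Queue<(string Name, string Text)> tokens,
                       string name, string text = null)
    {
        if (tokens.Count == 0)
            throw new Exception("Unexpected end of input, expected " + name);
        var tok = tokens.Dequeue();
        if (tok.Name != name || (text != null && tok.Text != text))
            throw new Exception("Expected " + (text ?? name) +
                                ", got '" + tok.Text + "'");
    }

    static void Main()
    {
        // The token stream as the lexer sketch above would produce it
        // for "object o = new object();".
        var tokens = new Queue<(string Name, string Text)>(new[]
        {
            ("IDENT", "object"), ("IDENT", "o"), ("SYMBOL", "="),
            ("IDENT", "new"), ("IDENT", "object"),
            ("SYMBOL", "("), ("SYMBOL", ")"), ("SYMBOL", ";")
        });

        Expect(tokens, "IDENT");            // type
        Expect(tokens, "IDENT");            // variable name
        Expect(tokens, "SYMBOL", "=");
        Expect(tokens, "IDENT", "new");
        Expect(tokens, "IDENT");            // constructed type
        Expect(tokens, "SYMBOL", "(");
        Expect(tokens, "SYMBOL", ")");
        Expect(tokens, "SYMBOL", ";");
        Console.WriteLine("Token stream is a valid declaration.");
    }
}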

Frans


--
------------------------------------------------------------------------
Lead developer of LLBLGen Pro, the productive O/R mapper for .NET
LLBLGen Pro website: http://www.llblgen.com
My .NET blog: http://weblogs.asp.net/fbouma
Microsoft MVP (C#)
------------------------------------------------------------------------
 