Regular expression


Cylix

I am going to write a function that does what a search engine does.
In a search engine, we can use double quotation marks to specify a phrase,
like "I love you".
How can I use a regular expression to separate each phrase?

test case:
"I love" all "of you"


I would like it to return:
"I love", all, "of you"


Thank you!
 
I would make a pattern that matches spaces with an optional quoted
phrase, and split on that.

Some untested code, but it should get you started:

Regex re = new Regex(@" |(?: ?(""[^""]*"") ?)");
string[] splitted = re.Split(input);
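
For anyone who wants to try the pattern above, here is a minimal sketch of how it might be wrapped up. The class name `SplitSketch` and the helper `Tokenize` are made up for illustration. One thing to be aware of: `Regex.Split` returns empty strings when a match sits at the very start or end of the input, or when two matches are adjacent, so the sketch filters those out. It assumes quotes are balanced and that empty entries are never wanted.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

public static class SplitSketch
{
    // Same pattern as in the post: a lone space, or a quoted phrase with
    // optional spaces around it. Because the quoted phrase is in a capture
    // group, Split keeps it in the result instead of discarding it.
    private static readonly Regex Separator =
        new Regex(@" |(?: ?(""[^""]*"") ?)");

    // Hypothetical helper: splits the input and drops the empty strings
    // that Split produces around matches at the edges of the input or
    // between two adjacent matches.
    public static string[] Tokenize(string input)
    {
        return Separator.Split(input)
                        .Where(part => part.Length > 0)
                        .ToArray();
    }
}
```

Calling `SplitSketch.Tokenize("\"I love\" all \"of you\"")` should give the three entries `"I love"`, `all` and `"of you"`, with the quotes kept on the phrases.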
 
Well, you've made the usual mistake of not defining your rules. An example
may imply some rules, but not others. For example, your example does not
state whether or not an odd number of double-quotes might be found in the
string. You have not specifically said whether or not double-quotes
surrounding a phrase must be included in the match, nor whether spaces
surrounding a phrase must be included in the match. There are a number of
other rules which are not specified as well, such as handling line breaks.

A regular expression is an expression of a set of rules which must be
absolutely specific.

However, I will give you a few examples that should cover the various
possibilities.

First, we are looking at 2 specific sets of rules:

1. A phrase surrounded by double-quotes.
2. A phrase *not* surrounded by double-quotes.

Therefore, in order to match them, we must either create 2 groups, or use
one group to split the total string into matches of the other. If we use 2
groups, we can get both, but we will have to sort out which is which. If we
only use one, we will need to perform 2 sets of operations:

1. Match all matches.
2. Split and get all remaining elements.

So, the rule for the phrases surrounded by quotes is fairly simple:

"[^"]*"

Translated, this says that a match is defined by a double-quote, followed by
zero or more non-double-quotes (any character except a double-quote),
followed by a double-quote. This will capture, in your example:

"I love"
"of you"
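
As a rough sketch of the two-operation approach with this one rule, the code below first collects the quoted phrases with `Regex.Matches`, then splits on the same pattern to get whatever lies between them. The names `QuotedPhraseSketch`, `QuotedPhrases` and `Remainder` are invented for the example, and trimming and dropping empty pieces is an assumption about what is wanted.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

public static class QuotedPhraseSketch
{
    // A double-quote, zero or more non-double-quotes, a double-quote.
    private static readonly Regex Quoted = new Regex(@"""[^""]*""");

    // Operation 1: collect every quoted phrase, quotes included.
    public static string[] QuotedPhrases(string input)
    {
        return Quoted.Matches(input)
                     .Cast<Match>()
                     .Select(m => m.Value)
                     .ToArray();
    }

    // Operation 2: split on the quoted phrases to get the text between
    // them, trimming surrounding spaces and dropping empty pieces.
    public static string[] Remainder(string input)
    {
        return Quoted.Split(input)
                     .Select(part => part.Trim())
                     .Where(part => part.Length > 0)
                     .ToArray();
    }
}
```

For the test case, `QuotedPhrases` should return `"I love"` and `"of you"`, and `Remainder` should return `all`. Note that the two results come back as separate arrays, so the original order of quoted and non-quoted phrases is lost with this approach.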

Now, if you create a rule that is the opposite of that, you get:

[^"]*

Translated, this says that a match is any phrase *not* containing a
double-quote.

These 2 can be used together with grouping and an "or" ('|') operator, as
in:

("[^"]*")|([^"]*)

It is important to order them in this way, as the first group will capture
the quoted phrases, double-quotes included, and the second group will
capture anything *except* double-quotes. If the second group is used first,
it will capture the text of the quoted phrases without capturing the
double-quotes, and the first group will never match, as that text has
already been consumed.

When using this version, both groups are captured, effectively capturing the
entire string into 2 groups of matches, and you use the groups to identify
which regular expression was matched (quoted in group 1 and non-quoted in
group 2). You should also note that the second group will capture spaces
between the quoted phrases and the non-quoted phrases, as part of the
non-quoted phrase. I know of no way to trim this in the regular expression
itself, so you would have to trim the values from the matches themselves.
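
A minimal sketch of using the two-group version, with the trimming done in code as described. `TwoGroupSketch` and `Phrases` are hypothetical names. Because `[^"]*` can also match an empty string (for example at the very end of the input), the sketch skips non-quoted matches that are empty after trimming; that filtering is my assumption, not part of the pattern.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class TwoGroupSketch
{
    // Group 1: a quoted phrase. Group 2: a run of non-double-quotes.
    private static readonly Regex Phrase =
        new Regex(@"(""[^""]*"")|([^""]*)");

    // Returns the phrases in the order they appear in the input.
    public static List<string> Phrases(string input)
    {
        var result = new List<string>();
        foreach (Match m in Phrase.Matches(input))
        {
            if (m.Groups[1].Success)
            {
                // Quoted phrase: keep it as matched, quotes included.
                result.Add(m.Groups[1].Value);
            }
            else
            {
                // Non-quoted text: trim the surrounding spaces and skip
                // matches that turn out to be empty.
                string text = m.Groups[2].Value.Trim();
                if (text.Length > 0)
                {
                    result.Add(text);
                }
            }
        }
        return result;
    }
}
```

For the test case this should produce `"I love"`, `all`, `"of you"`, in that order. If several non-quoted words appear in a row, they stay together as one entry, so a further split on spaces would be needed to get individual words.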

--
HTH,

Kevin Spencer
Microsoft MVP
Professional Development Numbskull

Abnormality is anything but average.

Göran Andersson said:
I would make a pattern that matches spaces with an optional quoted phrase,
and split on that.

Some untested code, but it should get you started:

Regex re = new Regex(@" |(?: ?(""[^""]*"") ?)");
string[] splitted = re.Split(input);
 
Was that a reply for me, or did you intend to reply to the original poster?
 
It was intended for the original poster, but I hit the reply button while
your message was open.

Sorry about any confusion.
--
HTH,

Kevin Spencer
Microsoft MVP
Professional Development Numbskull

Abnormality is anything but average.

 