RegEx: How to ignore the number of whitespaces?

  • Thread starter: Florian Haag

Florian Haag

Hi,
I'm not sure whether this is the right group; I'm trying to achieve the
following with .NET's RegEx class:

I want to match strings while ignoring the number of whitespaces.
In a simple case, this would of course mean something like

a\s+b

which would match not only "a b", but also "a  b", "a   b" etc.

However, a case like

a\s+b?\s+b

already doesn't work for me any more, as it would only match "a  c"
(two spaces in between), not "a c" (one space in between), if the "b"
is omitted. I can override this by using an expression like

a\s+(b\s+)?b

, which would already require some modifications from the input,
though, as users of the target application will not bother to include
the 2nd whitespace into the optional part of the string when they input
the expression (using a very simplified and otherwise limited syntax,
which I'd like to convert to RegEx).

Things get even more complicated in cases like this:

(a|b\s+)(c|\s+d)

It seems to me that I cannot evaluate this directly but instead have to
replace it with

ac|a\s+d|b\s+c|b\s+d

in order to make it match "b d" (one space in between), too, not only
"b  d" (two spaces in between).

I wonder whether this can be done by including each \s+ into a named
group and then use an alternation construct referencing to the groups
of any possibly adjacent space, thereby determining whether another \s+
is required to match.
But maybe there's another, simpler (and maybe even faster?) way to
achieve this?
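
The (a|b\s+)(c|\s+d) case above can be reproduced outside .NET. The
following is a minimal sketch in Python (an assumption; the post itself
targets .NET's Regex class, but the two patterns behave the same way
here). The helper name full is made up for illustration.

```python
import re

# Pattern from the post: both groups carry their own mandatory \s+.
nested = re.compile(r"(a|b\s+)(c|\s+d)")
# Hand-expanded form with exactly one \s+ per word boundary.
expanded = re.compile(r"ac|a\s+d|b\s+c|b\s+d")


def full(regex: re.Pattern, text: str) -> bool:
    """Return True when the whole text matches the regex."""
    return regex.fullmatch(text) is not None


# Print how each candidate string fares against both patterns.
for text in ["b d", "b  d", "a d", "ac"]:
    print(repr(text), full(nested, text), full(expanded, text))
```

Only "b d" with a single space separates the two: the nested form needs
one space for each \s+, while the expanded form accepts it.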

Thanks in advance,
Florian
 
Florian said:
I want to match strings while ignoring the number of whitespaces.
In a simple case, this would of course mean something like

a\s+b

a\s*b would completely ignore the whitespace, unless you want at least
one.

Florian said:
However, a case like

a\s+b?\s+b

already doesn't work for me any more, as it would only match "a  c"
(two spaces in between), not "a c" (one space in between), if the "b"
is omitted.

This expression will never match "a c". Are you trying to match "a b b"
as well as "a b"?

Florian said:
It seems to me that I cannot evaluate this directly but instead have to
replace it with

ac|a\s+d|b\s+c|b\s+d

By the above, the rules for the pattern of your strings are that it can
be any of:

ac
a (at least one space) d
b (at least one space) c
b (at least one space) d

Sounds like your homework to me. I don't understand what format the
strings are supposed to have.

Chris
 
^a\s*b?\s*c?\s*d?$

0 or more spaces indicated. Each letter between spaces is optional. The
beginning and end of line characters prevent mis-ordered matches, such as
"ba", as the entire string must match.

--
HTH,

Kevin Spencer
Microsoft MVP

Printing Components, Email Components,
FTP Client Classes, Enhanced Data Controls, much more.
DSI PrintManager, Miradyne Component Libraries:
http://www.miradyne.net
 
Kevin said:
^a\s*b?\s*c?\s*d?$

0 or more spaces indicated.

Sorry, but I want a pattern which requires at least one whitespace
character, but which, at the same time, does not _require_ more than
one consecutive whitespace character.

Regards,
Florian
 
Hi! First of all, thanks for your response!

Chris said:
a\s*b would completely ignore the whitespace, unless you want at least
one.

Yes, I do need at least one whitespace. I don't want to ignore the
whitespace altogether, I just want to ignore the number of consecutive
whitespace characters.

Chris said:
This expression will never match "a c". Are you trying to match "a b b"
as well as "a b"?

Oops, sorry - the last "b" should have been a "c", as in

a\s+b?\s+c

However, this won't match "a c" (with one space in between).

Chris said:
By the above, the rules for the pattern of your strings are that it can
be any of:

ac
a (at least one space) d
b (at least one space) c
b (at least one space) d

Yes, that's correct - my question is whether I can go another way than
resolving (a|b\s+)(c|\s+d) to ac|a\s+d|b\s+c|b\s+d (which would
obviously mean creating all possible combinations of the (...|...)
parts). If each bracket holds more than two alternatives, this would
mean an enormous increase in the size of the RegEx, which I'd like to
avoid, if possible.

Chris said:
Sounds like your homework to me. I don't understand what format the
strings are supposed to have.

It's definitely not my homework; it's actually for a vocabulary
training programme the first version of which you can find here:

http://VocDB.de.vu

The input strings are supposed to have the following format:

a and b may be replaced with any characters (or chains thereof) except
\[]()|.
\ preceding any of \[]()| escapes the respective symbol; otherwise
the symbol has a special meaning, as described below.
[a] means "a" is optional.
[a|b] means either "a" or "b" or nothing may be written.
(a|b) means either "a" or "b" must be written.

There can be more than one | within each pair of brackets, delimiting
more than two alternatives.

i.e. the whole thing is something slightly Regex-like for
non-programmers.

In version 1 of the above programme, I use my own evaluator for this.
However, for the sake of maintainability, I hoped I could eventually
switch to simply converting those input patterns into RegEx-strings.
If only there were a way to ignore the number of consecutive whitespace
characters without ignoring that there _are_ whitespace characters at
certain places in the word.

Kind regards,
Florian
 
If you can explain the requirements of the pattern you're trying to match,
without using any regular expression terminology, I can help. A regular
expression is a sequence of characters that represent a pattern, or a set of
rules regarding what is to be matched in text. Since you're having trouble
creating the regular expression, using regular expression symbol terminology
to explain the rules only confuses the issue.

Here's an example of what I mean:

"I want to match any number (greater than 0) of sequences of 1 or more
alphanumeric (only) characters with no spaces between them. Each sequence is
separated from the others by a single space, which may be any white space
character except for a line break. Any non-alpha-numeric character other
than a non-line-break white space character terminates a matching sequence."

Note that no regular expression terminology is used in the above
description. It describes the rules for a matching character sequence,
including what is required, how many of what is required are required, what
is NOT required, and what is prohibited.

--
HTH,

Kevin Spencer
Microsoft MVP

 
Kevin said:
If you can explain the requirements of the pattern you're trying to
match, without using any regular expression terminology, I can help.

Hi,
thanks for your response!

Hope this is something like what you meant:
"Users of my programme input sequences of arbitrary Unicode characters
(from now on, referred to as "patterns"). These patterns are supposed
to match other given sequences of Unicode characters (from now on,
referred to as "strings").

Certain subsequences of a pattern may be marked as optional. These may
be found in the string, but need not.
Certain subsequences of a pattern may be marked as a set of
alternatives. Exactly one of them must be found in the string, neither
more nor less.
A pattern will never require more than one space character without any
other characters in between to be found in a string.
A pattern will accept any number of space characters (greater than
zero) without any other characters in between in the string at a
position where a space character is expected.
A pattern will ignore any space characters at the beginning and at the
end of a string.
A pattern will never require any space characters at the beginning and
at the end of a string."

I'm looking for the easiest way to quickly convert the pattern into a
standard regular expression.
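
Two of the rules above (any run of spaces counts as one space; spaces at
the beginning and end are ignored) could be handled on the string side
before any matching takes place. A minimal sketch in Python, where
normalize_spaces is a hypothetical helper name and not part of the
programme described in this thread:

```python
import re


def normalize_spaces(text: str) -> str:
    """Trim the ends and collapse every run of whitespace to one space."""
    return re.sub(r"\s+", " ", text.strip())


# A plain pattern without optional parts can then be compared directly.
print(normalize_spaces("  personal    computer ") == "personal computer")
```

This covers only plain patterns. It does not by itself solve the case of
an optional part that sits between two spaces, since dropping the
optional part still leaves two separators in the pattern.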

Thanks in advance,
Florian
 
That is helpful, but I still have a few questions.

Florian said:
"Users of my programme input sequences of arbitrary Unicode characters
(from now on, referred to as "patterns"). These patterns are supposed
to match other given sequences of Unicode characters (from now on,
referred to as "strings").
[...]
I'm looking for the easiest way to quickly convert the pattern into a
standard regular expression.

This sounds like the "patterns" are performing the work of regular
expressions, matching character sequences in strings. What I don't
understand is why you want to create a new regular expression syntax which
your users must learn, then convert it to the original, rather than using
the original? Or perhaps I'm misunderstanding your intention altogether?

Second, what are the limitations of the "arbitrary Unicode characters?"
The Unicode code space has over a million code points, and even if we
confine ourselves to a single character set, we are still talking about
alphanumeric characters, punctuation, diacritical characters, and
non-printing characters. I will assume that some of these are not within
the set of "arbitrary" characters you're referencing. But I don't know
which ones are allowed, and which ones are not.

Florian said:
Certain subsequences of a pattern may be marked as optional. These may
be found in the string, but need not.
Certain subsequences of a pattern may be marked as a set of
alternatives. Exactly one of them must be found in the string, neither
more nor less.

Okay, we've discussed "arbitrary," but now you will need to define the term
"marked." As the "patterns" are pure text, the "marks" must also be text.
But what constitutes a "text" character and a "mark" character, and how do
you escape text characters to create marks?

--
HTH,

Kevin Spencer
Microsoft MVP

 
Hi!

Kevin said:
This sounds like the "patterns" are performing the work of regular
expressions, matching character sequences in strings.

That's right.

Kevin said:
What I don't understand is why you want to create a new regular
expression syntax which your users must learn, then convert it to the
original, rather than using the original?

Some 95% of my users won't have any programming experience whatsoever,
or any computer science background. I doubt usual regular expressions
with all their features would be suitable for those inexperienced users.
I'd expect it to be very hard to explain, for example, why they must
write \. and \? instead of simply writing a full stop or a question mark.
What's more, my "space character problem" would remain, for my users
do not understand why the pattern "personal computer" will only match
"personal computer", but not "personal  computer" (two spaces in
between), for it's the same words. At the same time, they'd consider
writing patterns like "personal *computer" (or even
"personal\s*computer") way too unintuitive to use my programme.

That's why I offer another pattern syntax with a very limited set of a
few special characters which denote very few pattern features (optional
pattern parts, alternative pattern parts) and everything else one could
possibly write into a pattern will be evaluated just as it's been input.

Kevin said:
Second, what are the limitations of the "arbitrary Unicode
characters?"

Actually, that means all Unicode characters except spaces. By
"arbitrary", I wanted to express that any characters may appear in any
order without any restrictions in a pattern and should match just like
that.
Pardon for not describing it very accurately :-$

Kevin said:
Okay, we've discussed "arbitrary," but now you will need to define
the term "marked." As the "patterns" are pure text, the "marks" must
also be text. But what constitutes a "text" character and a "mark"
character, and how do you escape text characters to create marks?

Right - there are a few Unicode characters which have to be escaped
(which were chosen in a way that they don't appear in regular
vocabulary, anyway). These are: \ ( ) [ ] |
If any of these characters is meant to actually be found in the
string, it has to be preceded by a backslash.
Otherwise, pairs of both ( and ) as well as [ and ] "mark" a part of a
pattern.

Within such a marked part, there may be any number (greater than zero)
of alternative patterns, each separated by a | character.
If ( and ) are used to denote the part of the pattern, exactly one of
the alternative patterns must appear in the string.
If [ and ] are used to denote the part of the pattern, at most one of
the alternative patterns may appear in the string.

Such marked parts may be nested to an unlimited depth, that is, each of
the above alternative patterns may contain marked parts of its own.

That should be all about the syntax of my patterns, as they are already
used in version 1 of my programme.
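
The syntax described above is small enough to translate mechanically.
What follows is a minimal sketch in Python, not the evaluator from
version 1 and not something proposed in this thread. The names SEP,
translate and matches are hypothetical. The main assumption is the
separator: every run of spaces in the pattern becomes an expression that
is satisfied when whitespace sits at or directly before the current
position, or at either end of the string, so two separators around an
omitted optional part can share the same space.

```python
import re

# Assumed separator: whitespace just behind us, one whitespace character
# here, or a string boundary; afterwards swallow any further whitespace.
SEP = r"(?:(?<=\s)|\s|\A|\Z)\s*"


def translate(pattern: str) -> str:
    """Convert the (a|b) / [a|b] / backslash-escape syntax to a regex."""
    pos = 0

    def sequence(stop: str) -> str:
        nonlocal pos
        out = []
        while pos < len(pattern) and pattern[pos] not in stop:
            ch = pattern[pos]
            if ch == "\\" and pos + 1 < len(pattern):
                out.append(re.escape(pattern[pos + 1]))  # escaped literal
                pos += 2
            elif ch in "([":
                close = ")" if ch == "(" else "]"
                pos += 1
                alts = [sequence("|" + close)]
                while pos < len(pattern) and pattern[pos] == "|":
                    pos += 1
                    alts.append(sequence("|" + close))
                if pos >= len(pattern):
                    raise ValueError("unclosed bracket in pattern")
                pos += 1  # skip the closing bracket
                group = "(?:" + "|".join(alts) + ")"
                out.append(group + "?" if ch == "[" else group)
            elif ch.isspace():
                while pos < len(pattern) and pattern[pos].isspace():
                    pos += 1
                out.append(SEP)  # one separator per run of spaces
            else:
                out.append(re.escape(ch))  # stray ) ] | are taken literally
                pos += 1
        return "".join(out)

    return sequence("")


def matches(pattern: str, text: str) -> bool:
    """Match the whole text, ignoring leading and trailing whitespace."""
    regex = r"\s*" + translate(pattern) + r"\s*"
    return re.fullmatch(regex, text) is not None


print(matches("a [b] c", "a c"), matches("a [b] c", "ac"))
```

With this separator, "a [b] c" accepts "a c", "a b c" and "a  c" but
rejects "ac", and "(a|b )(c| d)" accepts "b d" with a single space
without expanding the alternatives. The lookbehind is fixed-width, which
Python requires; error handling is deliberately minimal.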

Regards,
Florian
 
Hi Florian,

I must admit your situation is confusing, and I do find the idea of creating
a "simpler" regular expression syntax is likely to bite you eventually, one
way or another, but requirements are requirements, and my job is to help you
solve your problem. So.....

I'm still a little in the dark as to the full scope of what you're doing,
but it may not be necessary to understand the whole thing in order to solve
this particular problem. If I understand you fully, you're looking for a way
to require at least one space between separate character sequences in a
string, but that some of these character sequences may be "marked" as
optional, in which case no white spaces would be necessary.

If so, I believe this can be solved using a conditional expression:

this(?(?=.)\s+)

This is a regular expression "if" conditional statement, which is a regular
expression "if/else" conditional statement without an "else." The syntax of
a regular expression "if/else" conditional statement is:

(?(?=regex)then|else)

This means that when the lookahead expression matches at the current
position, the "then" expression is used. When it does not match, the
"else" expression is used. So the first example above means "look for
'this'"; if anything follows it, it must be followed by at least 1 white
space character (otherwise, not).

For optional matches, you would use the optional operator as you've
illustrated before:

(?:this(?(?=.)\s+))?

In the following, "this," "that," or "other" will match in any combination,
as long as it ends in "other":

(?:this(?(?=\s.)\s+))?(?:that(?(?=\s.)\s+))?(?:other)

matches:

other

this other

that other

this other

It does NOT match:

this

this that
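
For readers trying this outside .NET: Python's re module has no
conditional on a lookahead (it only supports conditionals on a group
number or name), but an else-less conditional (?(?=X)Y) can be written
as the alternation (?:(?=X)Y|(?!X)). The sketch below rebuilds the
three-word example that way; if_followed and SPACE_IF_MORE are
hypothetical names, and whole-string matching is an assumption.

```python
import re


def if_followed(condition: str, then: str) -> str:
    """Emulate (?(?=condition)then) using two plain lookaheads."""
    return "(?:(?=" + condition + ")" + then + "|(?!" + condition + "))"


# "If whitespace plus one more character follows, consume the whitespace."
SPACE_IF_MORE = if_followed(r"\s.", r"\s+")

pattern = re.compile(
    "(?:this" + SPACE_IF_MORE + ")?"
    "(?:that" + SPACE_IF_MORE + ")?"
    "(?:other)"
)

for text in ["other", "this other", "that other", "this", "this that"]:
    print(repr(text), pattern.fullmatch(text) is not None)

# Worth checking before relying on it: the condition is only true when
# whitespace follows, so "thisother" (no space at all) is accepted too.
print(pattern.fullmatch("thisother") is not None)
```

If a space must be mandatory whenever another word follows, the (?=.)
condition from the first example is the stricter choice.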

--
HTH,

Kevin Spencer
Microsoft MVP

 
Hi, Kevin,
thanks for all your answers.

Kevin said:
I must admit your situation is confusing, and I do find the idea of creating
a "simpler" regular expression syntax is likely to bite you eventually, one
way or another, but requirements are requirements, and my job is to help you
solve your problem. So.....

Well, it's not really my idea, rather somewhat common practice. Most
bilingual dictionaries feature a syntax where "green photo(graph)" or
"green photo[graph]" denotes the words "green photo" as well as "green
photograph". I haven't ever seen a dictionary which uses actual
regular expression syntax to print its words (i.e.
"green\sphoto(graph)?").

Anyway, thanks for your explanations regarding conditional statements
in regular expressions. I think I now have enough information to
consider the alternatives and decide how to implement my pattern
matching :-)

Best regards,
Florian
 