Tough (for me) regex case


Rob Perkins

Hello,

I know I'm not a regular, and I'm new to the arcana of regular
expressions, so I'm a little stuck with two specific cases and I'm
hoping for a genius:

The case I'm most stumped on is an input string like this:

The "quick" brown "fox jumped ""over"" the" lazy dog.

Where what I want to have matched is the quoted strings, except that
paired doublequotes don't count, and I don't want to capture the
quotemarks. In other words, my desired matches are:

<>
quick
fox jumped ""over"" the
</>

If I use: /".+?"/, I get:

<>
"quick"
"fox jumped "
"over"
" the"
</>

...which isn't right. If I use /".+"/, I get:

<>
"quick" brown "fox jumped ""over"" the"
</>

...which also isn't right. So I don't know how to proceed and get the
match of the strings contained in doublequotes, with the paired
doublequotes escaped, and the matches without the quotes.
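
For anyone who wants to reproduce the two attempts above, here is a minimal sketch. It uses Python's re module purely for illustration (the thread itself is about Perl and .NET); the variable names are made up for the example.

```python
import re

text = 'The "quick" brown "fox jumped ""over"" the" lazy dog.'

# Lazy quantifier: stops at the first quote it can reach, so the
# doubled quotes chop the second string into pieces.
lazy = re.findall(r'".+?"', text)

# Greedy quantifier: runs on to the last quote in the input.
greedy = re.findall(r'".+"', text)

print(lazy)
print(greedy)
```

Neither quantifier knows that a doubled quote is an escape, which is why both results are wrong.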

How would you do it?

Rob
 
What I do sometimes is:

[1] replace the problem character(s) with something that, according to
the specs, can never occur in the string (e.g. s/.../\000/g, where
... stands for the problem char(s), not three literal dots)
[2] do your thing
[3] undo step 1 (e.g. s/\000/.../g; see the remark in step [1] about ...)
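
The three steps can be sketched roughly as follows, in Python for illustration. The sketch assumes that a NUL character never occurs in the input and that the "problem characters" are the doubled quotes; the function name quoted_strings is made up for the example.

```python
import re

SENTINEL = "\x00"  # assumption: never occurs in the input


def quoted_strings(text):
    # [1] hide the doubled quotes behind the sentinel
    masked = text.replace('""', SENTINEL)
    # [2] every remaining quote is now a delimiter, so a simple pattern works
    found = re.findall(r'"([^"]*)"', masked)
    # [3] undo step 1 in each extracted string
    return [s.replace(SENTINEL, '""') for s in found]


text = 'The "quick" brown "fox jumped ""over"" the" lazy dog.'
print(quoted_strings(text))
```

One caveat of this sketch: an empty quoted string ("") looks the same as a doubled quote and would be masked in step 1.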
 
Take a look at the Text::Balanced module - here's a short example:

use strict;
use warnings;
use Text::Balanced qw[ extract_delimited ];

my $text = q[The "quick" brown "fox jumped ""over"" the" lazy dog.];

# Each pass skips the unquoted prefix and extracts one "..." chunk
while (my ($extracted, $remainder) =
       extract_delimited($text, '"', '[^"]+', '"') )
{
    # Strip the surrounding quotes; leave the loop when nothing was extracted
    last unless $extracted =~ s#^"(.*)"$#$1#;
    print "EXTRACTED TEXT: $extracted\n";
    $text = $remainder;
}

HTH - keith
 
"(""|[^"])+"

gets me:
(1) "quick"
(2) "fox jumped ""over"" the"

- the | is an OR-operator which means that after " you match either "" or
any character except "
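
A quick way to check this pattern, sketched in Python for illustration (the pattern is unchanged; only the surrounding code is new):

```python
import re

text = 'The "quick" brown "fox jumped ""over"" the" lazy dog.'

# After the opening quote, repeatedly take either a doubled quote ("")
# or any single character that is not a quote, then the closing quote.
pattern = re.compile(r'"(""|[^"])+"')

# group(0) is the whole match, delimiting quotes included
matches = [m.group(0) for m in pattern.finditer(text)]
print(matches)
```

Listing the "" alternative first matters: at a doubled quote the engine consumes both characters together instead of treating the first one as the closing delimiter.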

I've tried to get rid of the initial and ending quotes -- I think it's
possible in the same expression -- but I haven't succeeded -- yet.
tricky? -- yes!

/mortb
 
...and
(?<=")(""|[^"])+(?=")

gets me...

(1) quick
(2) brown (notice the space before)
(3) fox jumped ""over"" the

this got rid of the quotes but introduced the " brown" error
Perhaps you can live with the quotes....

/mortb
 
Perhaps you can live with the quotes....

Not really. I'm localizing an application. The regex is part of the
parser I'm using to identify string constants buried in the code, and
replace them with calls into a hashtable which uses the English string
as the source data for

I worked around it by taking the resulting match and using string
length and positioning data to remove the quotes. Not a big deal, but
I was hoping for something really regex-elegant. Everyone who loves
regexes seems to think it's possible, but no one I know personally has
figured out how to do it quite yet...

Rob
 
Rob Perkins said:
and
replace them with calls into a hashtable which uses the English string
as the source data for

should have been

and replace them with calls into a hashtable which uses the English
string as the source data for the hash function.
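
As a rough illustration of that kind of rewrite, here is a sketch in Python. The call name Localize is hypothetical, as is the sample source line; the real parser and lookup function are not shown in this thread.

```python
import re

# A string literal: opening quote, then doubled quotes or non-quote
# characters, then the closing quote.
STRING_LITERAL = re.compile(r'"(?:""|[^"])*"')


def wrap_literals(source_line, call_name="Localize"):
    # call_name is a made-up placeholder for the real lookup function
    return STRING_LITERAL.sub(
        lambda m: f"{call_name}({m.group(0)})", source_line
    )


line = 'MsgBox("He said ""hi"" to me")'
print(wrap_literals(line))
```

Using a callback for the replacement keeps the literal, quotes and all, intact inside the new call.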

Rob
 
Does my earlier suggestion about using a named group not work?

The problem with using look-ahead/behind is that since the quotes do not
actually get consumed by the match, they remain available for the next
match. This is why " brown" appears to be a match when it is not. You must
consume the quotes with the match so they will not be re-used. Grouping
constructs can then be used to extract only the part of the match you need.
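
The difference can be seen side by side in a small sketch (Python, for illustration only; the variable names are made up):

```python
import re

text = 'The "quick" brown "fox jumped ""over"" the" lazy dog.'

# Look-around version: the quotes are only asserted, never consumed, so the
# closing quote of one string can act as the "opening" quote of the next match.
lookaround = [
    m.group(0) for m in re.finditer(r'(?<=")(?:""|[^"])+(?=")', text)
]

# Consuming version: the quotes are part of the match, and a group picks
# out just the text between them.
consuming = [
    m.group(1) for m in re.finditer(r'"((?:""|[^"])+)"', text)
]

print(lookaround)
print(consuming)
```

The look-around list contains the spurious " brown " entry; the consuming list does not.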

Brian Davis
http://www.knowdotnet.com
 
Brian Davis said:
Does my earlier suggestion about using a named group not work?

It looked like it would work, though I can't claim a lot of expertise.

I made use of

(?<!")"(?!")(.*?)(?<!")"(?!")

(offered by Steven Kuo on comp.lang.perl.misc)

...which works nicely, and then just used
Microsoft.VisualBasic.Mid(s,2,Microsoft.VisualBasic.Len(s)-2) to strip
the first and last characters, which with that regex are always
doublequotes.
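
A sketch of how that pattern behaves, in Python for illustration. Since the pattern already has a capturing group, group 1 appears to hold the same text that the Mid() call produces. The pattern seems to assume that a string is never empty and never starts or ends with a doubled quote.

```python
import re

text = 'The "quick" brown "fox jumped ""over"" the" lazy dog.'

# A delimiter is a quote with no other quote directly before or after it.
pattern = re.compile(r'(?<!")"(?!")(.*?)(?<!")"(?!")')

whole = [m.group(0) for m in pattern.finditer(text)]  # with quotes
inner = [m.group(1) for m in pattern.finditer(text)]  # without quotes

# Stripping the first and last character, as Mid(s, 2, Len(s) - 2) does
stripped = [s[1:-1] for s in whole]

print(whole)
print(inner)
```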

Might be slow 'n' ugly, but I'm not releasing this code.

I would have tried your named group suggestion, but I had to move on
to manipulating resx files so the compiler doesn't barf on 'em, and it
came in later than the other suggestion.

Thank you, though!

Rob
 
Sorry Brian,
I tested your expression "(?<no_quotes>(""|[^"])*)" and it also rendered

(1) "quick"
(2) "fox jumped ""over"" the"

thus still leaving the initial and ending quotes in the strings.

cheers,
mortb
 
The match itself contains the quotes, but the named group 'no_quotes'
contains only the text within the quotes.

As I mentioned in the reply, the match should consume the quotes so they
will not be re-used in other matches (the " brown" problem). You can then
use a named group to extract only a portion of the actual match.


Brian Davis
http://www.knowdotnet.com


 
I see.
I wrote it like this (in C#):
using System.Text.RegularExpressions;

string output = "";

// Collect only the named group, so the delimiting quotes are left out
foreach (Match oMatch in Regex.Matches(
    "The \"quick\" brown \"fox jumped \"\"over\"\" the\" lazy dog.",
    @"""(?<no_quotes>(""""|[^\""])*)"""))
{
    output += oMatch.Groups["no_quotes"].Value + "\r\n";
}

output then contains:

quick
fox jumped ""over"" the

/mortb


 
Got it!!!

$text = 'The "quick" brown "fox jumped ""over"" the" lazy dog.';

while ($text =~ /"(.*?)"/g) {
    if ($text =~ /"$_("".*?"")/) {
        push @matches, ($1);
        print "FOUND: $1\n";
    }
    elsif ($text =~ /(""$1"")(.*?)"/) {
        push @matches, $1;
        print "FOUND: $1 \n";
    }
    else {
        push @matches, $1;
        print "FOUND: $1\n";
    }
}

print "MATCHES: @matches\n";


Prints...

FOUND: quick
FOUND: fox jumped
FOUND: ""over""
FOUND: the
MATCHES: quick fox jumped ""over"" the


 
Actually, all you need is the last two expressions. Here it is revised:

#!/usr/bin/perl

$text = 'The "quick" brown "fox jumped ""over"" the" lazy dog.';

while ($text =~ /"(.*?)"/g) {
    if ($text =~ /".*?(""$1"").*?"/) {
        push @matches, $1;
        print "REGEX 1: $1 \n";
    }
    else {
        push @matches, $1;
        print "REGEX 2: $1\n";
    }
}

print "MATCHES: @matches\n";


It prints...

REGEX 2: quick
REGEX 2: fox jumped
REGEX 1: ""over""
REGEX 2: the
MATCHES: quick fox jumped ""over"" the



Brian said:
Got it!!!

 