Regular Expression help

  • Thread starter: Mark Downes

Mark Downes

I need to split a string based on a character, such as a space (0x20).
The caveat is that I need to ignore the character if it is found
between a pair of some other characters, such as single-quotes. For
example (where _ denotes a space character):
ab_cd_'ef'_'g_h'_i_'j_k_l'
I should get back
ab
cd
ef
g_h
i
j_k_l

I'm sure there is an easy solution to this if you are a regex guru.
As for me, I'm clueless.
 
This seems to work fine:
([^' ]+)|'([^']+)'

Unfortunately, the full match for a quoted token still includes the quotes,
so you'll have to strip them from the matches manually, or read whichever of
the two capture groups actually matched.

Niki
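
A minimal sketch of the "query both groupings" idea, written in Python rather than the thread's C#, so `re.finditer` stands in for `Regex.Matches`. The pattern is the one from the post above; the helper name `tokens` is made up for illustration.

```python
import re

# Pattern from the post above: an unquoted run, or a single-quoted run.
TOKEN = re.compile(r"([^' ]+)|'([^']+)'")

def tokens(text):
    """Return each token, taking whichever capture group matched."""
    result = []
    for m in TOKEN.finditer(text):
        # Group 1 is set for unquoted tokens, group 2 for quoted ones
        # (group 2 already excludes the surrounding quotes).
        result.append(m.group(1) if m.group(1) is not None else m.group(2))
    return result

print(tokens("ab cd 'ef' 'g h' i 'j k l'"))
# ['ab', 'cd', 'ef', 'g h', 'i', 'j k l']
```

Reading the groups instead of the whole match avoids a separate step to trim the quotes.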
 
How do I use this with the Split method in the Regex class?

Mark
 
I don't think you can do what you want with the Split method, at least not
in a performant way; you'll have to use Regex.Matches to get a list of all
the matches.

Niki
 
Thanks, Niki, for your help. I'm really close to what I need. The only
thing I'm missing is that I need the regular expression to behave like
the String.Split method, where two adjacent delimiters give back an
empty string, and a delimiter at the beginning or end of the string
also gives back an empty string.

I'm using the Regex.Matches method like you suggested and I get back
the correct data whenever the aforementioned cases are absent from the
string.

Do you have any advice?

Mark
 
Try this variant, which also accepts an empty quoted string (''):
([^' ]+)|'([^']*)'
If it doesn't work, please post some sample data that demonstrates your
problem.

Niki
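
A small sketch of what the relaxed pattern does and does not change, again in Python as a stand-in for the thread's C#. The helper name `tokens_relaxed` and the two sample inputs are made up for illustration.

```python
import re

# Same pattern as in the post above, with * instead of + inside the quotes.
RELAXED = re.compile(r"([^' ]+)|'([^']*)'")

def tokens_relaxed(text):
    """Return each token, taking whichever capture group matched."""
    return [m.group(1) if m.group(1) is not None else m.group(2)
            for m in RELAXED.finditer(text)]

# An empty quoted string now produces an empty token...
print(tokens_relaxed("a '' b"))  # ['a', '', 'b']
# ...but two adjacent delimiters still produce no token between them.
print(tokens_relaxed("a  b"))    # ['a', 'b']
```

So the variant covers explicitly quoted empty values, while String.Split-style empty fields between adjacent delimiters remain unmatched.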
 
Here is an example set of data where I'm trying to split the data from
a comma-delimited string with double-quotes around strings (all one
line):
"50-00-0","Formalin",3,4,0,,"DANGER","Corrosive, Flammable","Eyes, Skin, Respiratory System, Kidney","Goggles, Fshield, Gloves, Fullsuit, Boots, ChkResp"

I changed the expression to ([^\",]+)|\"([^\"]*)\" in C# (the backslashes
escape the quotes inside the string literal).
For the data, I need to get back the following split data:
"50-00-0"
"Formalin"
3
4
0
<empty string>
"DANGER"
"Corrosive, Flammable"
"Eyes, Skin, Respiratory System, Kidney"
"Goggles, Fshield, Gloves, Fullsuit, Boots, ChkResp"

With the expression I get back all of the correct data, but I'm
missing the empty string where there was no value listed in the data.

Mark
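
To make the gap concrete, here is a short Python sketch (a stand-in for the C# in the thread) that runs the pattern from the post above over the sample line. The names `LINE` and `FIELD` are made up for illustration, and the C# string escapes are dropped because a Python raw string does not need them.

```python
import re

# The sample data from the post above, joined back into one line.
LINE = ('"50-00-0","Formalin",3,4,0,,"DANGER","Corrosive, Flammable",'
        '"Eyes, Skin, Respiratory System, Kidney",'
        '"Goggles, Fshield, Gloves, Fullsuit, Boots, ChkResp"')

# An unquoted field of at least one character, or a double-quoted field.
FIELD = re.compile(r'([^",]+)|"([^"]*)"')

# group(0) is the whole match, so quoted fields keep their quotes.
fields = [m.group(0) for m in FIELD.finditer(LINE)]

print(len(fields))   # 9, although the line holds 10 fields
print(fields[4:6])   # ['0', '"DANGER"'] -- nothing for the empty field
```

The unquoted alternative uses `+`, so the position between the two adjacent commas can never produce a match.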
 
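Your pattern requires at least one character for an unquoted field, so an
empty field between two commas never produces a match. You can, however,
include the comma in the match like this:
\G(([^\",]*)|\"([^\"]*)\")\s*(,|$)
and use only the first capture group, which holds the field without the
trailing comma.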

Does this work?

Niki
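
A rough Python sketch of the same idea, under a few assumptions: Python's `re` module has no `\G` anchor, so `pattern.match(line, pos)` is used to force each match to start where the previous one ended. The function name `split_fields` and the stop condition are made up for illustration and are not from the thread.

```python
import re

# The pattern from the post above without \G and without C# escapes.
# Group 1 is the field (quotes kept), group 4 is the delimiter or end.
FIELD = re.compile(r'(([^",]*)|"([^"]*)")\s*(,|$)')

def split_fields(line):
    """Split a line into fields, keeping empty fields like String.Split."""
    fields = []
    pos = 0
    while True:
        # match() anchors at pos, which plays the role of \G here.
        m = FIELD.match(line, pos)
        if m is None:
            raise ValueError(f"malformed input at index {pos}")
        fields.append(m.group(1))
        if m.group(4) == "":
            # The field was terminated by end of input, not by a comma.
            break
        pos = m.end()
    return fields

print(split_fields('3,4,0,,"DANGER","Corrosive, Flammable"'))
# ['3', '4', '0', '', '"DANGER"', '"Corrosive, Flammable"']
```

Stopping once the delimiter group is empty means the end-of-input position is not matched a second time, and a trailing comma still yields a final empty field.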
 
I found a solution to the problem in the article 'Managed Extensions:
Parsing CSV Files with Regular Expressions' at
http://www.codeguru.com/Cpp/Cpp/string/net/article.php/c8153

I used it with the Regex.Split method and it worked perfectly. The
regular expression is ",(?=(?:[^\"]*\"[^\"]*\")*(?![^\"]*\"))".

Thanks for all of your help Niki!

Mark
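
For reference, a short Python sketch of the lookahead-based split, with `re.split` standing in for the `Regex.Split` call mentioned above. The names `LINE` and `COMMA_OUTSIDE_QUOTES` are made up for illustration; the pattern is the one quoted in the post, minus the C# string escapes.

```python
import re

# The sample data from earlier in the thread, joined into one line.
LINE = ('"50-00-0","Formalin",3,4,0,,"DANGER","Corrosive, Flammable",'
        '"Eyes, Skin, Respiratory System, Kidney",'
        '"Goggles, Fshield, Gloves, Fullsuit, Boots, ChkResp"')

# A comma counts as a delimiter only when an even number of double
# quotes follows it, i.e. when it is not inside a quoted field.
COMMA_OUTSIDE_QUOTES = re.compile(r',(?=(?:[^"]*"[^"]*")*(?![^"]*"))')

fields = COMMA_OUTSIDE_QUOTES.split(LINE)

print(len(fields))   # 10
print(fields[4:7])   # ['0', '', '"DANGER"']
```

Because the pattern matches only the delimiter, the split keeps the empty field between the two adjacent commas.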
 
Note that this lookahead pattern:
a) has to scan the rest of the string after each comma, so it has
O(n²) performance, which is bad for long strings of data, and
b) only works correctly on a well-formed input string.

If your input strings are short and always well-formed (i.e. they contain
an even number of double quotes), it should work fine!

Niki
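
If either caveat matters, one alternative (not proposed in the thread) is a hand-written single pass over the string. The Python sketch below is only an illustration; the function name `split_outside_quotes` and its parameters are made up, and it does not handle escaped or doubled quotes inside a field.

```python
def split_outside_quotes(line, delimiter=",", quote='"'):
    """Split on delimiters that are not inside a quoted section.

    Visits each character once, so the cost grows linearly with the
    input, and raises ValueError when the quotes are unbalanced.
    """
    fields = []
    start = 0
    in_quotes = False
    for i, ch in enumerate(line):
        if ch == quote:
            # Toggle the state on every quote character.
            in_quotes = not in_quotes
        elif ch == delimiter and not in_quotes:
            fields.append(line[start:i])
            start = i + 1
    if in_quotes:
        raise ValueError("unbalanced quote in input")
    # The last field runs to the end of the line (possibly empty).
    fields.append(line[start:])
    return fields

print(split_outside_quotes('0,,"DANGER","Corrosive, Flammable"'))
# ['0', '', '"DANGER"', '"Corrosive, Flammable"']
```

The quotes are kept on the returned fields, matching the output requested earlier in the thread.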
 