Iterating over the characters in a string

Carlo Razzeto · Sep 23, 2004

Hello, I have a question about .NET string manipulation, specifically about
iterating over the individual characters in a string. I have a CSV parser
that successfully parses quoted CSV files; the only issue is that it leaves
the leading and trailing quotes intact. Before I go on, I do realize that
since it's a CSV I could do stringval.replace( "\"", "" ); but I wanted to
take the chance to learn how to iterate over string values. Anyway, what I
had written originally to do this was:

if ( stringval[0] == '"' ) {
stringval = stringval.substring( 1, ( stringval.length - 1 ) );
}
if( stringval[( stringval.length - 1 )] == '"' ) {
stringval = stringval.substring( 0, ( stringval.length - 2 ) );
}

I was never stripping off the last " and I realize now that the problem
has to do with .NET storing strings in Unicode, which allows character
pairs to represent a single character. So my question is: how does one
iterate over the character values in a string and replace a value if
necessary?

Carlo

Richard Blewett [DevelopMentor] · Sep 23, 2004

Does this demonstrate what you want to do?

using System;
using System.Text;

class App
{
    static void Main(string[] args)
    {
        string s = "\"hello\",\"world\",\"the\",\"quick\",\"brown\",\"fox\"";

        // Split on both quotes and commas; adjacent delimiters yield empty strings.
        string[] bits = s.Split(new char[]{'\"', ','});
        Console.WriteLine(s);

        StringBuilder sb = new StringBuilder();

        foreach( string b in bits )
        {
            // Skip the empty pieces left between delimiters.
            if( b.Length != 0 )
            {
                Console.WriteLine(b);
                sb.Append(b);
            }
        }

        Console.WriteLine(sb.ToString());
    }
}

Regards

Richard Blewett - DevelopMentor
http://staff.develop.com/richardb/weblog

[microsoft.public.dotnet.framework]

Dmitriy Lapshin [C# / .NET MVP] · Sep 23, 2004

Hi,

Use the System.Globalization.StringInfo class to iterate over Unicode
characters.
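
A minimal sketch of that suggestion, assuming the input may contain surrogate pairs. StringInfo.GetTextElementEnumerator is the framework call; the TextElements class and its Split method are hypothetical names used only for illustration.

```csharp
using System;
using System.Collections.Generic;
using System.Globalization;

static class TextElements
{
    // Returns each text element of s (roughly, what a reader would call one character).
    public static List<string> Split(string s)
    {
        var elements = new List<string>();
        TextElementEnumerator e = StringInfo.GetTextElementEnumerator(s);
        while (e.MoveNext())
        {
            elements.Add(e.GetTextElement());
        }
        return elements;
    }

    static void Main()
    {
        // "\uD834\uDD1E" is a surrogate pair (U+1D11E) between two quotes and an 'a'.
        string s = "\"a\uD834\uDD1E\"";
        Console.WriteLine(s.Length);        // 5 UTF-16 code units
        Console.WriteLine(Split(s).Count);  // 4 text elements
    }
}
```

Indexing with s[i] walks UTF-16 code units, so the two counts differ only when the string holds surrogate pairs or combining sequences. For plain ASCII such as the quote character they are the same.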

Jon Skeet [C# MVP] · Sep 23, 2004

Although Unicode (UTF-16 in particular) allows surrogate pairs, I don't
think that's the real problem. What exactly are you seeing?

Note that your code as posted above will remove the character *before*
the final " as well.

Nick Malik · Sep 23, 2004

Hi Carlo,

Jon is right... your code, as posted, will convert ("Fubar") to (Fuba)

I, too, am confused by the error you are getting.

Also, when parsing, beware of simple solutions. In the CSV format, a
double-quote character can be embedded within a string. I believe it
appears twice, as in:
"The word ""misspelled"" is often spelled incorrectly"

(Not 100% certain about that, but my memory tells me that this is the case.
Also, commas can occur in the quoted string too, so Split() may not work
very well either.)
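
To make the point about embedded quotes and commas concrete, here is a rough sketch of a single-line field splitter. It assumes the doubled-quote convention described above and does not handle fields that span several lines; CsvSketch and SplitLine are hypothetical names, not part of the framework.

```csharp
using System;
using System.Collections.Generic;
using System.Text;

static class CsvSketch
{
    // Splits one CSV line into fields. Commas inside quotes are kept,
    // a doubled quote ("") inside a quoted field becomes one quote,
    // and the enclosing quotes are dropped.
    public static List<string> SplitLine(string line)
    {
        var fields = new List<string>();
        var current = new StringBuilder();
        bool inQuotes = false;

        for (int i = 0; i < line.Length; i++)
        {
            char c = line[i];
            if (inQuotes)
            {
                if (c == '"' && i + 1 < line.Length && line[i + 1] == '"')
                {
                    current.Append('"');   // escaped quote inside a quoted field
                    i++;                   // skip the second quote of the pair
                }
                else if (c == '"')
                {
                    inQuotes = false;      // closing quote
                }
                else
                {
                    current.Append(c);
                }
            }
            else if (c == '"')
            {
                inQuotes = true;           // opening quote
            }
            else if (c == ',')
            {
                fields.Add(current.ToString());
                current.Length = 0;        // start the next field
            }
            else
            {
                current.Append(c);
            }
        }
        fields.Add(current.ToString());    // last field has no trailing comma
        return fields;
    }

    static void Main()
    {
        foreach (string f in SplitLine("12345,\"Some, Text\",9/18/2003"))
        {
            Console.WriteLine(f);
        }
    }
}
```

Because the quotes are consumed while scanning, no separate stripping pass is needed afterwards.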

Good Luck,
--- Nick

Carlo Razzeto · Sep 24, 2004

CSV Example:
12345,"Some, Text",9/18/2003

Array Values:
array[0]=>12345
array[1]=>"Some, Text"
array[2]=>"9/18/2003

The challenge: remove the " character from the beginning and end of the
string (not using string.replace()). What doesn't seem to work:
if ( array[1][0] == '"' ) {
array[1] = array[1].Substring( 1, array[1].length - 1 );
}
if( array[1][( array[1].length - 1 )] == '"' ) {
array[1] = array[1].Substring( 0, array[1].length - 2 );
}

Jon Skeet [C# MVP] · Sep 24, 2004

As I said, your code is removing the final real character as well as
the ", because in your second bit you're asking it to copy
array[1].Length-2 characters rather than array[1].Length-1 characters.
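
A minimal sketch of that correction, relying on the String.Substring(startIndex, length) overload, whose second argument is a character count rather than an end index. StripEnds is a hypothetical helper name, and the empty-string checks are an added assumption about the input.

```csharp
using System;

static class SubstringDemo
{
    // Removes a leading quote and a trailing quote, each checked independently,
    // mirroring the two-step structure of the posted code.
    public static string StripEnds(string field)
    {
        if (field.Length > 0 && field[0] == '"')
        {
            // Skip index 0 and keep the remaining Length - 1 characters.
            field = field.Substring(1, field.Length - 1);
        }
        if (field.Length > 0 && field[field.Length - 1] == '"')
        {
            // The string is already one shorter here, so keep Length - 1
            // characters; Length - 2 would also drop the last real character.
            field = field.Substring(0, field.Length - 1);
        }
        return field;
    }

    static void Main()
    {
        Console.WriteLine(StripEnds("\"Some, Text\""));
    }
}
```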

Nick Malik · Sep 25, 2004

Hello Carlo,

You have three logic flaws in your code.
1) You are removing the ending quote even when the beginning quote is not
there. Logically, you don't want to do that.
2) You are removing the beginning quote even when the ending quote is not
there. This is also normally outside the correct behavior.
3) When removing the ending quote, you are also stripping off the last
character of the text string.

You also appear to be running into a situation where your output seems to
show a double-quote ending a field, but the code is not finding the
double-quote character in the last position of the string. This is
possible if the string contains OTHER characters, such as surrogate pairs
or combining characters, and the last position in the string is not, in
fact, a double-quote character. Look carefully at your data to see if this
is the situation you are running into, and add code to your system to
detect it.
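
One way to do that kind of inspection, as a rough sketch: print every UTF-16 code unit in the field with its index and hex value, so that whatever really sits in the last position is visible. CharDump and Describe are hypothetical names, and the trailing carriage return in the sample is only an assumed example of an invisible last character, not something confirmed from your data.

```csharp
using System;
using System.Text;

static class CharDump
{
    // Lists each UTF-16 code unit of s with its index and hex value,
    // flagging code units that belong to a surrogate pair.
    public static string Describe(string s)
    {
        var sb = new StringBuilder();
        for (int i = 0; i < s.Length; i++)
        {
            char c = s[i];
            sb.AppendFormat("[{0}] U+{1:X4}{2}", i, (int)c,
                            char.IsSurrogate(c) ? " (surrogate)" : "");
            sb.Append('\n');
        }
        return sb.ToString();
    }

    static void Main()
    {
        // Assumed sample: the closing quote is followed by a carriage return,
        // so a check of the last position would not see the quote.
        Console.Write(Describe("\"ab\"\r"));
    }
}
```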

The double quote character itself is not part of a surrogate pair as far as
I know.

First off, some good reading pointers for dealing with Unicode strings...
http://www.informit.com/guides/printerfriendly.asp?g=dotnet&seqNum=115

http://msdn.microsoft.com/library/d...balizationtextelementenumeratorclasstopic.asp

http://msdn.microsoft.com/library/d.../html/frlrfsystemstringclasscomparetopic3.asp

To address the logic problems I mentioned above, use code like this. I used
a temporary string variable to make the code easier to read. It only holds a
reference to the existing string, so it adds no real garbage collection
overhead. (Caveat: I did not compile this code in VS before posting)

for (int ct = 0; ct < array.Length; ct++)
{
    string hold = array[ct];
    // Strip the quotes only when the field both starts and ends with one.
    if ((hold.Length >= 2) && (hold[0] == '"') && (hold[hold.Length - 1] == '"'))
    {
        array[ct] = hold.Substring(1, hold.Length - 2);
    }
}

I hope this helps,
--- Nick Malik
