When is a string not a string?

Markus · Jan 3, 2004

Hello all,

I have the strange problem that a string isn't equal to
the same character sequence in my code. e.g.

1st assign:
String str="Hello"
then later:
(str=="Hello") results in false.

The actual sequence involves writing and reading to a file
and I'm hoping it contains some oversight:

I assign the value "Hello" to my string str. I use
StreamWriter.Write(str) to write to a file. A custom
function reads the file and parses it into words by using,
in essence:

int i=StreamReader.Read();
char c=(char)i;
str+=c;

When it is finished, the string appears to be
reconstructed. It has the same length and contains the
same sequence of individual characters, as far as I can
verify in the Immediate Command Window. Also,
String.Compare(str,"Hello") results in 0 (or equals), but
String.Equals(str,"Hello") results in false, and the
switch statement in my code won't assign the string to the
correct case (which is how I discovered the problem).

So what am I missing, some hidden encoding going on
somewhere?

Many thanks for any reponse,
Markus

Jon Skeet [C# MVP] · Jan 3, 2004

Markus said:
I have the strange problem that a string isn't equal to
the same character sequence in my code. e.g.

1st assign:
String str="Hello"
then later:
(str=="Hello") results in false.

Are you *actually* testing it with "Hello" or with some other string
which, for instance, contains accented characters?

The actual sequence involves writing and reading to a file
and I'm hoping it contains some oversight:

I assign the value "Hello" to my string str. I use
StreamWriter.Write(str) to write to a file. A custom
function reads the file and parses it into words by using,
in essence:

int i=StreamReader.Read();
char c=(char)i;
str+=c;

Firstly, that's a bad way of doing it - it's really inefficient as it
requires a new string for each character. You should use a
StringBuilder if you must, but better yet would be to use
StreamReader.ReadLine or StreamReader.ReadToEnd.

When it is finished, the string appears to be
reconstructed. It has the same length and contains the
same sequence of individual characters, as far as I can
verify in the Immediate Command Window. Also,
String.Compare(str,"Hello") results in 0 (or equals), but
String.Equals(str,"Hello") results in false, and the
switch statement in my code won't assign the string to the
correct case (which is how I discovered the problem).

So what am I missing, some hidden encoding going on
somewhere?

I suggest you execute the following - in code, not in the command
window:

Console.WriteLine (str.Length);
foreach (char c in str)
{
Console.WriteLine ("{0} {1:d}", c, c);
}

to print out the unicode value of each character.

If you could post a short but complete example which demonstrates the
problem, that would really help - see
http://www.pobox.com/~skeet/csharp/complete.html for what I mean by
that.

Markus · Jan 5, 2004

Thanks for the reply -

Through your recommendations, I figured out the problem -
it was is actually a bit stranger than I thought. There
were 2 separate issues.

I wrote a simple test program I could post and it didn't
duplicate the problem. Then I copied my code verbatim into
the test program and I discovered the problem with my
code - I was appending the character '\0' to terminate the
string. The String appends '\0' and tracks it as a new
character. So:
"h" has a Length of 1
while
"h\0" has a length of 2.

I am used to manually creating strings while parsing files
so appending '\0' was a force of habit (btw, I am trying
to parse words, not lines, and there's no equivalent to
fscanf that I know of, which is unfortunate; I could use
StringBuilder, will consider it for more intensive
parsing). I output everything to the console and it became
clear. Thanks for the tip.

The strange thing is, I couldn't debug the code in the
immediate window. The way strings are processed in this
window appears to be inconsistent with the same code in
executable format. This is why I didn't detect the problem
myself. You can try this quite easily with the following:

using System;

namespace ConsoleApplication1
{
class Class1
{
[STAThread]
static void Main(string[] args)
{
string s;
s="o\0";
Console.WriteLine("{0}",s.Length);
s=Console.ReadLine();
}
}
}

Break into this code at the "s=Console.ReadLine()".
Now in the immediate window, type in s.Length.

In the console window it appears as 2, in the immediate
window it appears as 1!

At any rate, thanks for your tips. Hope this helps anyone
else who encounters bizarre effects in the immediate
window - maybe Microsoft could consider fixing this.

Markus

Daniel Billingsley · Jan 7, 2004

Jon, based on my testing and research I'm going to have to object when
somebody says code like this is "bad". It all depends, particularly on the
number of concats and the overall length of the final string. In some cases
this code may actually be faster.

The "concatenation is always bad, stringbuilder is always better" dogma is a
debunked myth as far as I'm concerned.

I would still prefer an alternative like you suggest + a regex or .Split()
to separate the words though, even if it were at the expense of the slight
performance hit.

Jon Skeet [C# MVP] · Jan 7, 2004

Daniel Billingsley said:
Jon, based on my testing and research I'm going to have to object when
somebody says code like this is "bad". It all depends, particularly on the
number of concats and the overall length of the final string. In some cases
this code may actually be faster.

The "concatenation is always bad, stringbuilder is always better" dogma is a
debunked myth as far as I'm concerned.

In a loop where you don't know the length, it's always bad practice.
It's particularly bad practice when there are perfectly acceptable
alternatives which will read a block at a time.

Don't forget that there's more going on here than just the string
concatenation - there's the calls to Read as well, which may well have
more impact than the concatenation for relatively small loops.

The performance isn't really my main gripe though - it's the "smell" of
the code as Martin Fowler would put it. Why read a character at a time
from a stream, which involves casting (which is a smell to start with,
if not a terribly bad one) when you can read a whole block?

Alvin Bruney · Jan 7, 2004

be careful with split also in a loop. it is just as evil as string with
temporary objects.

Daniel Billingsley · Jan 8, 2004

well, he ultimately wants the words split up, so one way or another that's
the result.

Daniel Billingsley · Jan 8, 2004

I wasn't disagreeing with the block part, only the concat vs stringbuilder
part. I think the whole logic could be handled with two simple lines of
code as you suggest.. the read and the .Split (or regex maybe)... which is
better in my mind for readability even if it were a bit slower.

Which is to say, I agree the OP code was bad for that reason, but I'm
hesitant to call any code "bad" simply because it uses concatenation and not
stringbuilder as you did. That's all, really just expressing an opinion.

When is a string not a string?

Markus

Jon Skeet [C# MVP]

Markus

Daniel Billingsley

Jon Skeet [C# MVP]

Alvin Bruney

Daniel Billingsley

Daniel Billingsley