Nasty bug in string handling/comparison

Hi!

Perhaps someone from Microsoft could comment on this behaviour I've just
stumbled across:

String and StringBuilder have a problem with the German ß (szlig).
They assume that 'ß' is equal to 'ss', which is not true!

Calling "lassen".IndexOf("ß") yields 2 instead of -1,
a call to "Größe".IndexOf("ss") yields 3 instead of -1.
"lassen".CompareTo("laßen") returns 0!

String comparison is really essential, how can such a bug slip through?
And what to do about it?

Regards,
Martin Müller
 
Yields the same wrong result with CultureInfo.CurrentCulture (="de-DE") and
with CultureInfo.InvariantCulture.
:(

Thanks for the suggestion anyway...

Martin
 
IndexOf and CompareTo appear to be overloaded with CompareOptions arguments.
I suggest you try your test cases using CompareOptions.None and CompareOptions.Ordinal to see
what happens. I don't know if this will solve the problem, but it might
yield a clue. Your example is puzzling, to say the least.
 
I can't help you, but you could try asking in the
microsoft.public.dotnet.internationalization newsgroup.

Good luck,
Marc
 
First accept that it's not a bug - it's behaving as documented. From
the docs for String.IndexOf(string):

<quote>
This method performs a word (case-sensitive and culture-sensitive)
search using the current culture.
</quote>

Now, in order to perform a culture-insensitive search, you can use
CultureInfo.InvariantCulture.CompareInfo.IndexOf.
 
Dear Jon,

I've tried InvariantCulture's comparison as well and it yields the same
wrong results.
Neither CultureInfo should treat "lassen" and "laßen" as equal; it's
plainly wrong!

So it seems not only the comparison rules for de-DE are messed up but those
for InvariantCulture as well.

Regards,
Martin
 
AMercer said:
IndexOf and CompareTo appear to be overloaded with CompareOptions arguments.
I suggest you try your test cases using CompareOptions.None and CompareOptions.Ordinal to see
what happens. I don't know if this will solve the problem, but it might
yield a clue. Your example is puzzling, to say the least.
[...]
Thanks for your suggestion!

CompareOptions.None doesn't change the result.
CompareOptions.Ordinal results in "lassen" and "laßen" being recognized as
different strings.
Unfortunately, CompareOptions.Ordinal cannot be combined with anything else,
so a case insensitive comparison isn't possible this way (although there are
other ways around this, of course).

Regards,
mav
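
A minimal sketch of the test described above, for anyone who wants to reproduce it. It assumes the CompareInfo.IndexOf overload that takes CompareOptions; the class name EszettProbe is made up for illustration. The culture-sensitive result depends on the runtime's collation data, so no expected value is shown for it. The ordinal calls compare raw UTF-16 code units and should not match 'ß' against "ss".

```csharp
using System;
using System.Globalization;

public static class EszettProbe
{
    public static void Main()
    {
        CompareInfo invariant = CultureInfo.InvariantCulture.CompareInfo;

        // "\u00DF" is 'ß' and "\u00F6" is 'ö'; escapes avoid source-encoding issues.
        // Culture-sensitive search: the thread reports 2 here, but the value
        // depends on the runtime's collation rules.
        Console.WriteLine(invariant.IndexOf("lassen", "\u00DF", CompareOptions.None));

        // Ordinal search: code-unit comparison, so 'ß' never matches "ss".
        Console.WriteLine(invariant.IndexOf("lassen", "\u00DF", CompareOptions.Ordinal));      // -1
        Console.WriteLine(invariant.IndexOf("Gr\u00F6\u00DFe", "ss", CompareOptions.Ordinal)); // -1

        // Ordinal comparison sees two different strings.
        Console.WriteLine(string.CompareOrdinal("lassen", "la\u00DFen") == 0);                 // False
    }
}
```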
 
Martin Müller said:
I've tried InvariantCulture's comparison as well and it yields the same
wrong results.

Interesting...

Martin Müller said:
Neither CultureInfo should treat "lassen" and "laßen" as equal; it's
plainly wrong!

No, with the German culture I believe it's correct (at least in the
view of some Germans; I gather there's some disagreement on the
matter). A quick Google search doesn't give me a definitive answer on
this, but some sites *do* say that "ss" can be used where eszett isn't
available. I can certainly see an argument for allowing the current
rules, but it's one of those nasty internationalization things which is
going to cause problems whatever you do :(

Martin Müller said:
So it seems not only the comparison rules for de-DE are messed up but those
for InvariantCulture as well.

Very interesting - yes, it looks like you're right - or the rules for
ss and the eszett are invariant too.

If you want to do a "plain ordinal" IndexOf, however, you can use
CompareInfo.IndexOf(string, string, int, CompareOptions) and specify
CompareOptions.Ordinal. If you want to do this often, you could write a
short helper method for it.

I suspect there are loads of places with buggy code in this regard -
you could write:

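    int index = someString.IndexOf("boss=");
    if (index != -1)
    {
        // Assumes the match is exactly five characters long.
        string boss = someString.Substring(index + 5);
        ...
    }

- and end up missing the first character of the value when the input
contains "boß=", because the culture-sensitive match is then only four
characters long.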
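
The helper method suggested above might look roughly like the following. This is only a sketch: the names OrdinalSearch, IndexOfOrdinal and ReadBoss are invented here, and it assumes the CompareInfo.IndexOf(string, string, int, CompareOptions) overload mentioned in this post. An ordinal match always has the same length as the search string, so the offset arithmetic holds.

```csharp
using System;
using System.Globalization;

public static class OrdinalSearch
{
    // Hypothetical helper: a code-unit-by-code-unit IndexOf.
    public static int IndexOfOrdinal(string source, string value, int startIndex)
    {
        return CultureInfo.InvariantCulture.CompareInfo.IndexOf(
            source, value, startIndex, CompareOptions.Ordinal);
    }

    // Returns the text after "boss=", or null when the key is absent.
    public static string ReadBoss(string someString)
    {
        const string key = "boss=";
        int index = IndexOfOrdinal(someString, key, 0);
        if (index == -1)
        {
            return null;
        }
        // Safe: an ordinal match spans exactly key.Length characters.
        return someString.Substring(index + key.Length);
    }

    public static void Main()
    {
        Console.WriteLine(ReadBoss("name=x;boss=Martin"));        // Martin
        // "bo\u00DF=" is "boß=", which must not be taken for "boss=".
        Console.WriteLine(ReadBoss("bo\u00DF=Martin") == null);   // True
    }
}
```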
 
Jon Skeet said:
No, with the German culture I believe it's correct (at least in the
view of some Germans; I gather there's some disagreement on the
matter). A quick Google search doesn't give me a definitive answer on
this, but some sites *do* say that "ss" can be used where eszett isn't
available. I can certainly see an argument for allowing the current
rules, but it's one of those nasty internationalization things which is
going to cause problems whatever you do :(

As a native German speaker I can assure you that using ss instead of ß or
vice versa is not correct. There are rules where to use one or the other
(with current German spelling rules you still use ß after long vowels, for
example).
The only case where ss is used for ß is when a word is written in all caps,
because there's no capital ß. So it's either "GRUSS" or "GRUß" (greeting),
with the latter form mixing capital and non-capital letters.

Jon Skeet said:
Very interesting - yes, it looks like you're right - or the rules for
ss and the eszett are invariant too.

If you want to do a "plain ordinal" IndexOf, however, you can use
CompareInfo.IndexOf(string, string, int, CompareOptions) and specify
CompareOptions.Ordinal. If you want to do this often, you could write a
short helper method for it.

That's the way I chose now...

Jon Skeet said:
I suspect there are loads of places with buggy code in this regard -
you could write:

    int index = someString.IndexOf("boss=");
    if (index != -1)
    {
        string boss = someString.Substring(index + 5);
        ...
    }

- and end up missing the first character.

That's exactly what happened.
I wrote a search/replace for multiple files, and because I wanted an option
to do case insensitive replace, I couldn't just use string.Replace.
So I wrote my own replace function using IndexOf and ended up with lost
characters or additional 's', depending on the replacement string :(
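
For comparison, here is a rough sketch of a replace routine that avoids the lost or extra characters by searching ordinally. It is not the code from my tool: ReplaceOrdinal is a made-up name, and it assumes the String.IndexOf overload taking a StringComparison, which needs .NET 2.0 or later. With StringComparison.OrdinalIgnoreCase the match length always equals the length of the search string, so the copy offsets stay correct.

```csharp
using System;
using System.Text;

public static class OrdinalReplace
{
    // Hypothetical helper, not a framework API.
    public static string ReplaceOrdinal(string text, string oldValue,
                                        string newValue, bool ignoreCase)
    {
        if (string.IsNullOrEmpty(oldValue))
        {
            throw new ArgumentException("oldValue must not be empty", "oldValue");
        }

        StringComparison mode = ignoreCase
            ? StringComparison.OrdinalIgnoreCase
            : StringComparison.Ordinal;

        StringBuilder result = new StringBuilder(text.Length);
        int position = 0;
        while (true)
        {
            int match = text.IndexOf(oldValue, position, mode);
            if (match == -1)
            {
                break;
            }
            // Copy the unmatched text, then the replacement.
            result.Append(text, position, match - position);
            result.Append(newValue);
            // An ordinal match spans exactly oldValue.Length characters.
            position = match + oldValue.Length;
        }
        // Copy whatever follows the last match.
        result.Append(text, position, text.Length - position);
        return result.ToString();
    }

    public static void Main()
    {
        Console.WriteLine(ReplaceOrdinal("Lassen und lassen", "LASS", "fass", true)); // fassen und fassen
        // "Gr\u00F6\u00DFe" is "Größe"; "ss" must not match the 'ß'.
        Console.WriteLine(ReplaceOrdinal("Gr\u00F6\u00DFe", "ss", "x", true) == "Gr\u00F6\u00DFe"); // True
    }
}
```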
 
Martin Müller said:
As a native German speaker I can assure you that using ss instead of ß or
vice versa is not correct. There are rules where to use one or the other
(with current German spelling rules you still use ß after long vowels, for
example).

Yes, but are there not also cases where you'd normally use the eszett
but you use "ss" if eszett is not available? I think that's where it's
aimed.

This is the example I usually hear about when it comes to
culture-sensitive searching, so it must be fairly common.

Martin Müller said:
The only case where ss is used for ß is when a word is written in all caps,
because there's no capital ß. So it's either "GRUSS" or "GRUß" (greeting),
with the latter form mixing capital and non-capital letters.

Or when the medium you're writing with doesn't contain eszett...

Martin Müller said:
That's exactly what happened.
I wrote a search/replace for multiple files, and because I wanted an option
to do case insensitive replace, I couldn't just use string.Replace.
So I wrote my own replace function using IndexOf and ended up with lost
characters or additional 's', depending on the replacement string :(

Right. It really is a nasty potential bug in a lot of code, I suspect.
It *is* doing what the method claims to, *if* you believe that ss
should match eszett (which I know you don't, but I believe others do),
but I suspect most people wouldn't expect it. As it happens, it
explains something someone else was seeing in the C# newsgroup...
 