certain unicode characters seem to freak out IndexOf

Mark · Apr 9, 2008

Hi...

One of my coworkers recently discovered that the presence of certain unicode
characters in a string seems to freak out String.IndexOf() and was wondering
why that is. He wrote a little tester program that tested a big chunk of the
range and found some 4000 of them, but I'll just post a little sample with
one here:

char cu = '\u2E01';
string s = cu.ToString();
string s1 = "A " + s;

int idx = s1.IndexOf(s); // returns 0, which looks wrong (expected 2)
idx = s1.IndexOf(cu); // returns 2

So to sum up, when you hit upon one of these characters IndexOf(char)
returns the right answer but IndexOf(string) - consisting of just the same
char - returns the wrong answer.

Can anyone shed any light on that?

Thanks
Mark

Jeroen Mostert · Apr 9, 2008

Mark said:
One of my coworkers recently discovered that the presence of certain unicode
characters in a string seems to freak out String.IndexOf() and was wondering
why that is. He wrote a little tester program that tested a big chunk of the
range and found some 4000 of them, but I'll just post a little sample with
one here:

char cu = '\u2E01';
string s = cu.ToString();
string s1 = "A " + s;

int idx = s1.IndexOf(s); // returns 0, which looks wrong (expected 2)
idx = s1.IndexOf(cu); // returns 2

So to sum up, when you hit upon one of these characters IndexOf(char)
returns the right answer but IndexOf(string) - consisting of just the same
char - returns the wrong answer.

Actually, U+2E01 (RIGHT ANGLE DOTTED SUBSTITUTION MARKER) in string form is
matched at the beginning of *any* non-empty string: "Test".IndexOf("\u2e01")
will return 0. "Test".IndexOf("\u2e01s") will return 2, matching "s" as if
the other character wasn't there.

.NET simply considers these characters "empty" in strings under English
collation rules. Whether that's an error in the tables or by design is
another matter. I'm guessing the latter.

Of course, the problem disappears if you use .IndexOf(..,
StringComparison.Ordinal), because that doesn't use culture-specific
searches and just searches for that exact character, which is exactly what
String.IndexOf(Char) does as well.
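
To make the difference concrete, here is a minimal sketch that runs the same search both ways. The class and method names are made up for illustration, and the culture-sensitive result is deliberately not hard-coded: the 0 reported in this thread depends on the runtime, its collation tables and the current culture, so it may differ on your machine.

```csharp
using System;

static class IndexOfDemo
{
    // Hypothetical helper: culture-sensitive search, which is what
    // IndexOf(string) does when no StringComparison is passed.
    public static int CultureIndexOf(string text, string value)
    {
        return text.IndexOf(value, StringComparison.CurrentCulture);
    }

    // Hypothetical helper: ordinal search, comparing raw UTF-16 code units.
    public static int OrdinalIndexOf(string text, string value)
    {
        return text.IndexOf(value, StringComparison.Ordinal);
    }

    static void Main()
    {
        string marker = "\u2E01";
        string text = "Test" + marker;

        // Result depends on the collation rules in effect; this thread
        // reports 0 because the marker is treated as ignorable.
        Console.WriteLine("CurrentCulture: " + CultureIndexOf(text, marker));

        // Ordinal ignores collation: the marker follows the four
        // characters of "Test", so this prints 4.
        Console.WriteLine("Ordinal: " + OrdinalIndexOf(text, marker));
    }
}
```

If the search is meant to find an exact character sequence (parsing, tokenizing, protocol data), passing StringComparison.Ordinal explicitly is usually the safer choice.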

Which answer is "right" or "wrong" isn't that straightforward. For example,
"\u0061\u030a".IndexOf("\u00e5") will return 0 because both character
sequences match LATIN SMALL LETTER A WITH RING ABOVE, but
"\u0061\u030a".IndexOf('\u00e5') returns -1 because an ordinal comparison
does not match.
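
If you want ordinal speed and predictability but still want composed and decomposed forms to match, one option is to normalize both strings first. This is only a sketch of that idea, not something from the posts above. The helper name IndexOfNormalized is made up, and the index it returns refers to the normalized text, which may not line up with positions in the original string.

```csharp
using System;
using System.Text;

static class NormalizationDemo
{
    // Hypothetical helper: bring both strings to Unicode Normalization
    // Form C, then do an ordinal search on the normalized copies.
    public static int IndexOfNormalized(string text, string value)
    {
        string normalizedText = text.Normalize(NormalizationForm.FormC);
        string normalizedValue = value.Normalize(NormalizationForm.FormC);
        return normalizedText.IndexOf(normalizedValue, StringComparison.Ordinal);
    }

    static void Main()
    {
        string decomposed = "\u0061\u030A"; // 'a' followed by COMBINING RING ABOVE
        string composed = "\u00E5";         // LATIN SMALL LETTER A WITH RING ABOVE

        // Plain ordinal search: the code units differ, so this prints -1.
        Console.WriteLine(decomposed.IndexOf(composed, StringComparison.Ordinal));

        // After NFC normalization both strings are the single code unit
        // U+00E5, so this prints 0.
        Console.WriteLine(IndexOfNormalized(decomposed, composed));
    }
}
```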

That's Unicode for 'ya.
