certain unicode characters seem to freak out IndexOf

Mark

Hi...

One of my coworkers recently discovered that the presence of certain unicode
characters in a string seems to freak out String.IndexOf() and was wondering
why that is. He wrote a little tester program that tested a big chunk of the
range and found some 4000 of them, but I'll just post a little sample with
one here:

char cu = '\u2E01';
string s = cu.ToString();
string s1 = "A " + s;

int idx = s1.IndexOf(s); // returns 0 wrong! (expected 2)
idx = s1.IndexOf(cu); // returns 2

So to sum up, when you hit upon one of these characters IndexOf(char)
returns the right answer but IndexOf(string) - consisting of just the same
char - returns the wrong answer.

Can anyone shed any light on that?

Thanks
Mark
 
Mark said:
One of my coworkers recently discovered that the presence of certain unicode
characters in a string seems to freak out String.IndexOf() and was wondering
why that is. He wrote a little tester program that tested a big chunk of the
range and found some 4000 of them, but I'll just post a little sample with
one here:

char cu = '\u2E01';
string s = cu.ToString();
string s1 = "A " + s;

int idx = s1.IndexOf(s); // returns 0 wrong! (expected 2)
idx = s1.IndexOf(cu); // returns 2

So to sum up, when you hit upon one of these characters IndexOf(char)
returns the right answer but IndexOf(string) - consisting of just the same
char - returns the wrong answer.

Actually, U+2E01 (RIGHT ANGLE DOTTED SUBSTITUTION MARKER) in string form is
matched at the beginning of *any* non-empty string: "Test".IndexOf("\u2e01")
will return 0. "Test".IndexOf("\u2e01s") will return 2, matching "s" as if
the other character wasn't there.

.NET simply considers these characters "empty" in strings under English
collation rules. Whether that's an error in the tables or by design is
another matter. I'm guessing the latter.

Of course, the problem disappears if you use .IndexOf(..,
StringComparison.Ordinal), because that doesn't use culture-specific
searches and just searches for that exact character, which is exactly what
String.IndexOf(Char) does as well.
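
To make the ordinal workaround concrete, here is a minimal sketch based on the snippet from the original post. The class name OrdinalSearchDemo and the helper OrdinalIndex are made up for illustration. The culture-sensitive result is printed without an expected value because it depends on the runtime version and its collation data; only the ordinal results are fixed.

```csharp
using System;

static class OrdinalSearchDemo
{
    // Hypothetical helper: an ordinal search compares UTF-16 code units
    // one by one, with no culture-specific collation rules involved.
    public static int OrdinalIndex(string haystack, string needle) =>
        haystack.IndexOf(needle, StringComparison.Ordinal);

    static void Main()
    {
        string s = "\u2E01";
        string s1 = "A " + s;

        // Culture-sensitive search: the result depends on the runtime and
        // its collation tables, so no expected value is given here.
        Console.WriteLine(s1.IndexOf(s, StringComparison.CurrentCulture));

        // Ordinal search by string and search by char both look for the
        // exact code unit, which sits at index 2 in "A " + s.
        Console.WriteLine(OrdinalIndex(s1, s));   // 2
        Console.WriteLine(s1.IndexOf('\u2E01'));  // 2
    }
}
```

Passing the StringComparison explicitly on every call also documents the intent, so nobody has to remember which overload defaults to which behavior.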

Which answer is "right" or "wrong" isn't so straightforward, though. For example,
"\u0061\u030a".IndexOf("\u00e5") will return 0 because both character
sequences match LATIN SMALL LETTER A WITH RING ABOVE, but
"\u0061\u030a".IndexOf('\u00e5') returns -1 because an ordinal comparison
does not match.
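
If you want ordinal speed and predictability but still need the two spellings of that letter to match, one possible approach (my suggestion, not something from the posts above) is to normalize both strings to the same Unicode normalization form before searching. A rough sketch, with NormalizedSearchDemo and NormalizedOrdinalIndex as made-up names:

```csharp
using System;
using System.Text;

static class NormalizedSearchDemo
{
    // Hypothetical helper: bring both strings to Normalization Form C
    // (composed), then search ordinally. The returned index refers to the
    // normalized haystack, not necessarily to the original string.
    public static int NormalizedOrdinalIndex(string haystack, string needle)
    {
        string h = haystack.Normalize(NormalizationForm.FormC);
        string n = needle.Normalize(NormalizationForm.FormC);
        return h.IndexOf(n, StringComparison.Ordinal);
    }

    static void Main()
    {
        string decomposed = "\u0061\u030a"; // 'a' followed by COMBINING RING ABOVE
        string composed = "\u00e5";         // LATIN SMALL LETTER A WITH RING ABOVE

        // A plain ordinal search sees different code units and finds nothing.
        Console.WriteLine(decomposed.IndexOf(composed, StringComparison.Ordinal)); // -1

        // After normalization both strings are the single code unit U+00E5.
        Console.WriteLine(NormalizedOrdinalIndex(decomposed, composed)); // 0
    }
}
```

This only handles canonical equivalence. It does nothing about characters that the collation tables treat as ignorable, which is the other effect described in this thread.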

That's Unicode for ya.