C# does not support Unicode characters

Guest · Mar 2, 2004

Is it correct that Unicode characters with code points above 0x10FFFF are not supported by C#?

I have a hard time believing this since it would eliminate some Asian languages. If it is true, is there a workaround? Do other .NET languages support code points > 0x10FFFF?

I appreciate any comments.
Thanks,
Johannes

Julie · Mar 2, 2004

Unicode is defined as 16-bits (max of 0xFFFF).

Julie · Mar 2, 2004

And, yes, C# (natively) supports Unicode.

"The string type represents a string of Unicode characters. string is an alias
for System.String in the .NET Framework."

Jon Skeet [C# MVP] · Mar 2, 2004

Julie said:
Unicode is defined as 16-bits (max of 0xFFFF).

No, it's not. The Basic Multilingual Plane (plane 0) is 64K, but
Unicode is more than that. This is unfortunate as it means we need
surrogate characters etc to cope with systems designed around the 64K
limit.

See http://www.cl.cam.ac.uk/~mgk25/unicode.html for more information.

Mihai N. · Mar 2, 2004

Guest said:
Is it correct that Unicode characters with code points above 0x10FFFF are not supported by C#?

There are no Unicode characters above 0x10FFFF.

C# may have a problem with characters above 0xFFFF, since the internal
representation is UTF-16. Characters from 0x10000 to 0x10FFFF are
represented using surrogates, and some .NET APIs may be inaccurate
(string length, iteration "by char", and others in the same class).
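
A minimal sketch of the kind of inaccuracy meant here, assuming a character outside the BMP such as U+1D11E (MUSICAL SYMBOL G CLEF). The class and method names are made up for illustration; only the System.Char and System.String members are real .NET API.

```csharp
using System;

static class SurrogateDemo
{
    // Number of UTF-16 code units, which is what string.Length reports.
    public static int CodeUnits(string s)
    {
        return s.Length;
    }

    // Number of Unicode code points, counting each surrogate pair once.
    public static int CodePoints(string s)
    {
        int count = 0;
        for (int i = 0; i < s.Length; i++)
        {
            if (char.IsSurrogatePair(s, i))
            {
                i++; // skip the low surrogate of the pair
            }
            count++;
        }
        return count;
    }

    public static void Main()
    {
        // One character as far as Unicode is concerned, two chars in .NET.
        string clef = char.ConvertFromUtf32(0x1D11E);

        Console.WriteLine(CodeUnits(clef));                            // 2
        Console.WriteLine(CodePoints(clef));                           // 1
        Console.WriteLine(char.ConvertToUtf32(clef, 0).ToString("X")); // 1D11E
    }
}
```

So the data itself is stored fine; it is the "one char = one character" assumption that breaks.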

Julie · Mar 2, 2004

You are correct, sir. I wasn't aware of the change in Unicode v3.

Julie · Mar 2, 2004

In light of the Unicode v3 changes that I just became aware of, I retract all
that I've said on the subject in this thread.

Guest · Mar 6, 2004

Thanks for all your responses. It's all clear to me now.

UTF-16, the internal representation of Unicode in the .NET Framework, permits code points up to 10FFFF, which does cover all languages, including Asian languages.

The misunderstanding was caused by a syntax error in my code. I was using [\u000000-\u10FFFF] to indicate a range in the character class of a regular expression, which is simply the wrong notation (matches 0-FFFF). The correct notation uses an upper-case U, as in [\U00000000-\U0010FFFF]. The C# Language Specification is very clear about this (section Grammar, C1.5). Maybe I will read it after all.
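
For anyone hitting the same thing, here is a small sketch of how the C# compiler reads the two escape forms in an ordinary string literal: \u takes exactly four hex digits, \U takes exactly eight. The class and field names are made up for illustration.

```csharp
using System;

static class EscapeDemo
{
    // \u consumes exactly four hex digits, so this is U+10FF followed by "FF".
    public const string ShortForm = "\u10FFFF";

    // \U consumes eight hex digits; U+10FFFF is stored as a surrogate pair.
    public const string LongForm = "\U0010FFFF";

    public static void Main()
    {
        Console.WriteLine(ShortForm.Length);                              // 3
        Console.WriteLine((int)ShortForm[0] == 0x10FF);                   // True
        Console.WriteLine(ShortForm.Substring(1));                        // FF
        Console.WriteLine(LongForm.Length);                               // 2
        Console.WriteLine(char.IsSurrogatePair(LongForm, 0));             // True
        Console.WriteLine(char.ConvertToUtf32(LongForm, 0) == 0x10FFFF);  // True
    }
}
```

The hex digits left over after a \u escape are simply taken as ordinary characters, which is why the lower-case form did not describe the range I had in mind.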

Johannes
