C# does not support Unicode?


Guest

Is it correct that Unicode characters with code points above 0x10FFFF are not supported by C#?

I have a hard time believing this since it would eliminate some Asian languages. If it is true, is there a workaround? Do other .NET languages support code points > 0x10FFFF?

I appreciate any comments.
Thanks
Johannes
 
Check the docs on the char keyword and the Char struct. The range is 0x0000 to
0xFFFF (a 16-bit number).

--
William Stacey, MVP

Johannes said:
Is it correct that Unicode characters with code points above 0x10FFFF are not supported by C#?

I have a hard time believing this since it would eliminate some Asian
languages. If it is true, is there a workaround? Do other .NET languages
support code points > 0x10FFFF?
 
I believe .NET Framework 1.0 and 1.1 are limited to UTF-16 at most (Unicode
version 2.0).
However, the next version looks like it will support up to UTF-32.

From "2.4.1 Unicode character escape sequences"

"A Unicode escape sequence represents the single Unicode character formed
by the hexadecimal number following the "\u" or "\U" characters. Since C#
uses a 16-bit encoding of Unicode code points in characters and string
values, a Unicode character in the range U+10000 to U+10FFFF is not
permitted in a character literal and is represented using a Unicode
surrogate pair in a string literal. Unicode characters with code points
above 0x10FFFF are not supported."
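To make the quoted rule concrete, here is a minimal sketch of how the two escape forms behave in string literals. It assumes .NET Framework 2.0 or later (for char.ConvertToUtf32 and char.ConvertFromUtf32) and a reasonably recent compiler; the class name SurrogateDemo and the helper CodeUnits are made up for illustration.

```csharp
using System;

static class SurrogateDemo
{
    // Number of UTF-16 code units (chars) in the string, not code points.
    public static int CodeUnits(string s) => s.Length;

    static void Main()
    {
        string bmp = "\u00E9";         // U+00E9 fits in a single 16-bit char
        string astral = "\U00010000";  // U+10000 is stored as a surrogate pair

        Console.WriteLine(CodeUnits(bmp));                               // 1
        Console.WriteLine(CodeUnits(astral));                            // 2
        Console.WriteLine(char.IsSurrogatePair(astral, 0));              // True
        Console.WriteLine(char.ConvertToUtf32(astral, 0).ToString("X")); // 10000

        // The highest code point, U+10FFFF, is also just one surrogate pair.
        Console.WriteLine(CodeUnits(char.ConvertFromUtf32(0x10FFFF)));   // 2
    }
}
```

A char literal such as '\U00010000' does not compile, which matches the "not permitted in a character literal" wording in the spec text above.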
 
Johannes said:
Is it correct that Unicode characters with code points above 0x10FFFF
are not supported by C#?

Which code points are those? You'll have a harder time supporting
characters over 0xffff in .NET as you need surrogate pairs, etc, but I
*thought* everything was within 0-0x10ffff still. (That does, after
all, give a pretty huge scope.) Has that situation changed?

Johannes said:
I have a hard time believing this since it would eliminate some Asian
languages. If it is true, is there a workaround? Do other .NET
languages support code points > 0x10FFFF?

It's not really a language issue - .NET itself represents the character
type as a 16-bit entity, so to display Unicode characters outside plane
0 you need to use surrogates and check that whatever you're using to
display them (etc) supports surrogates properly. C# has the \U (as
opposed to \u) escaping for characters above 0xffff, within strings -
and those are then represented as a surrogate pair. That's the only
specific language support I know of in C# for characters outside plane
0, but I would imagine it's probably enough. Most of the work needs to
be done by .NET itself.
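As a rough illustration of the "most of the work needs to be done by .NET" point, the sketch below walks a string by code point rather than by char, combining each surrogate pair into one value. It assumes .NET Framework 2.0 or later; the class CodePoints and its Enumerate method are hypothetical names, not part of the framework.

```csharp
using System;
using System.Collections.Generic;

static class CodePoints
{
    // Returns the Unicode code points in s, treating each valid
    // surrogate pair as a single code point.
    public static List<int> Enumerate(string s)
    {
        var result = new List<int>();
        for (int i = 0; i < s.Length; i++)
        {
            if (char.IsSurrogatePair(s, i))
            {
                result.Add(char.ConvertToUtf32(s, i));
                i++; // skip the low surrogate that was just consumed
            }
            else
            {
                // BMP character (or a lone surrogate, passed through as-is)
                result.Add(s[i]);
            }
        }
        return result;
    }

    static void Main()
    {
        string text = "a\U0001D11Eb"; // 'a', U+1D11E, 'b'
        Console.WriteLine(text.Length);            // 4 chars
        Console.WriteLine(Enumerate(text).Count);  // 3 code points
        foreach (int cp in Enumerate(text))
            Console.WriteLine("U+" + cp.ToString("X4"));
    }
}
```

Whether such characters then display correctly still depends on the font and the rendering component, as noted above.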
 
Thanks for all your responses. It's all clear to me now.

UTF-16 - the internal representation of Unicode in the .NET Framework - permits code points up to 10FFFF, which does cover all languages, including Asian languages.

The misunderstanding was caused by a syntax error in my code. I was using [\u000000-\u10FFFF] to indicate a range in the character class of a regular expression, which is simply the wrong notation: \u takes exactly four hex digits, so the extra digits are read as literal characters and the range no longer ends at 10FFFF. The correct notation uses upper-case U, as in [\U00000000-\U0010FFFF]. The C# Language Specification is very clear about this (section Grammar, C1.5). Maybe I will read it after all..
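For anyone hitting the same thing, here is a small sketch of the difference between the two escapes. One caveat, as far as I understand it: Regex works on UTF-16 code units, so a \U escape in a non-verbatim pattern string reaches the regex engine as two surrogate chars, not as one code point. The pattern below, which matches a high surrogate followed by a low surrogate, is one common way to match a character above U+FFFF. It is only an illustration, and the names EscapeDemo and CountSupplementary are made up.

```csharp
using System;
using System.Text.RegularExpressions;

static class EscapeDemo
{
    // One supplementary character = high surrogate followed by low surrogate.
    // Verbatim string, so the regex engine itself interprets the \u escapes.
    const string SupplementaryPattern = @"[\uD800-\uDBFF][\uDC00-\uDFFF]";

    // Counts characters above U+FFFF in s.
    public static int CountSupplementary(string s) =>
        Regex.Matches(s, SupplementaryPattern).Count;

    static void Main()
    {
        // \u consumes exactly four hex digits: U+10FF followed by "FF".
        Console.WriteLine("\u10FFFF".Length);    // 3

        // \U consumes eight hex digits: U+10FFFF as a surrogate pair.
        Console.WriteLine("\U0010FFFF".Length);  // 2

        Console.WriteLine(CountSupplementary("a\U0001D11Eb\U0010FFFF")); // 2
    }
}
```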

Johannes
 