UTF32 CodePoints, UTF8 Combining Chars / Surrogate Pairs, and .NET


Chris Mullins

I've spent a bit of time over the last year trying to implement RFC 3454
(Preparation of Internationalized Strings, aka 'StringPrep').

This RFC is also a dependency for RFC 3491 (Internationalized Domain Names /
IDNA) which is something that I also need to support.

The problem that I've been struggling with in .NET is that of Unicode code
points > 0xFFFF. These points are encoded into UTF8 using the surrogate pair
encoding scheme defined in section 3.7 of the Unicode Spec
(http://www.unicode.org/book/ch03.pdf).

Related to surrogate pairs is the whole set of Unicode combining
characters.

The problem, then, is this:

When I iterate over a string using the .NET StringInfo class I get a set of
graphemes. These graphemes correctly handle the combining characters and
surrogate pairs, and end up giving me a single UTF-32 Code Point for each
grapheme.

BUT, let's say the original string had U+10FF1 encoded as a UTF8 surrogate
pair. This character is illegal in a particular stringprep profile.

The original string also had a combining character sequence U+0301 + U+0302
(for example), and the grapheme that the StringInfo class reports for this
is also U+10FF1.

The problem is that each of the combining characters IS legal in the
stringprep profile, but I have no way of telling if the original data was
the (illegal) UTF-32 code point, or the (legal) combining characters.
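
To make the behaviour concrete, here is a minimal sketch (hypothetical and
purely illustrative - the module name and sample characters are not from the
original post) of a combining sequence coming back from StringInfo as a
single text element:

Imports System.Globalization

Module TextElementDemo
    Sub Main()
        ' "e" followed by COMBINING ACUTE ACCENT (U+0301): two code points,
        ' but StringInfo reports them as a single text element (grapheme).
        Dim s As String = "e" & ChrW(&H301)
        Dim tee As TextElementEnumerator = StringInfo.GetTextElementEnumerator(s)
        While tee.MoveNext()
            Dim element As String = tee.GetTextElement()
            Console.WriteLine("Text element with {0} chars", element.Length) ' prints 2
        End While
    End Sub
End Module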

Has anyone implemented any of this stuff in .NET?
 
Hello Chris,

I am not familiar with international programming and Unicode topics. I will
forward your question to our internal team to see whether they have any
comments on it.

At the same time, if any community member has any idea, please feel free to
share here for further discussion. :)

Thanks.

Best regards,
Yanhong Huang
Microsoft Community Support

Get Secure! - www.microsoft.com/security
This posting is provided "AS IS" with no warranties, and confers no rights.
 
Hi Chris,

Here is the response that I got from our Windows Globalization Software
Design Engineer.

----------------

A few corrections:

1) Surrogate code units are illegal in UTF-32 (only full code points are
acceptable).
2) Surrogate code units are also illegal in UTF-8 (only the 4-byte form of
supplementary characters is acceptable).

For the above, it is legal to accept them if a process wants to for
backcompat reasons, but it is completely illegal for a conformant process
to emit them.
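
As an illustrative aside (a minimal sketch, assuming the framework's
standard UTF8Encoding behaviour; the module name is just for illustration),
encoding a supplementary character such as U+10000 from a .NET string
produces the legal four-byte form, not two encoded surrogates:

Imports System.Text

Module Utf8FormDemo
    Sub Main()
        ' U+10000 stored in a .NET string as the UTF-16 surrogate pair D800 DC00
        Dim s As String = ChrW(&HD800) & ChrW(&HDC00)
        Dim bytes As Byte() = Encoding.UTF8.GetBytes(s)
        ' Expected: F0-90-80-80 (the four-byte form),
        ' not ED-A0-80-ED-B0-80 (two three-byte encoded surrogates)
        Console.WriteLine(BitConverter.ToString(bytes))
    End Sub
End Module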

Note that grapheme clusters (called Text Elements in .NET) are not always
representable as single UTF-32 code points (there are many composite forms
that have no precomposed form in Unicode, since precomposed forms are only
added to Unicode for backcompat reasons).

So, given the above (which seems to contradict your problem description in
several places), what is the question, exactly?
------------------

Thanks very much.

Best regards,
Yanhong Huang
Microsoft Community Support

Get Secure! - www.microsoft.com/security
This posting is provided "AS IS" with no warranties, and confers no rights.
 
I'll try to post a clearer explanation.

I've got Unicode code points that are larger than 0xFFFF - these are encoded
(according to the Unicode Spec) into UTF-8 using the surrogate pair
algorithm. The .NET UTF8 implementation seems to handle surrogate pairs
quite well - if I encode surrogate pairs into UTF8 and iterate over them
using the StringInfo methods, I get the proper graphemes.

Point 2 - "Surrogate code units are illegal in UTF8" - doesn't make sense to
me. The Unicode spec makes no mention of this that I can find, and I see no
other way of encoding 32 bit codepoints into UTF8.

At the end of the day my question is this: I need a full implementation of
RFC 3454 (aka 'StringPrep', ftp://ftp.isi.edu/in-notes/rfc3454.txt) that
works with all codepoints specified in the RFC (many of which are 32 bit
codepoints). Is it possible to implement this RFC in .NET?

For example - one of the steps in StringPrep is to compare all the
codepoints in the string against the various tables of "illegal characters".
If any of the codepoints in the string matches one of the illegal
characters, the string has failed StringPrep. Many of these codepoints
require a 32 bit representation, hence in UTF8 they must be encoded as
surrogate pairs.

So far the only way I've been able to look at the Surrogate Pairs as a
single code point has been the following code:

Dim si As New System.Globalization.StringInfo
Dim myTEE As System.Globalization.TextElementEnumerator = _
    si.GetTextElementEnumerator(stringToTest)
myTEE.Reset()
While myTEE.MoveNext()
    Dim CodePoint As Integer
    Dim grapheme As String = myTEE.GetTextElement()
    If grapheme.Length > 1 Then
        ' Assume a multi-char text element is a surrogate pair and combine it
        Dim uc As Char = grapheme.Chars(0)
        Dim lc As Char = grapheme.Chars(1)
        CodePoint = ((AscW(uc) - &HD800) * &H400) + AscW(lc) - &HDC00 + &H10000
    Else
        CodePoint = AscW(grapheme)
    End If

    If ResourcePrepTables.ContainsKey(CodePoint) Then Return False
End While

In the code above, ResourcePrepTables is a hash table with all of the
"illegal" characters (represented as Int32) stored in it. The CodePoint
algorithm is taken from Section 3.7 of http://www.unicode.org/book/ch03.pdf

My algorithm is obviously flawed, as it's using graphemes rather than code
points, so things like combining characters are falling through the cracks -
but I am at a loss to determine any other way to do it.

Please, please, suggest a viable alternative!
 
Chris Mullins said:
I'll try to post a clearer explanation.

I've got Unicode code points that are larger than 0xFFFF - these are encoded
(according to the Unicode Spec) into UTF-8 using the surrogate pair
algorithm. The .NET UTF8 implementation seems to handle surrogate pairs
quite well - if I encode surrogate pairs into UTF8 and iterate over them
using the StringInfo methods, I get the proper graphemes.

That just means that the UTF-8 encoder can correctly detect that a
surrogate pair is present, work out the UCS-4 character represented,
and correctly encode it.
Point 2 - "Surrogate code units are illegal in UTF8" - doesn't make sense to
me. The Unicode spec makes no mention of this that I can find, and I see no
other way of encoding 32 bit codepoints into UTF8.

It *does* make sense. "Surrogate pair" is a UTF-16 concept. Instead of
encoding a surrogate pair as two separate UTF-16 values, UTF-8 considers
the single UCS-4 character being represented and encodes that 32 bit
value.

See http://www.cl.cam.ac.uk/~mgk25/unicode.html#utf-8 for more
information.

I'm afraid I don't have any more information for you than that :(
 
[UTF-8 Encoding]
See http://www.cl.cam.ac.uk/~mgk25/unicode.html#utf-8 for more
information.

I'm afraid I don't have any more information for you than that :(

I guess what I really want is to end up with the UTF-32 encoding of the
string, and then scan that for the appropriate code points.

The issue I keep coming back to is that I can't scan the string by using
the "char" array that underlies it, as that won't let me properly see the
code points that are over 0xFFFF.

I originally thought scanning using the StringInfo class would solve this
(and at first blush, it appeared to), by giving me the Graphemes, which I
could then turn into CodePoints, but this turns out to be wrong in a fair
number of cases.

Microsoft has to have implemented this stuff somewhere - IDNA and several
other Internet related RFCs have this as a requirement - but so far I can't
find anything on it...
 
I keep making a terminology mistake:

The byte streams that I pull off a socket are UTF-8 encoded strings. I then
put these into .NET Strings, so now they're UTF-16 encoded.

This means when I perform scanning (using the StringInfo methods), I'm
actually scanning UTF-16 strings, and the entire surrogate pair madness
really is there. Sorry for the poor terminology on my part.
 
Chris Mullins said:
I keep making a terminology mistake:

The byte streams that I pull off a socket are UTF-8 encoded strings. I then
put these into .NET Strings, so now they're UTF-16 encoded.

This means when I perform scanning (using the StringInfo methods), I'm
actually scanning UTF-16 strings, and the entire surrogate pair madness
really is there. Sorry for the poor terminology on my part.

Well, you can fairly easily convert the UTF-16 strings into UTF-32
sequences of ints (or uints) just by detecting surrogates yourself and
converting them - that part is fairly straightforward, and I could
probably help you with it if you want.

Whether that will help you to do what you want after that, I'm not
sure. Let me know if you want a UTF-16 -> UTF-32 routine though.
 
Jon Skeet said:
Well, you can fairly easily convert the UTF-16 strings into UTF-32
sequences of ints (or uints) just by detecting surrogates yourself and
converting them - that part is fairly straightforward, and I could
probably help you with it if you want.

Whether that will help you to do what you want after that, I'm not
sure. Let me know if you want a UTF-16 -> UTF-32 routine though.

If you have the code handy that would turn a UTF-16 string into an array of
code points, I would appreciate it.

I will also need something that'll take a sequence of UTF-32 characters and
convert them back into a UTF-16 string.

(one of the steps in StringPrep is a "replace" step, where certain code
points are replaced with other code points - this is essentially a really
fancy .ToLower() algorithm...)
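
A minimal, untested sketch of both directions (hypothetical helpers, using
the same surrogate arithmetic as the earlier snippet) might look like this:

Function ToCodePoints(ByVal s As String) As Integer()
    ' Walk the UTF-16 string one code point at a time, combining surrogate pairs.
    Dim result As New System.Collections.ArrayList
    Dim i As Integer = 0
    While i < s.Length
        Dim hi As Integer = AscW(s.Chars(i))
        Dim lo As Integer = -1
        If i + 1 < s.Length Then lo = AscW(s.Chars(i + 1))

        If hi >= &HD800 AndAlso hi <= &HDBFF AndAlso _
           lo >= &HDC00 AndAlso lo <= &HDFFF Then
            ' High + low surrogate: one supplementary code point (Unicode section 3.7)
            result.Add(((hi - &HD800) * &H400) + (lo - &HDC00) + &H10000)
            i += 2
        Else
            result.Add(hi)   ' BMP code point (unpaired surrogates pass through as-is)
            i += 1
        End If
    End While
    Return DirectCast(result.ToArray(GetType(Integer)), Integer())
End Function

Function FromCodePoints(ByVal codePoints As Integer()) As String
    ' Rebuild a UTF-16 string, emitting surrogate pairs for anything above U+FFFF.
    Dim sb As New System.Text.StringBuilder
    For Each cp As Integer In codePoints
        If cp > &HFFFF Then
            Dim v As Integer = cp - &H10000
            sb.Append(ChrW(&HD800 + (v \ &H400)))   ' high surrogate
            sb.Append(ChrW(&HDC00 + (v Mod &H400))) ' low surrogate
        Else
            sb.Append(ChrW(cp))
        End If
    Next
    Return sb.ToString()
End Function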
 
Hi Chris,

Please refer to inline for the answers from our dev team. Please feel free
to post if there is anything unclear.

Thanks very much.

Best regards,
Yanhong Huang
Microsoft Community Support

Get Secure! - www.microsoft.com/security
This posting is provided "AS IS" with no warranties, and confers no rights.

--------------------
!
!I'll try to post a clearer explanation.
!
!I've got Unicode code points that are larger than 0xFFFF - these are
!encoded (according to the Unicode Spec) into UTF-8 using the surrogate pair
!algorithm. The .NET UTF8 implementation seems to handle surrogate pairs
!quite well - if I encode surrogate pairs into UTF8 and iterate over them
!using the StringInfo methods, I get the proper graphemes.
!

The customer is misunderstanding how both Unicode and the .NET framework
work here. The StringInfo methods only work with UTF-16 (where surrogate
pairs are legal). They receive UTF-8 (in the four-byte form), they convert
it to UTF-16, and they see surrogate pairs. They did not have surrogate
pairs before.


!Point 2 - "Surrogate code units are illegal in UTF8" - doesn't make sense
!to me. The Unicode spec makes no mention of this that I can find, and I see
!no other way of encoding 32 bit codepoints into UTF8.
!
Actually, the definition of UTF-8 is entirely clear on this matter, and the
definition as of Unicode 3.2 makes the "six byte form" illegal.


!At the end of the day my question is this: I need a full implementation of
!RFC 3454 (aka 'StringPrep', ftp://ftp.isi.edu/in-notes/rfc3454.txt) that
!works with all codepoints specified in the RFC (many of which are 32 bit
!codepoints). Is it possible to implement this RFC in .NET?
!
Yes it is. You convert to UTF-8 with the framework and it will use the
legal form.
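
In practice (a minimal sketch, assuming System.Text.Encoding.UTF8; the
module name and byte values are purely illustrative), the four-byte form
coming off the wire round-trips through the framework in the legal form:

Imports System.Text

Module RoundTripDemo
    Sub Main()
        ' The legal four-byte UTF-8 form of U+10000, as it might arrive off the socket
        Dim wire As Byte() = New Byte() {&HF0, &H90, &H80, &H80}
        Dim s As String = Encoding.UTF8.GetString(wire)   ' the surrogate pair D800 DC00 in UTF-16
        Dim back As Byte() = Encoding.UTF8.GetBytes(s)
        ' Expected output: 2 chars, re-encoded as F0-90-80-80
        Console.WriteLine("{0} chars, re-encoded as {1}", s.Length, BitConverter.ToString(back))
    End Sub
End Module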


!For example - one of the steps in StringPrep is to compare all the
!codepoints in the string against the various tables of "illegal
!characters". If any of the codepoints in the string matches one of the
!illegal characters, the string has failed StringPrep. Many of these
!codepoints require a 32 bit representation, hence in UTF8 they must be
!encoded as surrogate pairs.
!
See above. This will all work.
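
Putting the pieces together, the prohibited-character check can then run
over real code points instead of text elements (a minimal sketch reusing
the hypothetical ToCodePoints helper sketched earlier in the thread, plus
stringToTest and ResourcePrepTables from the earlier snippet):

' Fragment of a validation function: reject the string if any of its code
' points appears in the stringprep prohibited table.
For Each cp As Integer In ToCodePoints(stringToTest)
    If ResourcePrepTables.ContainsKey(cp) Then Return False
Next
Return True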


!So far the only way I've been able to look at the Surrogate Pairs as a
!single code point has been the following code:
!
!Dim si As New System.Globalization.StringInfo
!Dim myTEE As System.Globalization.TextElementEnumerator =
!si.GetTextElementEnumerator(stringToTest)
!myTEE.Reset()
!While myTEE.MoveNext()
! Dim CodePoint As Integer
! Dim grapheme As String = myTEE.GetTextElement
! If grapheme.Length > 1 Then
! Dim uc As Char = grapheme.Chars(0)
! Dim lc As Char = grapheme.Chars(1)
! CodePoint = ((AscW(uc) - &HD800) * &H400) + AscW(lc) - &HDC00 +
!&H10000
! Else
! CodePoint = AscW(grapheme)
! End If
!
! If ResourcePrepTables.ContainsKey(CodePoint) Then Return False
!End While
!
!In the code above, the ResourcePrepTables is a hash table with all of the
!"illegal" characters (represented as int32) stored in it. The CodePoint
!algorithm is taken from Section 3.7 of http://www.unicode.org/book/ch03.pdf
!
Actually, this is UTF-16 string handling, but it will work.


!My algorithm is obviously flawed, as it's using graphemes rather than code
!points, so things like combining characters are falling through the cracks -
!but I am at a loss to determine any other way to do it.
!
What precisely do you believe is failing that the standard claims must work
differently?


!Please, please, suggest a viable alternative!
!
If there is a specific problem that has not been solved yet, I will be sure
to try to address it. :-)


 
Chris Mullins said:
If you have the code handy that would turn a UTF-16 string into an array of
code points, I would appreciate it.

I will also need something that'll take a sequence of UTF-32 characters and
convert them back into a UTF-16 string.

(one of the steps in StringPrep is a "replace" step, where certain code
points are replaced with other code points - this is essentially a really
fancy .ToLower() algorithm...)

Well, I've possibly gone a bit overboard, writing a Utf32String
class... It's available at

http://www.pobox.com/~skeet/csharp/miscutil/src/MiscUtil/Text/Utf32String.cs

aka http://tinyurl.com/3bjlf

It's largely untested, but it's pretty simple code - please let me know
if you have any problems with it.

You'll basically be interested in the Utf32String(String) constructor
and the ToInt32Array() method.
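
For what it's worth, a minimal usage sketch from VB (the MiscUtil.Text
namespace is assumed from the source path above; stringToTest and
ResourcePrepTables are the names from the earlier snippet):

Dim u As New MiscUtil.Text.Utf32String(stringToTest)
For Each cp As Integer In u.ToInt32Array()
    ' Each element is a full UTF-32 code point, so the prohibited-table
    ' check sees supplementary characters and combining marks individually.
    If ResourcePrepTables.ContainsKey(cp) Then Return False
Next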
 
Hello Chris,

Do you still have any concerns about this? If there is anything we can do
for you, please feel free to post here.

Thanks very much.

Best regards,
Yanhong Huang
Microsoft Community Support

Get Secure! - www.microsoft.com/security
This posting is provided "AS IS" with no warranties, and confers no rights.
 