Chris Mullins
I've spent a bit of time over the last year trying to implement RFC 3454
(Preparation of Internationalized Strings, aka 'StringPrep').
This RFC is also a dependency for RFC 3491 (Internationalized Domain Names /
IDNA) which is something that I also need to support.
The problem that I've been struggling with in .NET is that of Unicode code
points > 0xFFFF. .NET strings are UTF-16, so these code points are stored
using the surrogate pair encoding scheme defined in section 3.7 of the
Unicode spec (http://www.unicode.org/book/ch03.pdf).
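
To make the surrogate pair arithmetic concrete, here is a minimal sketch in Python (chosen only for brevity, not .NET code). The function names are hypothetical; the constants are the standard UTF-16 surrogate ranges. It also shows that UTF-8 encodes the same code point as four bytes, with no surrogates involved.

```python
from typing import Tuple


def to_surrogate_pair(code_point: int) -> Tuple[int, int]:
    """Split a code point above 0xFFFF into UTF-16 high and low surrogates."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("only code points above 0xFFFF need a surrogate pair")
    offset = code_point - 0x10000        # 20 significant bits remain
    high = 0xD800 + (offset >> 10)       # top 10 bits
    low = 0xDC00 + (offset & 0x3FF)      # bottom 10 bits
    return high, low


def from_surrogate_pair(high: int, low: int) -> int:
    """Recombine a high/low surrogate pair into the original code point."""
    return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)


# U+10FF1 is stored as the two 16-bit code units 0xD803 0xDFF1 in UTF-16.
pair = to_surrogate_pair(0x10FF1)

# The same code point in UTF-8 is the four bytes F0 90 BF B1.
utf8_bytes = chr(0x10FF1).encode("utf-8")
```

The point of the round trip is that a surrogate pair always maps back to exactly one code point, so no information is lost at this level.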
Related to surrogate pairs is the whole set of Unicode combining
characters.
The problem, then, is this:
When I iterate over a string using the .NET StringInfo class I get a set of
graphemes. These graphemes correctly handle the combining characters and
surrogate pairs, and end up giving me a single UTF-32 Code Point for each
grapheme.
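
For comparison, here is a rough sketch of grapheme-style grouping in Python. It is an approximation under a stated assumption: a code point with a non-zero canonical combining class is attached to the preceding base character. That is not the full Unicode grapheme cluster algorithm, and it is not claimed to match what StringInfo does internally. The function name is hypothetical.

```python
import unicodedata
from typing import List


def rough_text_elements(text: str) -> List[str]:
    """Group each base character with the combining marks that follow it."""
    elements: List[str] = []
    for ch in text:
        if elements and unicodedata.combining(ch) != 0:
            elements[-1] += ch      # combining mark: extend the current element
        else:
            elements.append(ch)     # base character: start a new element
    return elements


# 'e' + U+0301 + U+0302 form one element; U+10FF1 forms another.
sample = "e\u0301\u0302\U00010FF1"
elements = rough_text_elements(sample)
```

In this sketch each element is a short string of one or more code points, so the code points inside an element can still be inspected individually.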
BUT, let's say the original string had U+10FF1 encoded as a UTF-16 surrogate
pair. This character is illegal in a particular stringprep profile.
The original string also had a combining character sequence U+0301 + U+0302
(for example) and the grapheme that the StringInfo class reports for this is
also U+10FF1.
The problem is that each of the combining characters IS legal in the
stringprep profile, but I have no way of telling if the original data was
the (illegal) UTF-32 code point, or the (legal) combining characters.
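
One way to keep the two cases apart is to walk the UTF-16 code units directly and check code points rather than graphemes. A minimal sketch in Python follows; the function names and the PROHIBITED set are hypothetical and stand in for a real stringprep profile table. In .NET, I believe char.IsSurrogatePair and char.ConvertToUtf32 would play the same role as the range checks and arithmetic below.

```python
from typing import Iterable, List

# Hypothetical stand-in for a profile's prohibited-output table.
PROHIBITED = {0x10FF1}


def code_points_from_utf16(units: List[int]) -> List[int]:
    """Decode 16-bit code units into code points, joining surrogate pairs."""
    points: List[int] = []
    i = 0
    while i < len(units):
        unit = units[i]
        is_high = 0xD800 <= unit <= 0xDBFF
        has_low = i + 1 < len(units) and 0xDC00 <= units[i + 1] <= 0xDFFF
        if is_high and has_low:
            low = units[i + 1]
            points.append(0x10000 + ((unit - 0xD800) << 10) + (low - 0xDC00))
            i += 2
        else:
            # BMP character, or an unpaired surrogate passed through unchanged
            points.append(unit)
            i += 1
    return points


def find_prohibited(units: List[int], table: Iterable[int] = PROHIBITED) -> List[int]:
    """Return the prohibited code points found in a UTF-16 code unit sequence."""
    banned = set(table)
    return [cp for cp in code_points_from_utf16(units) if cp in banned]
```

With this approach the surrogate pair 0xD803 0xDFF1 decodes to U+10FF1 and is flagged, while the sequence U+0301 U+0302 decodes to two separate code points and passes.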
Has anyone implemented any of this stuff in .NET?