Creating a Unicode Surrogate Pair

Chris Mullins · Sep 21, 2003

I've got a large Unicode character, and I'm trying to build it into a string.

The Unicode character is in the 0x10400 range, which is above 0xFFFF, so
it's going to require a surrogate pair.

I've been through all the logic to iterate over strings that already have
these pairs in them, but how do I encode this Unicode character INTO the
string? The string is UTF-8 encoded, but none of the things I've tried
using the encoders seems to work right...

Breaking it up into two 16-bit words (0x0001 and 0x0400) is obviously
incorrect, as 0x0001 falls outside the valid range for a high surrogate
(0xD800 to 0xDBFF).

I'm at a loss as to how to deal with it from here...

Jon Skeet · Sep 21, 2003

Try breaking the character into the pair 0xd801 and 0xdc00 - I believe the
algorithm is basically:

o Subtract 0x10000 from the code point
o High surrogate is 0xd800+(result/0x400), using integer division
o Low surrogate is 0xdc00+(result%0x400)
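
The steps above can be sketched as follows. This is only an illustration of the arithmetic in Python, not .NET code, and the function name to_surrogate_pair is made up for this example:

```python
def to_surrogate_pair(code_point):
    """Split a supplementary code point (0x10000..0x10FFFF) into UTF-16 surrogates."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("code point does not need a surrogate pair")
    offset = code_point - 0x10000        # 20-bit value left after the subtraction
    high = 0xD800 + (offset // 0x400)    # top 10 bits go into the high surrogate
    low = 0xDC00 + (offset % 0x400)      # bottom 10 bits go into the low surrogate
    return high, low


high, low = to_surrogate_pair(0x10400)
print(hex(high), hex(low))  # 0xd801 0xdc00
```

The range check is an addition for safety; code points at or below 0xFFFF are stored as a single 16-bit unit and never go through this split.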

Of course, whatever's reading the string will need to know what to do
with the surrogate. I've managed to avoid using them so far,
fortunately...
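
For the reading side, a consumer has to reverse the same arithmetic. A minimal sketch in Python, assuming the reader already has the two 16-bit values in hand (the name from_surrogate_pair is hypothetical):

```python
def from_surrogate_pair(high, low):
    """Combine a UTF-16 high/low surrogate pair back into one code point."""
    if not 0xD800 <= high <= 0xDBFF:
        raise ValueError("not a high surrogate")
    if not 0xDC00 <= low <= 0xDFFF:
        raise ValueError("not a low surrogate")
    # Undo the split: restore the two 10-bit halves, then add back 0x10000.
    return 0x10000 + (high - 0xD800) * 0x400 + (low - 0xDC00)


print(hex(from_surrogate_pair(0xD801, 0xDC00)))  # 0x10400
```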

Chris Mullins · Sep 21, 2003

After reading more, it looks like your suggestion is the best option for
.NET. I'm going to take any code point > 0xFFFF and break it down into a
surrogate pair, according to the algorithm found at:
http://www.unicode.org/book/ch03.pdf

This says that for a code point S, the high (H) and low (L) surrogates are
computed as:
H = (S-0x10000) / 0x400 + 0xD800
L = (S-0x10000) % 0x400 + 0xDC00

Now, I need to figure out what to do next...

Do I just append the high/low surrogate pair to my .NET string, or do I
have to pass this character array through the appropriate UTF-8/UTF-16
encoder to turn it into an encoded byte stream, and then somehow massage
that into my string?

Jon Skeet · Sep 22, 2003

You just append them to your string - anything which can cope with
surrogates should then recognise them appropriately.
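
To illustrate why appending is enough: a .NET string is a sequence of UTF-16 code units, so the two surrogates sit next to the other characters as ordinary 16-bit values. Python strings are not stored that way, so the sketch below simulates it with a list of 16-bit units and lets the UTF-16 decoder play the part of "anything which can cope with surrogates". The helper name utf16_units_to_text is made up for this example.

```python
import struct


def utf16_units_to_text(units):
    """Interpret a list of 16-bit code units as big-endian UTF-16 text."""
    raw = struct.pack(">%dH" % len(units), *units)
    return raw.decode("utf-16-be")


# 'A', then the surrogate pair for 0x10400, then 'B': four 16-bit units.
units = [ord("A"), 0xD801, 0xDC00, ord("B")]
text = utf16_units_to_text(units)

print(len(text))             # 3 (the pair decodes to a single character)
print(hex(ord(text[1])))     # 0x10400
print(text.encode("utf-8"))  # b'A\xf0\x90\x90\x80B'
```

The last line shows the UTF-8 side: the encoder works from the combined code point and emits one four-byte sequence, rather than encoding each surrogate separately.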
