Creating a Unicode Surrogate Pair

  • Thread starter: Chris Mullins

Chris Mullins

I've got a big Unicode character, and I'm trying to build it into a string.

The Unicode character is in the 0x10400 range, so it's going to require a
surrogate pair.

I've been through all the logic to iterate over strings that already have
these pairs in them, but how do I encode this Unicode character INTO the
string? The string is UTF-8 encoded, but nothing I've tried with the
encoders seems to work right...

Breaking it up into words (0x0001 and 0x0400) is obviously incorrect, as this
violates the valid range for a high surrogate.

I'm at a loss as to how to deal with it from here...
 
Chris Mullins said:
I've got a big Unicode character, and I'm trying to build it into a string.

The Unicode character is in the 0x10400 range, so it's going to require a
surrogate pair.

I've been through all the logic to iterate over strings that already have
these pairs in them, but how do I encode this Unicode character INTO the
string? The string is UTF-8 encoded, but nothing I've tried with the
encoders seems to work right...

Breaking it up into words (0x0001 and 0x0400) is obviously incorrect, as this
violates the valid range for a high surrogate.

I'm at a loss as to how to deal with it from here...

Try breaking the character into the pair 0xd801 and 0xdc00 - I believe the
algorithm is basically:

o Subtract 0x10000
o High surrogate is 0xd800+(result/0x400)
o Low surrogate is 0xdc00+(result%0x400)

Of course, whatever's reading the string will need to know what to do
with the surrogate. I've managed to avoid using them so far,
fortunately...
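
For anyone following along, the three steps above could be sketched in C# roughly as below. The names `SurrogateSketch` and `ToSurrogates` are made up for illustration and are not framework APIs; the range check is an added assumption, since the steps only make sense for code points from 0x10000 to 0x10FFFF.

```csharp
using System;

public static class SurrogateSketch
{
    // Hypothetical helper: splits a supplementary code point
    // (0x10000..0x10FFFF) into a UTF-16 high and low surrogate.
    public static void ToSurrogates(int codePoint, out char high, out char low)
    {
        if (codePoint < 0x10000 || codePoint > 0x10FFFF)
            throw new ArgumentOutOfRangeException("codePoint");

        int offset = codePoint - 0x10000;        // subtract 0x10000
        high = (char)(0xD800 + offset / 0x400);  // top 10 bits of the offset
        low  = (char)(0xDC00 + offset % 0x400);  // bottom 10 bits of the offset
    }
}
```

For 0x10400 the offset is 0x400, so this gives 0xD801 and 0xDC00, the same pair suggested above.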
 
Jon Skeet said:
Try breaking the character into the pair 0xd801 and 0xdc00 - I believe the
algorithm is basically:

o Subtract 0x10000
o High surrogate is 0xd800+(result/0x400)
o Low surrogate is 0xdc00+(result%0x400)

After reading more, it looks like your suggestion is the best option for
.NET. I'm going to take any code point > 0xFFFF and break it down into a
surrogate pair, according to the algorithm found at:
http://www.unicode.org/book/ch03.pdf

This says the surrogate chars will be computed as:
H = (S-0x10000) / 0x400 + 0xD800
L = (S-0x10000) % 0x400 + 0xDC00
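
One way to sanity-check the H and L formulas is to invert them and confirm that every code point above 0xFFFF survives the round trip. The sketch below does that in C#; `SurrogateFormulaCheck` and its members are illustrative names, and `Combine` is simply the two formulas solved for S, not something taken from the linked chapter.

```csharp
using System;

public static class SurrogateFormulaCheck
{
    // H and L written exactly as in the formulas above; s is the code point S.
    public static int High(int s) { return (s - 0x10000) / 0x400 + 0xD800; }
    public static int Low(int s)  { return (s - 0x10000) % 0x400 + 0xDC00; }

    // Inverse mapping (assumed helper): rebuilds S from a high/low pair.
    public static int Combine(int h, int l)
    {
        return (h - 0xD800) * 0x400 + (l - 0xDC00) + 0x10000;
    }

    // True if every supplementary code point round-trips through H and L.
    public static bool RoundTripsEverywhere()
    {
        for (int s = 0x10000; s <= 0x10FFFF; s++)
        {
            if (Combine(High(s), Low(s)) != s) return false;
        }
        return true;
    }
}
```

With S = 0x10400, `High` returns 0xD801 and `Low` returns 0xDC00, and `Combine(0xD801, 0xDC00)` gives back 0x10400.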

Now, I need to figure out what to do next...

Do I just append the high/low surrogate pair to my .NET string, or do I
have to pass this character array through the appropriate UTF-8/UTF-16
encoder to turn it into an encoded byte stream, and then somehow massage
that into my string?
 
Chris Mullins said:
After reading more, it looks like your suggestion is the best option for
.NET. I'm going to take any code point > 0xFFFF and break it down into a
surrogate pair, according to the algorithm found at:
http://www.unicode.org/book/ch03.pdf

This says the surrogate chars will be computed as:
H = (S-0x10000) / 0x400 + 0xD800
L = (S-0x10000) % 0x400 + 0xDC00

Now, I need to figure out what to do next...

Do I just append the high/low surrogate pair to my .NET string, or do I
have to pass this character array through the appropriate UTF-8/UTF-16
encoder to turn it into an encoded byte stream, and then somehow massage
that into my string?

You just append them to your string - anything which can cope with
surrogates should then recognise them appropriately.
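
A minimal C# sketch of that, assuming the pair 0xD801/0xDC00 from earlier in the thread: the two chars are appended directly, and UTF-8 only comes into play when the finished string is turned into bytes. `AppendSurrogateSketch` and its members are illustrative names. `char.ConvertFromUtf32` is a real framework method on .NET 2.0 and later, used here only as a cross-check; on older frameworks that line would have to go.

```csharp
using System;
using System.Text;

public static class AppendSurrogateSketch
{
    // Builds "abc" followed by the surrogate pair for code point 0x10400.
    public static string BuildSample()
    {
        StringBuilder sb = new StringBuilder("abc");
        sb.Append((char)0xD801);  // high surrogate
        sb.Append((char)0xDC00);  // low surrogate
        return sb.ToString();
    }

    // Encoding happens only here, when bytes are actually needed.
    public static byte[] ToUtf8(string s)
    {
        return Encoding.UTF8.GetBytes(s);
    }

    // Cross-check against the framework's own conversion (.NET 2.0+).
    public static bool MatchesBuiltIn()
    {
        return BuildSample() == "abc" + char.ConvertFromUtf32(0x10400);
    }
}
```

The resulting string has a `Length` of 5 (three letters plus two UTF-16 code units), and its UTF-8 form is 7 bytes, the last four being F0 90 90 80.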
 