Byte size of characters when encoding

  • Thread starter: Vladimir
But that can't be represented by a single .NET character. It requires a
surrogate pair, which I'd expect to count as two characters as far as
the input of GetMaxByteCount is concerned - after all, the Encoding
will see two distinct characters making up the surrogate pair when it's
asked to encode a string or char array.
True. But what is the main reason to use GetMaxByteCount?
Well, if I have a Unicode string and want to allocate a buffer for the
result of the conversion, then the typical use is
length_of_the_string * GetMaxByteCount()
Dealing with a string means I can get surrogates.
 
Mihai N. said:
This is what I tried to combat earlier.
The Unicode range is 0-10FFFF, which means a maximum of 4 bytes per
character. Anything above 4 is possible but incorrect, and can be
produced only by broken encoders.

This tells me it is aware of surrogates, and in fact uses UTF-16.
The Windows API for NT/2000/XP/2003 is UCS-2, but .NET might be UTF-16.

The string class itself isn't aware of surrogates, as far as I know.
The encoder needs to be aware in order to know how to encode them, but
the question is whether the count parameter should treat a surrogate
pair as two characters or one - given the rest of the API which
strongly leans towards them being two characters, that's what I think
should happen here. In either case, the documentation should be very
clear about this.
 
Mihai N. said:
True. But what is the main reason to use GetMaxByteCount?
Well, if I have a Unicode string and want to allocate a buffer for the
result of the conversion, then the typical use is
length_of_the_string * GetMaxByteCount()
Dealing with a string means I can get surrogates.

Yes - and if GetMaxByteCount assumes that the surrogates will count as
two characters, you can just use:

int maxSize = encoding.GetMaxByteCount(myString.Length);
byte[] buffer = new byte[maxSize];
....
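For completeness, the same pattern with the encode step filled in might look
like this (just a sketch of typical usage continuing the snippet above; the
variable names are made up):

// Allocate a worst-case buffer, then let GetBytes report how many bytes it wrote.
int maxSize = encoding.GetMaxByteCount(myString.Length);
byte[] buffer = new byte[maxSize];
int actualSize = encoding.GetBytes(myString, 0, myString.Length, buffer, 0);
// Only the first actualSize bytes of buffer are meaningful.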

String.Length reports surrogates as two characters. For instance:

using System;
using System.Text;

class Test
{
    static void Main()
    {
        // Gothic letter AHSA, UTF-32 value of U+10330,
        // i.e. the surrogate pair D800 DF30
        string x = "\ud800\udf30";

        Console.WriteLine (x.Length);                          // prints 2
        Console.WriteLine (Encoding.UTF8.GetBytes(x).Length);  // prints 4
    }
}

Making UTF8Encoding.GetMaxByteCount(count) return count*3 will always
work with the type of code given earlier for creating a new buffer, and
will lead to less wastage than returning count*4.

It's only if String.Length counted surrogates as single characters that
you'd need to return count*4.
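To see why count*3 is enough when the count comes from a UTF-16 string: a BMP
character takes at most 3 UTF-8 bytes, and a surrogate pair is 2 chars but
only 4 bytes, still under 2*3 = 6. A quick standalone check (a sketch, not
quoted from any of the posts):

using System;
using System.Text;

class MaxByteCountCheck
{
    static void Main()
    {
        // Worst case in the BMP: e.g. U+20AC (EURO SIGN) is 3 UTF-8 bytes.
        string bmp = new string('\u20ac', 5);          // 5 chars
        // Each Gothic AHSA is a surrogate pair: 2 chars, 4 UTF-8 bytes.
        string astral = "\ud800\udf30\ud800\udf30";    // 4 chars

        Console.WriteLine (Encoding.UTF8.GetBytes(bmp).Length);    // 15 = 5 * 3
        Console.WriteLine (Encoding.UTF8.GetBytes(astral).Length); // 8 <= 4 * 3
    }
}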
 
Vladimir, Unicode is NOT UTF-16. And .NET doesn't use UTF-16 internally;
it uses UCS-2, which is a different encoding.

Different in what?

The UnicodeEncoding class represents a UTF-16 encoding of Unicode characters
(this is from the documentation).
And it works straightforwardly with the Char structure.
 
Vladimir said:
Different in what?

The UnicodeEncoding class represents a UTF-16 encoding of Unicode characters
(this is from the documentation).
And it works straightforwardly with the Char structure.

The difference is that UCS-2 can only encode Unicode characters 0-
0xffff. UTF-16 can encode the whole of Unicode.

I'm not *entirely* clear, but I believe that the difference is fairly
minimal in .NET itself, unless you view the characters which form
surrogate pairs as invalid UCS-2 characters (pass on that one, I'm
afraid). If you had a 32-bit character data type to start with,
however, a correct UCS-2 encoding would reject characters above 0xffff,
whereas a correct UTF-16 encoding would cope.

I guess another way of looking at it (please someone, correct me if I'm
wrong!) is that although each character in .NET is only UCS-2, strings
are sometimes regarded as UTF-16. (It's the "sometimes" which is the
problem, here.)
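As an illustration of what "regarded as UTF-16" means here, a code point above
U+FFFF is stored in a string as two 16-bit chars computed roughly like this
(a sketch; ToSurrogatePair is a made-up helper, not a framework method):

// Split a code point above U+FFFF into a UTF-16 surrogate pair.
static void ToSurrogatePair(int codePoint, out char high, out char low)
{
    int v = codePoint - 0x10000;          // leaves a 20-bit value
    high = (char)(0xD800 + (v >> 10));    // top 10 bits -> high surrogate
    low  = (char)(0xDC00 + (v & 0x3FF));  // low 10 bits -> low surrogate
}

// For U+10330 this gives 0xD800 and 0xDF30 - exactly the "\ud800\udf30"
// pair used in the earlier example.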
 
Vladimir said:
By the way.
Is there a way to compress Unicode strings with SCSU in .NET?

I don't know of any way built into the framework. I'd be happy to
collaborate with someone on an open source solution, if people think it
would be useful.
 
It's only if String.Length counted surrogates as single characters that
you'd need to return count*4.
This is why is used length_of_the_string (something generic) and not
String.Length (real API).
My guess is that .NET and the strings in .NET are not yet fully
aware of surrogates. Some parts have to be (the converter), some parts not.
String.Length returns the number of Chars, but these are .NET chars,
not Unicode chars.
At some point you may need some API to tell you the length of the string
in Unicode characters. Imagine someone typing 5 Unicode characters, all of
them represented by surrogate pairs. String.Length returns 10, the
application complains that the user name (for instance) should be max 8
characters, and the user is puzzled, because he did type only 5.
But the IME is not there for this, and many things are not in place yet.

We can assume this will be cleaned up at some point. All we can do is
understand the differences between Unicode (the standard) and the real-life
use of Unicode (.NET, NT, XP, Unix, etc.). Know what the standard
states and what the implementations do differently.
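Such an API could be approximated today by counting surrogate pairs by hand;
here is a rough sketch (CodePointCount is a made-up name, and it would live in
whatever helper class you like):

// Count Unicode code points rather than .NET chars: a well-formed
// surrogate pair contributes one code point, not two.
static int CodePointCount(string s)
{
    int count = 0;
    for (int i = 0; i < s.Length; i++)
    {
        count++;
        bool highSurrogate = s[i] >= '\ud800' && s[i] <= '\udbff';
        if (highSurrogate && i + 1 < s.Length &&
            s[i + 1] >= '\udc00' && s[i + 1] <= '\udfff')
        {
            i++;    // skip the matching low surrogate
        }
    }
    return count;
}

// For the 5-character user name above this returns 5, even though
// String.Length reports 10.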
 
Mihai N. said:
This is why is used length_of_the_string (something generic) and not
String.Length (real API).

What do you mean by "This is why is used"? Who are you saying is using
this code?
My guess is that .NET and the strings in .NET are not yet fully
aware of surrogates. Some parts have to be (the converter), some parts not.
String.Length returns the number of Chars, but these are .NET chars,
not Unicode chars.
Yup.

At some point you may need some API to tell you the length of the string
in Unicode characters.

Indeed. I wrote a Utf32String class a while ago which does all this,
and can convert to and from "normal" strings.
Imagine someone typing 5 Unicode characters, all of
them represented by surrogate pairs. String.Length returns 10, the
application complains that the user name (for instance) should be max 8
characters, and the user is puzzled, because he did type only 5.

Blech - yes, that's horrible.
But the IME is not there for this, and many things are not in place yet.

We can assume this will be cleaned up at some point. All we can do is
understand the differences between Unicode (the standard) and the real-life
use of Unicode (.NET, NT, XP, Unix, etc.). Know what the standard
states and what the implementations do differently.

Yup. To be honest, I can't see it being *cleanly* sorted without taking
the hit of going for full UTF-32 (or UCS-4 - I don't know if there's
any difference) characters. Doing that would be a nasty memory hit, but
it may be what's required.
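A Utf32String-style class would essentially hold the code points as 32-bit
integers; a minimal sketch of the conversion in one direction, reusing the
CodePointCount idea from earlier (hypothetical code, not the actual
Utf32String class):

// Convert a UTF-16 string to an array of UTF-32 code points.
static int[] ToUtf32(string s)
{
    int[] result = new int[CodePointCount(s)];   // counting sketch from earlier
    int n = 0;
    for (int i = 0; i < s.Length; i++)
    {
        char c = s[i];
        if (c >= '\ud800' && c <= '\udbff' && i + 1 < s.Length &&
            s[i + 1] >= '\udc00' && s[i + 1] <= '\udfff')
        {
            // Recombine the surrogate pair into a single code point.
            result[n++] = 0x10000 + ((c - 0xD800) << 10) + (s[++i] - 0xDC00);
        }
        else
        {
            result[n++] = c;
        }
    }
    return result;
}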
 
Not true again, UTF-16 can only encode 0 - 0x10FFFF
(http://www.ietf.org/rfc/rfc2781.txt), while UCS-2 can only use characters
0 - 0xFFFF. So I was wrong, .NET uses UTF-16. UCS-2 is simply using 16-bit
Unicode characters, without surrogate pairs (those are created by UTF-16
encoding).

Jerry
 
Jerry Pisk said:
Not true again, UTF-16 can only encode 0 - 0x10FFFF
(http://www.ietf.org/rfc/rfc2781.txt), while UCS-2 can only use characters
0 - 0xFFFF. So I was wrong, .NET uses UTF-16. UCS-2 is simply using 16-bit
Unicode characters, without surrogate pairs (those are created by UTF-16
encoding).

But each character itself in .NET is only 16 bits. It's only strings
which have the concept of surrogate pairs, surely. The .NET concept of
a character is limited to UCS-2, but other things can interpret
sequences of those characters as UTF-16 sequences.

If you could state *exactly* which part of my post was "not true" it
would make it easier to either defend my position or retract it though.
 
"Not true" was referring to you saying that UTF-16 can encode the whole
Unicode range. It can't: the maximum value it can encode is 0x10FFFF, not
the whole Unicode range (which doesn't currently use higher values but it
also doesn't say those are invalid).

Jerry
 
Jerry Pisk said:
"Not true" was referring to you saying that UTF-16 can encode the whole
Unicode range. It can't: the maximum value it can encode is 0x10FFFF, not
the whole Unicode range (which doesn't currently use higher values but it
also doesn't say those are invalid).

I don't believe that's true. While ISO/IEC 10646 contains 2^31 code
positions, I believe the Unicode standard itself limits characters to
the BMP or the first supplementary 14 planes of ISO/IEC 10646. From the
Unicode standard:

<quote>
The Principles and Procedures document of JTC1/SC2/WG2 states that all
future assignments of characters to 10646 will be constrained to the
BMP or the first 14 supplementary planes. This is to ensure
interoperability between the 10646 transformation formats (see below).
It also guarantees interoperability with implementations of the Unicode
Standard, for which only code positions 0..10FFFF₁₆ are meaningful.
</quote>

From elsewhere in the standard
(http://www.unicode.org/versions/Unicode4.0.0/ch02.pdf):

<quote>
In the Unicode Standard, the codespace consists of the integers from 0
to 10FFFF₁₆, comprising 1,114,112 code points available for assigning
the repertoire of abstract characters. Of course, there are constraints
on how the codespace is organized, and particular areas of the
codespace have been set aside for encoding of certain kinds of abstract
characters or for other uses in the standard. For more on the
allocation of the Unicode codespace, see Section 2.8, Unicode
Allocation.
</quote>
 
This is why is used length_of_the_string (something generic) and not
String.Length (real API).
What do you mean by "This is why is used"? Who are you saying is using
this code?
My mistake. "This is why I used". It was kind of pseudo-code to avoid any
specific API.

Otherwise, we seem to agree on all :-)
 
Not true was referring to you saying that UTF-16 can encode the whole
Unicode range. It can't, the maximum value it can encode is 0x10FFFF, not
the whole Unicode range (which doesn't currently use higher values but it
also doesn't say those are invalid).
0 - 0x10FFFF IS the whole Unicode range.
 
Mihai N. said:
My mistake. "This is why I used". It was kind of pseudo-code to avoid any
specific API.

Otherwise, we seem to agree on all :-)

Yup. I've just submitted a comment to the MSDN team to make the docs
more explicit.
 