Byte size of characters when encoding

Vladimir

Method UnicodeEncoding.GetMaxByteCount(charCount) returns charCount * 2.
Method UTF8Encoding.GetMaxByteCount(charCount) returns charCount * 4.

But why is that?

Look:

/*
Each Unicode character in a string is defined by a Unicode scalar value,
also called ...

An index is the position of a Char, not a Unicode character, in a String. An
index is a zero-based, nonnegative number starting from the first position
in the string, which is index position zero. Consecutive index values might
not correspond to consecutive Unicode characters because a Unicode character
might be encoded as more than one Char. To work with each Unicode character
instead of each Char, use the System.Globalization.StringInfo class.
*/

With UTF-8 encoding, can one instance of struct Char occupy only 1/2, 1,
1 1/2, or 2 bytes?
Isn't that so?
Therefore UTF8Encoding.GetMaxByteCount(charCount) should return
charCount * 2, because charCount means the count of instances of struct Char.
Or not? Maybe it means the count of Unicode characters?
If so, then UnicodeEncoding.GetMaxByteCount(charCount) should return
charCount * 4.

These methods are not consistent with each other.
 
Vladimir said:
Method UnicodeEncoding.GetMaxByteCount(charCount) returns charCount * 2.
Method UTF8Encoding.GetMaxByteCount(charCount) returns charCount * 4.

But why is that?

Strings in .NET are already Unicode encoded. So if you encode the
string to an array of bytes, you get two bytes per character.

However, for UTF8 encoding a single Unicode character can be encoded
using up to 4 bytes in the worst case. charCount*4 is just a worst case
scenario if the string happened to contain only characters that required
4 byte encoding.
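If you want to see the actual numbers, a minimal sketch like the one below
prints whatever the running framework reports (the exact formula may differ
between framework versions, so don't take the *2 and *4 factors as universal):

using System;
using System.Text;

class MaxByteCountCheck
{
    static void Main()
    {
        // Print the reported worst-case byte counts for a few character counts.
        int[] counts = new int[] { 1, 5, 100 };
        foreach (int charCount in counts)
        {
            Console.WriteLine("charCount = {0}: Unicode max = {1}, UTF-8 max = {2}",
                charCount,
                Encoding.Unicode.GetMaxByteCount(charCount),
                Encoding.UTF8.GetMaxByteCount(charCount));
        }
    }
}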
 
However, for UTF8 encoding a single Unicode character can be encoded
using up to 4 bytes in the worst case. charCount*4 is just a worst case
scenario if the string happened to contain only characters that required
4 byte encoding.

Do you want to say that two instances of struct Char in UTF-8 can occupy 8
bytes?
 
Vladimir said:
Do you want to say that two instances of struct Char in UTF-8 can occupy 8
bytes?

It turns out that while a UTF8 character can take up to 4 bytes to be
encoded, for the Framework, a struct Char can always be encoded in at
most 3 bytes. That's because the struct char holds a 16-bit Unicode
value, and that can always be encoded in 3 or fewer bytes.

A 4-byte UTF8 encoding is only needed for Unicode code points that
require 'surrogates' - or a pair of 16-bit values to represent the
character. Surrogates cannot be represented in a single struct Char -
but I believe they are supported in strings.

Anyway, here's what can happen using struct Char:

char c1 = '\uFFFF';
char c2 = '\u1000';

// GetBytes is an instance method, so use an Encoding instance such as
// the shared Encoding.UTF8 (requires using System.Text).
byte[] utf8bytes = Encoding.UTF8.GetBytes(new char[] { c1, c2 });

If you dump the byte array, you'll see that each Char was encoded into 3
UTF8 bytes.
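To see the surrogate case as well: a pair such as U+D800/U+DC00 (together the
single code point U+10000) should come out as one 4-byte UTF-8 sequence rather
than 3 + 3 bytes. A small sketch:

using System;
using System.Text;

class SurrogateUtf8Check
{
    static void Main()
    {
        // The two halves of a surrogate pair only mean something together;
        // encoded from a string they become one 4-byte sequence for U+10000.
        string surrogatePair = "\uD800\uDC00";
        byte[] utf8 = Encoding.UTF8.GetBytes(surrogatePair);

        Console.WriteLine(utf8.Length); // expected: 4 (0xF0 0x90 0x80 0x80)
    }
}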

Jon Skeet has written an excellent article on this type of issue:

http://www.yoda.arachsys.com/csharp/unicode.html
 
It turns out that while a UTF8 character can take up to 4 bytes to be
encoded, for the Framework, a struct Char can always be encoded in at
most 3 bytes.

It's making me crazy.
I don't understand.

Method UnicodeEncoding.GetMaxByteCount(charCount) returns charCount * 2.
Method UTF8Encoding.GetMaxByteCount(charCount) returns charCount * 4.

If charCount means 32-bit Unicode characters:
UnicodeEncoding.GetMaxByteCount(charCount) should return charCount * 4.
UTF8Encoding.GetMaxByteCount(charCount) should return charCount * 4.

If charCount means 16-bit Unicode characters (Char structures):
UnicodeEncoding.GetMaxByteCount(charCount) should return charCount * 2.
UTF8Encoding.GetMaxByteCount(charCount) should return charCount * 3.

Suppose we have a string of length 5 (a string's length is the count of
instances of struct Char).
UTF8Encoding.GetMaxByteCount(stringInstance.Length) should then return 15,
but that is not what it actually returns.

And maybe each surrogate pair in a string (two 16-bit characters) occupies
only 4 bytes in UTF-8?
Yes or no?

Look:

/*
UTF-16 encodes each 16-bit character as 2 bytes. It doesn't affect the
characters at all, and no compression occurs; its performance is excellent.
UTF-16 encoding is also referred to as Unicode encoding.

UTF-8 encodes some characters as 1 byte, some characters as 2 bytes, some
characters as 3 bytes, and some characters as 4 bytes. Characters with a
value below 0x0080 are compressed to 1 byte, which works very well for
characters used in the United States. Characters between 0x0080 and 0x07FF
are converted to 2 bytes, which works well for European and Middle Eastern
languages. Characters of 0x0800 and above are converted to 3 bytes, which
works well for East Asian languages. Finally, surrogate character pairs are
written out as 4 bytes. UTF-8 is an extremely popular encoding, but it's
less useful than UTF-16 if you encode many characters with values of 0x0800
or above.
*/

Does it mean that each pair of characters in UTF-16 can't occupy more
than 4 bytes in UTF-8?

Wait a minute.
It seems I understand something now.

Characters below 0x0800 occupy 2 bytes or fewer in UTF-8
(in UTF-16 they always occupy 2 bytes).
Characters at 0x0800 and above occupy 3 bytes in UTF-8
(in UTF-16 they always occupy 2 bytes).
A surrogate pair occupies 4 bytes in UTF-8
(in UTF-16 it also always occupies 4 bytes).

Right?

But then I think UTF8Encoding.GetMaxByteCount(charCount) should
return charCount * 3.
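Counting the bytes directly for one sample character from each range seems to
bear this out (a quick sketch; the sample code points are arbitrary picks):

using System;
using System.Text;

class PerRangeCheck
{
    static void Main()
    {
        // One sample from each range: below 0x80, 0x80-0x7FF, 0x800 and above,
        // plus a surrogate pair.
        string[] samples = new string[] { "A", "\u00E9", "\u4E2D", "\uD800\uDC00" };

        foreach (string s in samples)
        {
            Console.WriteLine("{0} char(s): {1} UTF-8 byte(s), {2} UTF-16 byte(s)",
                s.Length,
                Encoding.UTF8.GetByteCount(s),
                Encoding.Unicode.GetByteCount(s));
        }
    }
}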
 
Vladimir said:
It's making me crazy.
I don't understand.

I think it's just a bug. UnicodeEncoding is doing the right thing, but
UTF8Encoding should return charCount*3, not charCount*4.
 
I think it's just a bug. UnicodeEncoding is doing the right thing, but
UTF8Encoding should return charCount*3, not charCount*4.

How can we send the bug report?
And...

I've found that the BitArray(int length) constructor throws an overflow
exception when length is in the range from int.MaxValue - 30 to int.MaxValue.
 
You're right, it is a bug, but the correct answer is not what you think it
is. In UTF-8 a character can be up to 6 bytes, see
http://www.ietf.org/rfc/rfc2279.txt, chapter 2. As for the framework's
internal representation - it uses UCS-2, where each character is expressed
as 2 bytes, with the exception of characters larger than 0xFFFF, which are
expressed as a sequence of two characters, called a surrogate pair. So each
character in UCS-2 takes up two bytes, but some Unicode characters have to be
expressed in pairs.

Jerry
 
Jerry Pisk said:
You're right, it is a bug, but the correct answer is not what you think it
is.

I think that depends on how you read the documentation.

In UTF-8 a character can be up to 6 bytes, see
http://www.ietf.org/rfc/rfc2279.txt, chapter 2. As for the framework's
internal representation - it uses UCS-2, where each character is expressed
as 2 bytes, with the exception of characters larger than 0xFFFF, which are
expressed as a sequence of two characters, called a surrogate pair. So each
character in UCS-2 takes up two bytes, but some Unicode characters have to be
expressed in pairs.

That's exactly what I thought. I believe GetMaxByteCount is meant to
return the maximum number of bytes for a sequence of 16-bit characters
though, where 2 characters forming a surrogate pair count as 2
characters in the input. That way the maximum number of bytes required
to encode a string, for instance, is GetMaxByteCount(theString.Length).
Given that pretty much the whole of the framework works on the
assumption that a character is 16 bits and that surrogate pairs *are*
two characters, this seems more useful. It would be better if it were
more explicitly documented either way, however.
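That reading is easy to sanity-check: for any string, the byte count GetBytes
produces should never exceed GetMaxByteCount(theString.Length). A small sketch
(the sample string is an arbitrary mix that includes a surrogate pair):

using System;
using System.Text;

class MaxBoundCheck
{
    static void Main()
    {
        // 5 chars: a 1-, 2- and 3-byte character plus one surrogate pair.
        string s = "A\u00E9\u4E2D\uD800\uDC00";

        int actual = Encoding.UTF8.GetBytes(s).Length;              // 1 + 2 + 3 + 4 = 10
        int reportedMax = Encoding.UTF8.GetMaxByteCount(s.Length);

        Console.WriteLine("actual = {0}, reported max = {1}", actual, reportedMax);
    }
}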
 
Vladimir said:
How can we send the bug report?

I don't know the best way of submitting bugs for 1.1. I'll try to
remember to submit it as a Whidbey bug if I get the time to test it.
(Unfortunately time is something I'm short of at the moment.)

And...

I've found that the BitArray(int length) constructor throws an overflow
exception when length is in the range from int.MaxValue - 30 to int.MaxValue.

I'm not entirely surprised, but it should at least be documented I
guess.
 
I'm not entirely surprised, but it should at least be documented I
guess.

I think it should throw ArgumentOutOfRangeException, or (better still) handle
the whole range from 0 to int.MaxValue. That can be done easily.

Just replace (length + 31) / 32 with ((length % 32 == 0) ? (length / 32) :
(length / 32 + 1)).
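The difference is easy to see in isolation (a sketch of the arithmetic only,
not the actual BitArray source; presumably the wrapped-around value ends up as
an array size inside the constructor, which is where the exception comes from):

using System;

class BitArrayLengthMath
{
    static void Main()
    {
        int length = int.MaxValue;

        // The (length + 31) / 32 form wraps around near int.MaxValue and goes negative...
        Console.WriteLine(unchecked((length + 31) / 32));

        // ...while the replacement never adds anything, so it cannot overflow.
        Console.WriteLine((length % 32 == 0) ? (length / 32) : (length / 32 + 1));
    }
}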

And it seems there are problems in the other constructors of BitArray as
well, everywhere (length + 31) / 32 and length * 8 are used.

For example, BitArray(int[]): obviously it can only handle an array with
length up to 67 108 864, so for longer arrays it should throw
ArgumentOutOfRangeException. But it doesn't at all.
I'm not sure, but I think it will throw an overflow exception.
 
You're right, it is a bug, but the correct answer is not what you think it
is. In UTF-8 a character can be up to 6 bytes, see
http://www.ietf.org/rfc/rfc2279.txt, chapter 2.

I think 4 IS the right answer.
Reading the RFC tells you that up to 4 bytes are used to represent the
range between 00010000-001FFFFF.
Well, Unicode stops at 10FFFF.
Anything longer than 4 bytes is incorrect Unicode.
And UTF-8 encoders/decoders should be aware of this; otherwise
it can even lead to security vulnerabilities (like buffer overruns).
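Writing out the standard UTF-8 width rules makes the point (a sketch;
Utf8Length is just an illustrative helper here, not a framework method):

using System;

class Utf8Width
{
    // UTF-8 byte count for a single Unicode code point.  Everything up to
    // U+10FFFF fits in 4 bytes; the 5- and 6-byte forms in RFC 2279 cover
    // values that Unicode itself never assigns.
    static int Utf8Length(int codePoint)
    {
        if (codePoint < 0x80) return 1;
        if (codePoint < 0x800) return 2;
        if (codePoint < 0x10000) return 3;
        return 4; // 0x10000 .. 0x10FFFF
    }

    static void Main()
    {
        Console.WriteLine(Utf8Length(0x10FFFF)); // prints 4
    }
}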
 
Mihai N. said:
I think 4 IS the right answer.
Reading the RFC tells you that up to 4 bytes are used to represent the
range between 00010000-001FFFFF.

But that can't be represented by a single .NET character. It requires a
surrogate pair, which I'd expect to count as two characters as far as
the input of GetMaxByteCount is concerned - after all, the Encoding
will see two distinct characters making up the surrogate pair when it's
asked to encode a string or char array.
 
I don't think so, just because .Net internally uses UCS-2 doesn't mean two
surrogate characters are two characters. They're a single character as far
as Unicode is concerned.

The whole issue comes down to the documentation not being very clear. It
says GetMaxByteCount takes the number of characters to encode but it doesn't
say in what encoding. If it's number of characters in UCS-2 then you're
right, 4 is the worst case, if it's Unicode characters then 6 is the correct
value. I'm not really sure what CLR says, if it treats character data as
Unicode or as UCS-2 encoded Unicode (and I'm not talking about the internal
representation here, I'm talking about what character data type actually
stands for).

Jerry
 
The whole issue comes down to the documentation not being very clear. It
says GetMaxByteCount takes the number of characters to encode but it doesn't
say in what encoding.

/*
Encoding Class

Remarks
Methods are provided to convert arrays and strings !of Unicode characters!
to and from arrays of bytes encoded for a target code page.
*/

Therefore the maximal character count means Unicode (UTF-16) characters.

And it seems the implementation of ASCIIEncoding.GetBytes() doesn't know
anything about surrogate pairs: for a surrogate pair it returns two bytes.
Therefore the maximal character count doesn't mean ... you know.
 
Vladimir, Unicode is NOT UTF-16. And .Net doesn't use UTF-16 internally; it
uses UCS-2, which is a different encoding.

Jerry
 
Jerry Pisk said:
I don't think so, just because .Net internally uses UCS-2 doesn't mean two
surrogate characters are two characters. They're a single character as far
as Unicode is concerned.

But they're two characters as far as almost the whole of the rest of
the .NET API is concerned. String.Length will give you two characters,
and obviously if you've got a char array the surrogate will take up two
positions.
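For instance (a quick sketch):

using System;

class SurrogateLengthCheck
{
    static void Main()
    {
        // One Unicode code point (U+10000), stored as a surrogate pair.
        string s = "\uD800\uDC00";

        Console.WriteLine(s.Length);               // 2 -- the pair counts as two chars
        Console.WriteLine(s.ToCharArray().Length); // 2 -- and takes two array positions
    }
}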

The whole issue comes down to the documentation not being very clear.

Agreed.

It says GetMaxByteCount takes the number of characters to encode but it doesn't
say in what encoding. If it's number of characters in UCS-2 then you're
right, 4 is the worst case

No, 3 is the worst case, isn't it?

if it's Unicode characters then 6 is the correct value.

Yes.

I'm not really sure what CLR says, if it treats character data as
Unicode or as UCS-2 encoded Unicode (and I'm not talking about the internal
representation here, I'm talking about what character data type actually
stands for).

Well, the System.Char data type is for a "Unicode 16-bit char" which
isn't terribly helpful, unfortunately. From the MSDN docs for
System.Char:

<quote>
The Char value type represents a Unicode character, also called a
Unicode code point, and is implemented as a 16-bit number ranging in
value from hexadecimal 0x0000 to 0xFFFF. A single Char cannot represent
a Unicode character that is encoded as a surrogate pair. However, a
String, which is a collection of Char objects, can represent a Unicode
character encoded as a surrogate pair.
</quote>

So the docs for GetMaxByteCount ought to be clear as to whether it's a
count of System.Chars or a count of full Unicode characters. I suspect
it's *meant* to be the former, but it should definitely be clearer.
 
Vladimir said:
/*
Encoding Class

Remarks
Methods are provided to convert arrays and strings !of Unicode characters!
to and from arrays of bytes encoded for a target code page.
*/

Therefore the maximal character count means Unicode (UTF-16) characters.

I don't think that's clear at all.

And it seems the implementation of ASCIIEncoding.GetBytes() doesn't know
anything about surrogate pairs.

I think in general the Encoding implementations don't guarantee to give
good results when they're passed characters which aren't in their
character set. Certainly ASCIIEncoding doesn't perform optimally in
such a situation.
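One way to see what ASCIIEncoding actually does with a surrogate pair is just
to dump the bytes it produces (a quick sketch; I wouldn't rely on the exact
output, which has differed between framework versions):

using System;
using System.Text;

class AsciiSurrogateCheck
{
    static void Main()
    {
        // A surrogate pair (U+10000) pushed through ASCIIEncoding.  Dump the
        // result rather than assume any particular fallback behaviour.
        byte[] bytes = Encoding.ASCII.GetBytes("\uD800\uDC00");

        Console.WriteLine(bytes.Length);
        foreach (byte b in bytes)
        {
            Console.Write("{0:X2} ", b);
        }
        Console.WriteLine();
    }
}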
 
if it's Unicode characters then 6 is the correct value.

This is what I tried to combat earlier.
The Unicode range is 0-10FFFF.
This means max 4 bytes. Anything above 4 is possible but incorrect, and
can be produced only by broken encoders.

However, a
String, which is a collection of Char objects, can represent a Unicode
character encoded as a surrogate pair.

This tells me it is aware of surrogates, and in fact uses UTF-16.
The Windows API for NT/2000/XP/2003 is UCS-2, but .NET might be UTF-16.
 