Crazy with character encoding

  • Thread starter: Zhiv Kurilka
Branco said:
The Windows ANSI encoding (encoding number 1252) usually works for me,
because (AFAIK) it doesn't apply any transformation to the individual
byte, i.e., there's a mapping from each byte value to each ANSI char,
for a total of 256 possible chars (control chars included).

How can you say there isn't any transformation, and then talk about
there being a mapping from each byte value to a character? That *is*
the transformation.

Talking about "the" Windows ANSI Encoding is like talking about "the"
extended ASCII encoding. There are lots of different encodings which
exhibit the same behaviour as 1252, i.e. they have a mapping from any
byte to one of the 256 characters they represent. Each represents a
different set of 256 characters.
This is not the case with the ANSI encoding. In ANSI, each byte value
matches a corresponding char. Of course, if the string you're encoding
contains chars outside the ANSI range, such chars will be
misrepresented. Also, if you read a non-ansi sequence of bytes and
convert them to a string using ANSI, you'll probably get some strange
results.

Exactly - so it's like any other encoding: you've got to make sure you
use the right one.

Code page 1252 has no magic powers.

Jon
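
To make the "different set of 256 characters" point concrete, here is a
minimal C# sketch. The decoded results assume the standard Windows code page
tables (built into .NET Framework; on later .NET versions these code pages
need CodePagesEncodingProvider registered first):

using System;
using System.Text;

class SameByteDifferentChars
{
    static void Main()
    {
        // One byte value, decoded under three different single-byte
        // Windows code pages. Each maps all 256 byte values, but to a
        // different set of 256 characters.
        byte[] data = new byte[] { 0xC8 };
        int[] codePages = new int[] { 1252, 1251, 1253 };

        foreach (int cp in codePages)
        {
            string s = Encoding.GetEncoding(cp).GetString(data);
            Console.WriteLine("{0}: {1} (U+{2:X4})", cp, s, (int)s[0]);
        }
        // Expected output: È for 1252, И (Cyrillic) for 1251,
        // Θ (Greek) for 1253.
    }
}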
 
"Jon Skeet [C# MVP]" <[email protected]> ha scritto nel messaggio

There's no way you can do that with a single byte, as a char is a
16-bit value and a byte is an 8-bit value.

Wait.
Let's take for a moment VB6.
It uses Unicode strings, but Chr(200) (the "È" for example) is always
perfectly reversible into 200 using ASC("È").

This works well if I use (char)200 <--> (byte)'È'.

There is no way to do that if I use an encoder: the char I encode is not
returned correctly when I decode it, i.e. I can encode "È" into a byte value
(of course using a NON double byte encoder) and when I decode it back I could
get a "§".

This is not so good when I do communications via socket or via RS232.

The ASCII table gives a number (and only one) for each char.
The encoder/decoder seems to assign different chars to the same number, or
seems to lose information, so when I decode the number I could get a char
that is not the one encoded.
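
A minimal sketch of the two ways an "È" can come back as something else: the
encoding simply lacks the character, or the two sides use different
encodings. The specific characters shown assume the standard Windows code
page tables:

using System;
using System.Text;

class WhereTheCharGoes
{
    static void Main()
    {
        Encoding win1252 = Encoding.GetEncoding(1252);

        // Same encoding both ways, and the encoding contains È:
        // the round trip is exact.
        byte[] b = win1252.GetBytes("È");                   // { 0xC8 }
        Console.WriteLine(win1252.GetString(b));            // È

        // An encoding that lacks È substitutes a fallback character
        // ('?' by default), so the original char is unrecoverable.
        byte[] a = Encoding.ASCII.GetBytes("È");
        Console.WriteLine(Encoding.ASCII.GetString(a));     // ?

        // Mismatched encodings reinterpret the byte: 0xC8 decoded
        // as Windows-1253 (Greek) is a different character.
        Console.WriteLine(Encoding.GetEncoding(1253).GetString(b)); // Θ
    }
}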
 
Fabio said:
Wait.
Let's take for a moment VB6.
It uses Unicode strings, but Chr(200) (the "È" for example) is always
perfectly reversible into 200 using ASC("È").

This works well if I use (char)200 <--> (byte)'È'.

Are you suggesting that VB magically manages to represent 65536
different values in a single byte? I suspect you'll find there are
plenty of Unicode characters (actually UCS-2 characters - let's not go
into full Unicode > U+FFFF for the moment) for which ASC doesn't work
on systems with a fixed single-byte default character encoding.
There is no way to do that if I use an encoder: the char I encode is not
returned correctly when I decode it, i.e. I can encode "È" into a byte value
(of course using a NON double byte encoder) and when I decode it back I could
get a "§".

If you use the same encoding for both encoding and decoding, *and* if
that encoding supports the character you wish to encode, it will always
return the correct character.
This is not so good when I do communications via socket or via RS232.

Well, it's not so good if you don't use the same encoding on both
sides...
The ASCII table gives a number (and only one) for each char.
The encoder/decoder seems to assign different chars to the same number, or
seems to lose information, so when I decode the number I could get a char
that is not the one encoded.

You still seem to be confused as to the purpose of encodings. Please
read
http://www.pobox.com/~skeet/csharp/unicode.html

Jon
 
Jon Skeet said:
If you use the same encoding for both encoding and decoding, *and* if
that encoding supports the character you wish to encode, it will always
return the correct character.

I could be confused about this, but I'm not so stupid as to use different
encoders to encode and decode.
If I get some time I'll provide an example.
 
Fabio said:
I could be confused about this, but I'm not so stupid as to use different
encoders to encode and decode.

And similarly the designers of encodings aren't so stupid as to stop
you from encoding and then decoding to get back the original text :)
If I get some time I'll provide an example.

That would be good. I suspect you'll find it hard to provide one
without including characters which aren't supported by the chosen
encoding (or a code error).

Jon
 
That would be good. I suspect you'll find it hard to provide one
without including characters which aren't supported by the chosen
encoding (or a code error).

:)

Ok, waiting for it, can you give me an example that can convert a byte[]
that contains all the 0..255 byte values to a string and that converts it
back to the original byte array?
 
Fabio said:
That would be good. I suspect you'll find it hard to provide one
without including characters which aren't supported by the chosen
encoding (or a code error).

:)

Ok, waiting for it, can you give me an example that can convert a byte[]
that contains all the 0..255 byte values to a string and that converts it
back to the original byte array?

Sure - although it's not a good idea (see later).

using System;
using System.Text;

class Test
{
    static void Main()
    {
        byte[] b = new byte[256];
        for (int i = 0; i < 256; i++)
        {
            b[i] = (byte)i;
        }
        // ISO-8859-1 (Latin-1) maps every byte 0-255 to the Unicode
        // character with the same value, so the round trip is lossless.
        Encoding enc = Encoding.GetEncoding(28591);
        string x = enc.GetString(b);

        byte[] o = enc.GetBytes(x);
        Console.WriteLine ("Length={0}", o.Length);
        for (int i = 0; i < 256; i++)
        {
            if (o[i] != i)
            {
                Console.WriteLine ("Difference at index {0}", i);
            }
        }
    }
}

Now, that's demonstrating that it happens to work, but it's not a good
way of encoding arbitrary binary data. To do that, I'd recommend using
Base64 - Convert.ToBase64String and Convert.FromBase64String.

Encodings should be used when you *start* with text data, encode it to
binary, and then decode that binary to text data. Decoding binary data
which didn't really start off as text and then get encoded is a bad
idea.

Jon
 
Fabio said:
That would be good. I suspect you'll find it hard to provide one
without including characters which aren't supported by the chosen
encoding (or a code error).

:)

Ok, waiting for it, can you give me an example that can convert a byte[]
that contains all the 0..255 byte values to a string and that converts it
back to the original byte array?

You didn't say anything about requiring the string to contain the same
number of characters as the byte[] array has members, so:

using System;

class Program
{
    static void Main(string[] args)
    {
        byte[] data = new byte[1024];

        // Fill the array with four copies of every byte value 0..255.
        for (int i = 0; i <= 255; i++)
            data[i] =
                data[i + 256] =
                data[i + 512] =
                data[i + 768] = (byte)i;

        // I have some byte data, but I can't print it!

        string printable = Convert.ToBase64String(data);

        // Now I have the data in printable form, look:

        Console.WriteLine(printable);

        // I should be able to get the data back, of course:

        byte[] data2 = Convert.FromBase64String(printable);

        // Is it the same?
        bool theSame = true;

        if (data.Length == data2.Length)
        {
            for (int i = 0; i < data.Length; i++)
            {
                if (data[i] != data2[i])
                {
                    theSame = false;
                    break;
                }
            }
        }
        else
        {
            theSame = false;
        }

        if (theSame)
            Console.WriteLine("Data is the same after transformation");
        else
            Console.WriteLine("Data is NOT the same!!!!");

        Console.ReadLine();
    }
}
 
Jon Skeet [C# MVP] wrote (inline):

How can you say there isn't any transformation, and then talk about
there being a mapping from each byte value to a character? That *is*
the transformation.

I thought it was clear that the kind of transformation I was talking
about had to do with dropping control chars or composition of chars
outside the Ansi range (codes 0 to 255). Of course, mapping a single
byte to the corresponding (Ansi) char is the actual transformation.
Thanks for pointing it out.
Talking about "the" Windows ANSI Encoding is like talking about "the"
extended ASCII encoding. There are lots of different encodings which
exhibit the same behaviour as 1252, i.e. they have a mapping from any
byte to one of the 256 characters they represent. Each represents a
different set of 256 characters.

I guess you're right when you say that there are other encodings that
act like the Ansi encoding, i.e., provide a one-to-one mapping from
byte to char. It would be nice if someone (yourself, perhaps) took the
time to identify them. People having to deal with legacy encodings
would certainly appreciate that.

On the other hand, I assume that there is *the* Ansi encoding,
comprising the 256 chars chosen by Microsoft to represent the Western
European latin char set, loosely based on an ANSI draft of the time
(thus the characterization as Windows-Ansi), which is code page 1252.
Of course, I may be wrong.

Code page 1252 has no magic powers.

:-)) No, it certainly hasn't.

Best regards,

Branco.
 
Branco Medeiros said:
I thought it was clear that the kind of transformation I was talking
about had to do with dropping control chars or composition of chars
outside the Ansi range (codes 0 to 255).

No - although *something* has to happen to characters outside the range
of the character set. (Note that Windows-1252 is definitely *not*
Unicode 0-255. They differ in the range 128 to 159 inclusive.)
Of course, mapping a single
byte to the corresponding (Ansi) char is the actual transformation.
Thanks for pointing it out.

And that's the same kind of thing that other encodings do, except they
may not be single byte to single char.
I guess you're right when you say that there are other encodings that
act like the Ansi encoding, i.e., provide a one-to-one mapping from
byte to char. It would be nice if someone (yourself, perhaps) took the
time to identify them. People having to deal with legacy encodings
would certainly appreciate that.

On the other hand, I assume that there is *the* Ansi encoding,
comprising the 256 chars chosen by Microsoft to represent the Western
European latin char set, loosely based on an ANSI draft of the time
(thus the characterization as Windows-Ansi), which is code page 1252.
Of course, I may be wrong.

I *think* you are, I'm afraid.

http://www.stylusstudio.com/xsllist/200205/post01200.html
and
http://www.stylusstudio.com/xsllist/200205/post61190.html
have a bit more information.

For another example of a character encoding which could be regarded as
an "ANSI" encoding, consider ASCII. This is also known as
ANSI_X3.4-1968 (according to
http://www.iana.org/assignments/character-sets)

I *believe* people often talk about whatever their default
256-character encoding is as an "ANSI encoding" - and that's not always
Windows-1252.

For more evidence of this, see
http://en.wikipedia.org/wiki/Code_page#Windows_.28ANSI.29_code_pages

In particular:
<quote>
Microsoft defined a number of code pages known as the ANSI code pages
(as the first one, 1252 was based on an ansi draft of what became ISO
8859-1).
</quote>
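
A short sketch of the 128-159 difference mentioned above; byte 0x80 is the
clearest case, assuming the standard mapping tables:

using System;
using System.Text;

class Range80To9F
{
    static void Main()
    {
        // Windows-1252 and ISO-8859-1 (code page 28591) agree everywhere
        // except 0x80-0x9F: 1252 assigns mostly printable characters
        // there, while ISO-8859-1 maps those bytes to the C1 controls.
        byte[] b = new byte[] { 0x80 };

        char c1252 = Encoding.GetEncoding(1252).GetString(b)[0];
        char cLatin1 = Encoding.GetEncoding(28591).GetString(b)[0];

        Console.WriteLine("1252:  U+{0:X4}", (int)c1252);   // U+20AC (the euro sign)
        Console.WriteLine("28591: U+{0:X4}", (int)cLatin1); // U+0080 (a control char)
    }
}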
 
Ok, waiting for it, can you give me an example that can convert a byte[]
that contains all the 0..255 byte values to a string and that converts it
back to the original byte array?
Wrong.
Most encodings have undefined areas and do not cover the complete range from
0 to 255. So some values will not be converted to Unicode (because they are
not allocated in the original encoding, to begin with).

If 0..255 is what you need, then it is not text data, it is binary data,
and you should use some other way to convert it to text for transfer
(MIME, BinHex, etc.).
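
A minimal sketch of such an undefined area, using Shift-JIS (code page 932)
as the example and assuming .NET's default replacement fallback:

using System;
using System.Text;

class LossyRoundTrip
{
    static void Main()
    {
        // Shift-JIS (code page 932) does not define a character for every
        // byte sequence: 0x81 is a lead byte that needs a valid trail byte.
        byte[] original = new byte[] { 0x41, 0x81 };

        Encoding sjis = Encoding.GetEncoding(932);
        string s = sjis.GetString(original);      // undecodable input is
                                                  // replaced ('?' under the
                                                  // default fallback)
        byte[] roundTripped = sjis.GetBytes(s);

        Console.WriteLine(BitConverter.ToString(original));     // 41-81
        Console.WriteLine(BitConverter.ToString(roundTripped)); // 41-3F
    }
}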
 
On the other hand, I assume that there is *the* Ansi encoding,
comprising the 256 chars chosen by Microsoft to represent the Western
European latin char set, loosely based on an ANSI draft of the time
(thus the characterization as Windows-Ansi), which is code page 1252.
Of course, I may be wrong.
What MS documentation means when it says "ANSI code page" is not 1252.
It is the "default system code page" and depends on the system locale.
It is 932 on Japanese systems, 1251 on Russian, and so on
(you can get the ANSI CP for a locale by using
GetLocaleInfo with LOCALE_IDEFAULTANSICODEPAGE).
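
In managed code the same per-locale value is exposed as
TextInfo.ANSICodePage; a minimal sketch (the printed numbers assume standard
Windows locale data):

using System;
using System.Globalization;

class DefaultAnsiCodePages
{
    static void Main()
    {
        // TextInfo.ANSICodePage is the managed counterpart of
        // GetLocaleInfo with LOCALE_IDEFAULTANSICODEPAGE.
        string[] names = new string[] { "en-US", "ja-JP", "ru-RU", "el-GR" };
        foreach (string name in names)
        {
            CultureInfo culture = new CultureInfo(name);
            Console.WriteLine("{0}: {1}", name, culture.TextInfo.ANSICodePage);
        }
        // Typically prints 1252, 932, 1251 and 1253 respectively.
    }
}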
 
If 0..255 is what you need, then it is not text data, it is binary data,
and you should use some other way to convert it to text for transfer
(MIME, BinHex, etc.).

A string is not just "text".
It is a sequence of chars (which in memory are bytes).

So I think you all are definitely talking in a different language than me
about this issue.

My initial code works well and cannot be replaced by some trick such as Mime
or Base64 encoding, which transforms the original value.

The old CopyMemory() did the work the way I wanted, because it does not say
to itself "oh! this is not text! I refuse to convert it to bytes".
It treats strings for what they are: a sequence of bytes, nothing more,
nothing less.

;)
 
Fabio said:
A string is not just "text".
It is a sequence of chars (which in memory are bytes).

The in-memory encoding happens to be UTF-16. It's almost irrelevant
though.
So I think you all are definitely talking in a different language than me
about this issue.

My initial code works well and cannot be replaced by some trick such as Mime
or Base64 encoding, which transforms the original value.

When you're passing binary data around as text, you really want to make
sure it doesn't get screwed up by systems which assume null-terminated
strings etc. Base64 copes with this. Your code doesn't.
The old CopyMemory() did the work the way I wanted, because it does not say
to itself "oh! this is not text! I refuse to convert it to bytes".
It treats strings for what they are: a sequence of bytes, nothing more,
nothing less.

You're doomed to run into encoding issues with that mentality, I'm
afraid. Treat binary data as binary data, text as text, and encode
between the two in rigidly defined ways. Anything else leads to
problems.
 
Jon Skeet said:
You're doomed to run into encoding issues with that mentality, I'm
afraid. Treat binary data as binary data, text as text, and encode
between the two in rigidly defined ways. Anything else leads to
problems.

Ok :)
With my mentality I'm doomed to make serial port and socket communications
work [efficiently] :)
With the "bug free text encoding mentality" they don't.

I'll accept my doom on this argument :)

I'll leave to Base64 and Mime encoding their role: sending and receiving
e-mails.
 
Fabio said:
You're doomed to run into encoding issues with that mentality, I'm
afraid. Treat binary data as binary data, text as text, and encode
between the two in rigidly defined ways. Anything else leads to
problems.

Ok :)
With my mentality I'm doomed to make serial port and socket communications
work [efficiently] :)

Serial ports and sockets deal with binary data. If you've got binary
data you want to send across serial ports and sockets, you shouldn't be
converting it to or from a string to start with.
With "bug free text encoding mentality" them don't.

I'll accept my doom on this argument :)

I'll leave to Base64 and Mime encoding their role: sending and receiving
e-mails.

I don't remember anyone other than yourself bringing up mime encoding
(although I could be wrong). Base64 has plenty of uses outside email.
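
For what it's worth, a minimal sketch of sending binary data without any
string in the pipeline; SendFrame is a hypothetical helper, and the length
prefix is just one common framing choice:

using System.IO;

class BinarySender
{
    // Hypothetical helper: the payload goes to the stream (for example a
    // NetworkStream obtained from a socket) as raw bytes, with no string
    // and no Encoding involved at any point.
    public static void SendFrame(Stream stream, byte[] payload)
    {
        // Length-prefix the payload so the receiver knows where it ends;
        // no null terminators or other text conventions to trip over.
        byte[] lengthPrefix = System.BitConverter.GetBytes(payload.Length);
        stream.Write(lengthPrefix, 0, lengthPrefix.Length);
        stream.Write(payload, 0, payload.Length);
    }
}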
 
A string is not just "text".
It is a sequence of chars (which in memory are bytes).

I am not sure what you mean. Text is also "a sequence of chars".
What is the difference between text and string?

The main difference between text/string and "just bytes" is that not every
sequence of bytes constitutes valid text.
So I think you all are definitely talking in a different language than me
about this issue.

Probably.

My initial code works well and cannot be replaced by some trick such as
Mime or Base64 encoding, which transforms the original value.

Then

The old CopyMemory() did the work the way I wanted, because it does not say
to itself "oh! this is not text! I refuse to convert it to bytes".

The only code I have seen from you is this:

public static byte[] GetBytes(string s)
{
    byte[] b = new byte[s.Length];
    for (int i = 0; i < b.Length; ++i)
    {
        b[i] = (byte)s[i];
    }
    return b;
}

which casts from a character (16 bits) to a byte (8 bits).
So it is 100% sure to lose information.


It treats strings for what they are: a sequence of bytes, nothing more,
nothing less.

Nope. Strings are "a certain type of sequence of bytes".
Any string is a sequence of bytes, but not any sequence of bytes is a string.
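
A minimal sketch of the information loss from that cast, using the euro sign
as an example:

using System;

class CastLosesInformation
{
    static void Main()
    {
        // Casting a 16-bit char to an 8-bit byte silently discards the
        // high byte, so any character above U+00FF comes back wrong.
        char euro = '\u20AC';                        // the euro sign
        byte b = (byte)euro;                         // only 0xAC survives

        Console.WriteLine("U+{0:X4} -> 0x{1:X2}", (int)euro, b);
        Console.WriteLine((char)b);                  // ¬ (U+00AC), not the euro sign
    }
}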
 