Crazy with character encoding

Zhiv Kurilka · Aug 3, 2006

Hi,
I have a text file with the following content:
"((^)|(.* +))§§§§§§§§"

If I read it with:
System.IO.StreamReader k = new System.IO.StreamReader("file.txt", System.Text.Encoding.ASCII);

k.ReadToEnd();

then "§" is lost.

If I read it with UTF7,

then "+" is lost.

Please help: how can I read the file into a string so that I keep all the
characters?

Thanks

Jon Skeet [C# MVP] · Aug 3, 2006

Zhiv Kurilka said:
I have a text file with the following content:
"((^)|(.* +))§§§§§§§§"

If I read it with:
System.IO.StreamReader k = new System.IO.StreamReader("file.txt", System.Text.Encoding.ASCII);

k.ReadToEnd();

then "§" is lost.

Yes, because that isn't an ASCII character.

If I read it with UTF7,

then "+" is lost.

Yes, because it isn't a UTF-7 file. (UTF-7 is very rarely used outside
mail.)

Please help: how can I read the file into a string so that I keep all the
characters?

Well, what encoding is the file in? What created it?

Zhiv Kurilka · Aug 3, 2006

I created it by hand; it is just a number of characters. I suppose I
need a byte reader for that. Right?

Morten Wennevik · Aug 3, 2006

The question is rather what you created it with, and how you saved it.
I'm guessing it was saved with the default ANSI code page for your computer, in which case using Encoding.Default when reading it should give you the proper string.

Zhiv Kurilka · Aug 3, 2006

I created it with the Visual Studio text editor. It is a plain text file
(I suppose). I opened a new text document and typed those symbols into it.
Encoding.Default gives a wrong string.
I am trying a byte reader now.

Morten Wennevik · Aug 3, 2006

A byte reader won't help you as it needs the same kind of encoding as the StreamReader to be able to make sense of the bytes.

My Visual Studio 2005 seems to want to save a text file as Windows-1252, so you can try using that.

new StreamReader("file.txt", Encoding.GetEncoding("Windows-1252"));
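
To make the effect of the encoding choice concrete, here is a minimal sketch that decodes the same two bytes three ways. It assumes the file was saved in a single-byte Western code page, so that "§" is stored as the single byte 0xA7; the byte array and the class name EncodingDemo are illustrative and not taken from the actual file. It uses ISO-8859-1 (code page 28591) rather than Windows-1252 because the two agree for this character and 28591 should be available without extra setup on current .NET runtimes.

```csharp
using System;
using System.Text;

public static class EncodingDemo
{
    // Hypothetical file content: "+" followed by "§" as stored in a
    // single-byte Western code page (bytes 0x2B, 0xA7).
    public static readonly byte[] Sample = { 0x2B, 0xA7 };

    public static string Decode(Encoding encoding)
    {
        return encoding.GetString(Sample);
    }

    public static void Main()
    {
        // ASCII only defines bytes 0-127, so 0xA7 cannot be decoded.
        Console.WriteLine(Decode(Encoding.ASCII).Contains("\u00A7"));        // False
        // UTF-8 treats a lone 0xA7 as an invalid sequence.
        Console.WriteLine(Decode(Encoding.UTF8).Contains("\u00A7"));         // False
        // ISO-8859-1 maps byte 0xA7 straight to U+00A7 (the section sign).
        Console.WriteLine(Decode(Encoding.GetEncoding(28591)) == "+\u00A7"); // True
    }
}
```

The "+" survives in all three cases; only the byte above 127 depends on which encoding is assumed.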

Markus Stoeger · Aug 3, 2006

Zhiv said:
I created it with the Visual Studio text editor. It is a plain text file
(I suppose). I opened a new text document and typed those symbols into it.
Encoding.Default gives a wrong string.
I am trying a byte reader now.

Have you tried reading it as UTF-8? I think VS saves files in that format.

Max

Zhiv Kurilka · Aug 3, 2006

Some of the files were created using VS2003, others using VS2005. I need some way to
get the encoding from a file automatically. Is that possible?
P.S. I have tried UTF8. For most files it fails.
I am sorry, but I still don't understand what is going on. Why does the VS editor
show the files properly, while I can't write them?

Morten Wennevik · Aug 3, 2006

VS Editor shows files properly because it reads them using the correct encoding.
Have you tried Windows-1252?

Jon Skeet [C# MVP] · Aug 3, 2006

Zhiv Kurilka said:
Some of the files were created using VS2003, others using VS2005. I need some way to
get the encoding from a file automatically. Is that possible?

No. There are ways of making a reasonable guess, but it would still be
a guess.
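
As an illustration of what such a guess might look like, here is a minimal sketch, not a definitive detector. It checks for a byte order mark first, then accepts UTF-8 only if the whole input is valid UTF-8, and otherwise falls back to an encoding supplied by the caller. The names EncodingGuesser and Guess are hypothetical, and the fallback (ISO-8859-1 in the usage below) is an assumption about where the files came from.

```csharp
using System;
using System.Text;

public static class EncodingGuesser
{
    // Hypothetical helper: returns a best guess, never a certainty.
    public static Encoding Guess(byte[] data, Encoding fallback)
    {
        // 1. A byte order mark, when present, is a strong hint.
        if (data.Length >= 3 && data[0] == 0xEF && data[1] == 0xBB && data[2] == 0xBF)
            return Encoding.UTF8;
        if (data.Length >= 2 && data[0] == 0xFF && data[1] == 0xFE)
            return Encoding.Unicode;            // UTF-16 little-endian
        if (data.Length >= 2 && data[0] == 0xFE && data[1] == 0xFF)
            return Encoding.BigEndianUnicode;   // UTF-16 big-endian

        // 2. No BOM: accept UTF-8 only if every byte sequence is valid.
        try
        {
            new UTF8Encoding(false, true).GetString(data); // throws on invalid bytes
            return Encoding.UTF8;
        }
        catch (DecoderFallbackException)
        {
            // 3. Otherwise fall back to the caller-supplied encoding.
            return fallback;
        }
    }

    public static void Main()
    {
        Encoding latin1 = Encoding.GetEncoding(28591);
        byte[] legacy = { 0x2B, 0xA7 };       // "+§" in a single-byte code page
        byte[] utf8 = { 0x2B, 0xC2, 0xA7 };   // "+§" in UTF-8
        Console.WriteLine(Guess(legacy, latin1).CodePage); // 28591
        Console.WriteLine(Guess(utf8, latin1).CodePage);   // 65001
    }
}
```

Step 3 is where the guessing really happens: nothing in the bytes distinguishes one single-byte code page from another, so the fallback has to come from knowledge of how the files were produced. Pure ASCII input is reported as UTF-8, which is harmless because the two agree on those bytes.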

P.S. I have tried UTF8. For most files it fails.

So it's not UTF-8 and it's not the default encoding for the system.
That's fairly odd. Perhaps you could mail me some of the files?

I am sorry, but I still don't understand what is going on. Why does the VS editor
show the files properly, while I can't write them?

Visual Studio presumably guesses correctly what encoding they're in.

It sounds like you're still not really sure what an encoding is though.
See if
http://www.pobox.com/~skeet/csharp/unicode.html helps.

Fabio · Aug 3, 2006

Zhiv Kurilka said:
Hi,
I have a text file with the following content:
"((^)|(.* +))§§§§§§§§"

If I read it with:
System.IO.StreamReader k = new System.IO.StreamReader("file.txt", System.Text.Encoding.ASCII);

k.ReadToEnd();

then "§" is lost.

If I read it with UTF7,

then "+" is lost.

Please help: how can I read the file into a string so that I keep all the
characters?

I had the same problem: encodings convert bytes into chars using their own
rules, but I just want the char that corresponds to the byte without any
conversion.

The solution is boring but quite simple: read and write the file as a byte
array, and convert it by casting each byte to a char and back:

// Cast each char of a string to a byte (only lossless for chars 0-255).
public static byte[] GetBytes(string s)
{
    byte[] b = new byte[s.Length];
    for (int i = 0; i < b.Length; ++i)
    {
        b[i] = (byte)s[i];
    }
    return b;
}

// Same conversion for a char array.
public static byte[] GetBytes(char[] c)
{
    byte[] b = new byte[c.Length];
    for (int i = 0; i < b.Length; ++i)
    {
        b[i] = (byte)c[i];
    }
    return b;
}

// Build a string from bytes, one char per byte.
public static string GetString(byte[] buffer)
{
    return new string(GetChars(buffer));
}

// Cast each byte to the char with the same numeric value.
public static char[] GetChars(byte[] b)
{
    char[] c = new char[b.Length];
    for (int i = 0; i < b.Length; ++i)
    {
        c[i] = (char)b[i];
    }
    return c;
}

Jon Skeet [C# MVP] · Aug 3, 2006

Fabio said:
I had the same problem: encodings convert bytes into chars using their own
rules, but I just want the char that corresponds to the byte without any
conversion.

That's like saying you want the English that corresponds to a French
word without any translation.

The solution is boring but quite simple: read and write the file as a byte
array, and convert it by casting each byte to a char and back:

public static byte[] GetBytes(string s)
{
    byte[] b = new byte[s.Length];
    for (int i = 0; i < b.Length; ++i)
    {
        b[i] = (byte)s[i];
    }
    return b;
}

That's effectively using ISO-Latin-1 encoding. It's still an encoding.
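
A quick way to check this claim is to compare the cast-based conversion with the framework's ISO-8859-1 decoder for every possible byte value. The sketch below is illustrative only: the class and method names are hypothetical, and it assumes code page 28591 is the encoding meant by "ISO-Latin-1".

```csharp
using System;
using System.Text;

public static class CastVersusLatin1
{
    // Decode by casting each byte to a char, as in the quoted helper.
    public static string DecodeByCast(byte[] bytes)
    {
        char[] chars = new char[bytes.Length];
        for (int i = 0; i < bytes.Length; ++i)
        {
            chars[i] = (char)bytes[i];
        }
        return new string(chars);
    }

    // Compare the cast against Encoding 28591 for all 256 byte values.
    public static bool CastMatchesLatin1()
    {
        byte[] all = new byte[256];
        for (int i = 0; i < all.Length; ++i)
        {
            all[i] = (byte)i;
        }
        string viaEncoding = Encoding.GetEncoding(28591).GetString(all);
        return DecodeByCast(all) == viaEncoding;
    }

    public static void Main()
    {
        Console.WriteLine(CastMatchesLatin1());
    }
}
```

If the two strings match, the casting helpers add nothing over the built-in encoding; they just hide which encoding is being assumed.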

MyndPhlyp · Aug 3, 2006

Zhiv Kurilka said:
Hi,
I have a text file with the following content:
"((^)|(.* +))§§§§§§§§"

If I read it with:
System.IO.StreamReader k = new System.IO.StreamReader("file.txt", System.Text.Encoding.ASCII);

k.ReadToEnd();

then "§" is lost.

If I read it with UTF7,

then "+" is lost.

Please help: how can I read the file into a string so that I keep all the
characters?

Sounds to me like you are running into Unicode encoding - characters encoded
with both big-endian and little-endian. Try using Encoding.Unicode. See if
that helps.

Zhiv Kurilka · Aug 3, 2006

Dear Sirs,
I have uploaded the file:
http://a1234113.narod.ru/test.zip

I tried all your suggestions.
Dim _sr As New System.IO.StreamReader(_fn, System.Text.Encoding.XXXX)

m_filetext = _sr.ReadToEnd

_sr.Close()

Either + or § is missing or all is crap.

Could you give me some advice?

Thanks a lot

Jon Skeet [C# MVP] · Aug 3, 2006

Zhiv Kurilka said:
Dear Sirs,
I have uploaded the file:
http://a1234113.narod.ru/test.zip

I tried all your suggestions.
Dim _sr As New System.IO.StreamReader(_fn, System.Text.Encoding.XXXX)

m_filetext = _sr.ReadToEnd

_sr.Close()

Either + or § is missing or all is crap.

Could you give me some advice?

Encoding.Default works fine for me.

Fabio · Aug 3, 2006

Jon Skeet said:
That's like saying you want the English that corresponds to a French
word without any translation.

Mmmm... what I want is a two-way conversion.
Each character in the ASCII table has a corresponding byte, so I want a char
converted to a byte to be reversible, giving back the original value.

All the encoders exposed by System.Text that I tried apply some transformation
to the value, and the original information can be lost.

That's effectively using ISO-Latin-1 encoding. It's still an encoding.

Can I get this behavior directly from some .NET encoder?

Thanks

Jon Skeet [C# MVP] · Aug 3, 2006

Fabio said:
Mmmm... what I want is a two-way conversion.
Each character in the ASCII table has a corresponding byte, so I want a char
converted to a byte to be reversible, giving back the original value.

There's no way you can do that with a single byte, as a char is a
16-bit value and a byte is an 8-bit value.

All the encoders exposed by System.Text that I tried apply some transformation
to the value, and the original information can be lost.

If you want to encode arbitrary binary data as text data and then
decode it, you should use Base64 - that's what it's there for. Pretty
much any other scheme is asking for trouble.

If you want to encode arbitrary Unicode text data as binary data, I'd
normally suggest using UTF-8. It's efficient for "mainly ASCII" text,
and covers the whole of Unicode.
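
The two directions described above can be sketched as follows. This is a minimal illustration rather than a recommended API; the class and method names are hypothetical, and the sample data is made up.

```csharp
using System;
using System.Linq;
using System.Text;

public static class RoundTrips
{
    // Arbitrary binary data -> Base64 text -> the same binary data.
    public static bool Base64RoundTrips(byte[] data)
    {
        string text = Convert.ToBase64String(data);
        return Convert.FromBase64String(text).SequenceEqual(data);
    }

    // Well-formed Unicode text -> UTF-8 bytes -> the same text.
    public static bool Utf8RoundTrips(string text)
    {
        byte[] bytes = Encoding.UTF8.GetBytes(text);
        return Encoding.UTF8.GetString(bytes) == text;
    }

    public static void Main()
    {
        byte[] binary = { 0x00, 0x80, 0xA7, 0xFF };
        Console.WriteLine(Base64RoundTrips(binary));             // True
        Console.WriteLine(Utf8RoundTrips("((^)|(.* +))\u00A7")); // True
    }
}
```

The point is the direction of each conversion: Base64 starts from bytes and guarantees getting the same bytes back, while UTF-8 starts from text and guarantees getting the same text back.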

Can I get this behavior directly from some .NET encoder?

You can use Encoding.GetEncoding(28591), but be aware that between 128
and 159 there's a bit of a no-man's-land. There's contradictory
evidence, but some of it points to ISO-8859-1 not having any characters
defined in that range.

Herfried K. Wagner [MVP] · Aug 3, 2006

Jon Skeet said:
Encoding.Default works fine for me.

Maybe the OP's version of Windows uses a different default Windows-ANSI
codepage.

Jon Skeet [C# MVP] · Aug 3, 2006

Herfried K. Wagner said:
Maybe the OP's version of Windows uses a different default Windows-ANSI
codepage.

But in that case, I'd have expected Visual Studio to use that default
encoding too - if it works in Studio and it's CP-1252, I can't think
why Studio would choose 1252 instead of the default code page.

Branco Medeiros · Aug 4, 2006

Fabio wrote:

Mmmm... what I want is a two-way conversion.
Each character in the ASCII table has a corresponding byte, so I want a char
converted to a byte to be reversible, giving back the original value.

All the encoders exposed by System.Text that I tried apply some transformation
to the value, and the original information can be lost.

Can I get this behavior directly from some .NET encoder?

The Windows ANSI encoding (code page 1252) usually works for me,
because (AFAIK) it doesn't apply any transformation to the individual
byte, i.e., there's a mapping from each byte value to an ANSI char,
for a total of 256 possible chars (control chars included).

Dim E As System.Text.Encoding = _
System.Text.Encoding.GetEncoding(1252)

Many encodings use two or four bytes to represent a char;
others use a multibyte system where some specific byte values indicate
that the following sequence is a multibyte char.

This is not the case with the ANSI encoding. In ANSI, each byte value
matches a corresponding char. Of course, if the string you're encoding
contains chars outside the ANSI range, such chars will be
misrepresented. Also, if you read a non-ANSI sequence of bytes and
convert it to a string using ANSI, you'll probably get some strange
results.

HTH.

Regards,

Branco.
