Crazy with character encoding

  • Thread starter: Zhiv Kurilka

Zhiv Kurilka

Hi,
I have a text file with the following content:
"((^)|(.* +))§§§§§§§§"

If I read it with:
var k = new System.IO.StreamReader("file.txt", System.Text.Encoding.ASCII);
string text = k.ReadToEnd();

then "§" is lost.

If I read it with UTF7, then "+" is lost.

Please help: how can I read the file into a string so that I keep all
characters?

Thanks
 
Zhiv Kurilka said:
> I have a text file with following content:
> "((^)|(.* +))§§§§§§§§"
>
> if I read it with:
> k = new System.IO.StreamReader("file.txt", System.Text.Encoding.ASCII);
> k.ReadToEnd()
>
> Then "§" is lost

Yes, because that isn't an ASCII character.

> If read it with UTF7
> then "+" is lost.

Yes, because it isn't a UTF-7 file. (UTF-7 is very rarely used outside
mail.)

> Please help, how can I read the file into string so that I have all
> characters?

Well, what encoding is the file in? What created it?
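
For readers following along, the ASCII failure can be reproduced without a file. This is only a minimal sketch, assuming the file stores "§" as the single byte 0xA7 (as both Windows-1252 and ISO-8859-1 do); the class and member names are made up for illustration. The UTF-7 failure has a different cause: in UTF-7, "+" is the shift character that introduces a Base64 run, so a decoder consumes it instead of emitting it.

```csharp
using System;
using System.Text;

public static class AsciiLossDemo
{
    // The text "+§" as a single-byte Western encoding would store it:
    // '+' is 0x2B, '§' is 0xA7 (assumption: Windows-1252 or ISO-8859-1 file).
    public static readonly byte[] Sample = { 0x2B, 0xA7 };

    // ASCII only defines bytes 0..127, so the decoder substitutes '?' for 0xA7.
    public static string DecodeAsAscii()
    {
        return Encoding.ASCII.GetString(Sample);
    }

    // ISO-8859-1 (code page 28591) maps byte 0xA7 to U+00A7, the section sign.
    public static string DecodeAsLatin1()
    {
        return Encoding.GetEncoding(28591).GetString(Sample);
    }
}
```

The bytes on disk are the same in both calls; only the interpretation changes, which is why the question "what encoding is the file in?" has to be answered first.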
 
I created it by hand; it is just a sequence of characters. I suppose I
need a byte reader for that. Right?

 
The question is rather what you created it with, and how you saved it.
I'm guessing it was saved with the default ANSI code page for your computer, in which case using Encoding.Default when reading it should give you the proper string.
 
I created it with the Visual Studio text editor. It is a plain text file
(I suppose). I opened a new text document and typed those symbols into it.
Encoding.Default gives the wrong string.
I am trying a byte reader now.
 
A byte reader won't help you, as it needs the same kind of encoding as the StreamReader to be able to make sense of the bytes.

My Visual Studio 2005 seems to want to save a text file as Windows-1252, so you can try using that.

new StreamReader("file.txt", Encoding.GetEncoding("Windows-1252"));
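
A slightly fuller sketch of reading a file with an explicitly chosen encoding, in case it helps. The class and method names are hypothetical, and `RoundTrip` exists only so the idea can be tried without a pre-existing file. On .NET Framework, `Encoding.GetEncoding("Windows-1252")` can be passed in directly; on .NET Core and .NET 5+, code page 1252 is only available after calling `Encoding.RegisterProvider(CodePagesEncodingProvider.Instance)`.

```csharp
using System.IO;
using System.Text;

public static class ExplicitEncodingRead
{
    // Reads the whole file, interpreting its bytes with the given encoding.
    public static string ReadAll(string path, Encoding encoding)
    {
        using (var reader = new StreamReader(path, encoding))
        {
            return reader.ReadToEnd();
        }
    }

    // Illustration helper: writes 'text' to a temp file in 'encoding'
    // (no byte order mark), then reads it back with the same encoding.
    public static string RoundTrip(string text, Encoding encoding)
    {
        string path = Path.GetTempFileName();
        try
        {
            File.WriteAllBytes(path, encoding.GetBytes(text));
            return ReadAll(path, encoding);
        }
        finally
        {
            File.Delete(path);
        }
    }
}
```

The point of the sketch is that the encoding used for reading has to match the one used for writing; the reader cannot recover that information from the bytes alone.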



 
Zhiv said:
I have created it with Visual Studio text editor. It is a plain text file.
(I suppose). I opened a new text document and wrote those symbols into it.
Encoding.default gives a wrong string.
I am trying byte reader now

Have you tried reading it as UTF-8? I think VS saves files in that format.

Max
 
Some of the files were created using VS2003, others using VS2005. I need some
way to get the encoding from a file automatically. Is that possible?
P.S. I have tried UTF8. For most files it fails.
I am sorry, but I still don't understand what is going on. Why does the VS
editor show the files properly, but I can't write them?
 
VS Editor shows files properly because it reads them using the correct encoding.
Have you tried Windows-1252?
 
Zhiv Kurilka said:
> Some of the files are created using VS2003 others VS2005. I need some way to
> get encoding from file automatically. Is it possible?

No. There are ways of making a reasonable guess, but it would still be
a guess.

> P.S. I have tried UTF8. For most files it fails.

So it's not UTF-8 and it's not the default encoding for the system.
That's fairly odd. Perhaps you could mail me some of the files?

> I am sorry, but I still don't understand what is going on. Why VS editor
> shows files properly, but I can't write them?

Visual Studio presumably guesses correctly what encoding they're in.

It sounds like you're still not really sure what an encoding is, though.
See if http://www.pobox.com/~skeet/csharp/unicode.html helps.
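
One possible shape for such a "reasonable guess", offered as a sketch rather than as what Visual Studio actually does: look for a byte order mark, then accept UTF-8 only if the bytes are strictly valid UTF-8, and otherwise fall back to a legacy code page supplied by the caller. `EncodingGuesser` and `Guess` are hypothetical names, and the heuristic can still be wrong (for example, a pure-ASCII file is valid in almost every encoding).

```csharp
using System;
using System.Text;

public static class EncodingGuesser
{
    // Returns a best-effort guess. 'fallback' is whatever legacy encoding
    // the caller expects when nothing else matches (an assumption, not a fact).
    public static Encoding Guess(byte[] data, Encoding fallback)
    {
        // 1. A byte order mark is the strongest signal available.
        //    (UTF-32 BOMs are deliberately ignored in this sketch.)
        if (data.Length >= 3 && data[0] == 0xEF && data[1] == 0xBB && data[2] == 0xBF)
            return new UTF8Encoding(true);
        if (data.Length >= 2 && data[0] == 0xFF && data[1] == 0xFE)
            return Encoding.Unicode;           // UTF-16 little-endian
        if (data.Length >= 2 && data[0] == 0xFE && data[1] == 0xFF)
            return Encoding.BigEndianUnicode;  // UTF-16 big-endian

        // 2. No BOM: accept UTF-8 only if every byte sequence is valid.
        try
        {
            // throwOnInvalidBytes = true makes the decoder throw instead of
            // silently substituting replacement characters.
            new UTF8Encoding(false, true).GetString(data);
            return new UTF8Encoding(false);
        }
        catch (DecoderFallbackException)
        {
            // 3. Not valid UTF-8: assume the caller-supplied legacy encoding.
            return fallback;
        }
    }
}
```

A lone 0xA7 byte is not valid UTF-8, so a file like the one in this thread would fall through to the fallback, which is consistent with the report that UTF-8 fails for most of the files.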
 
Zhiv Kurilka said:
Please help, how can I read the file into string so that I have all
characters?

I had the same problem: encodings convert bytes into chars using their own
rules, but I just want the char that corresponds to each byte, without any
conversion.

The solution is boring but quite simple: read and write the file as a byte
array, and convert by casting each byte to a char and back:


// Narrows each char to a byte; chars above 255 lose their high byte.
public static byte[] GetBytes(string s)
{
    byte[] b = new byte[s.Length];
    for (int i = 0; i < b.Length; ++i)
    {
        b[i] = (byte)s[i];
    }
    return b;
}

// Same conversion for a char array.
public static byte[] GetBytes(char[] c)
{
    byte[] b = new byte[c.Length];
    for (int i = 0; i < b.Length; ++i)
    {
        b[i] = (byte)c[i];
    }
    return b;
}

public static string GetString(byte[] buffer)
{
    return new string(GetChars(buffer));
}

// Widens each byte to the char with the same numeric value.
public static char[] GetChars(byte[] b)
{
    char[] c = new char[b.Length];
    for (int i = 0; i < b.Length; ++i)
    {
        c[i] = (char)b[i];
    }
    return c;
}
 
Fabio said:
> I had the same problem: encodings convert bytes into chars using their
> rules, but I just want the char that corresponds to the byte without any
> conversion.

That's like saying you want the English that corresponds to a French
word without any translation.

> The solution is boring but quite simple: read and write the file as a byte
> array and restore it casting each byte into char and back:
>
> public static byte[] GetBytes(string s)
> {
>     byte[] b = new byte[s.Length];
>     for (int i = 0; i < b.Length; ++i)
>     {
>         b[i] = (byte)s[i];
>     }
>     return b;
> }

That's effectively using ISO-Latin-1 encoding. It's still an encoding.
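
The equivalence can be checked directly. The sketch below compares the cast-based decoding with .NET's ISO-8859-1 encoding (code page 28591) over all 256 byte values; the class and method names are hypothetical, and the comparison says something about .NET's implementation of that code page rather than about the ISO standard text.

```csharp
using System;
using System.Text;

public static class Latin1Equivalence
{
    // The cast-based approach: widen each byte to the char with the same value.
    public static string CastDecode(byte[] bytes)
    {
        char[] chars = new char[bytes.Length];
        for (int i = 0; i < bytes.Length; ++i)
        {
            chars[i] = (char)bytes[i];
        }
        return new string(chars);
    }

    // The same bytes decoded through the framework's ISO-8859-1 encoding.
    public static string Latin1Decode(byte[] bytes)
    {
        return Encoding.GetEncoding(28591).GetString(bytes);
    }

    // True when both approaches agree for every possible byte value.
    public static bool AgreeOnAllBytes()
    {
        byte[] all = new byte[256];
        for (int i = 0; i < 256; ++i)
        {
            all[i] = (byte)i;
        }
        return CastDecode(all) == Latin1Decode(all);
    }
}
```

If the two agree, the hand-written loops can be replaced by the built-in encoding, which also makes the choice of encoding explicit in the code.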
 
Zhiv Kurilka said:
Please help, how can I read the file into string so that I have all
characters?

Sounds to me like you are running into Unicode encoding - characters encoded
with both big-endian and little-endian. Try using Encoding.Unicode. See if
that helps.
 
Dear Sirs,
I have uploaded the file:
http://a1234113.narod.ru/test.zip

I tried all your suggestions, with each suggested encoding in place of XXXX:

Dim _sr As New System.IO.StreamReader(_fn, System.Text.Encoding.XXXX)
m_filetext = _sr.ReadToEnd()
_sr.Close()

Either "+" or "§" is missing, or everything is garbage.

Could you give me some advice?

Thanks a lot
 
Zhiv Kurilka said:
> I have uploaded the file:
> http://a1234113.narod.ru/test.zip
>
> I tried all your suggestions.
> Either + or § is missing or all is crap.

Encoding.Default works fine for me.
 
Jon Skeet said:
> That's like saying you want the English that corresponds to a French
> word without any translation.

Mmmm... what I want is a two-way conversion.
Each character in the ASCII table has a corresponding byte, so I want a
char converted to a byte to be convertible back to the original value.

All the encoders exposed by System.Text that I tried transform the value
in some way, so the original information can be lost.

> That's effectively using ISO-Latin-1 encoding. It's still an encoding.

Can I get this behavior directly from some .NET encoder?

Thanks
 
Fabio said:
> Mmmm... what I want is a "double way" conversion.
> Each character in the ASCII table has a corresponding byte, so I want that
> a char converted to a byte can be reversed having back the original value.

There's no way you can do that with a single byte, as a char is a
16-bit value and a byte is an 8-bit value.

> All the encoders exposed by System.Text I tried do some transformation on
> the value and the original information can be lost.

If you want to encode arbitrary binary data as text data and then
decode it, you should use Base64 - that's what it's there for. Pretty
much any other scheme is asking for trouble.

If you want to encode arbitrary Unicode text data as binary data, I'd
normally suggest using UTF-8. It's efficient for "mainly ASCII" text,
and covers the whole of Unicode.

> Can I have this behavior directly via some .Net encoder?

You can use Encoding.GetEncoding(28591), but be aware that between 128
and 159 there's a bit of a no-man's-land. There's contradictory
evidence, but some of it points to ISO-8859-1 not having any characters
defined in that range.
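
The two recommendations above can be sketched as a pair of round trips: Base64 when the starting point is arbitrary bytes, UTF-8 when the starting point is text. `RoundTrips` and its methods are hypothetical names used only for illustration.

```csharp
using System;
using System.Text;

public static class RoundTrips
{
    // Arbitrary binary data -> text -> the same binary data.
    public static byte[] BinaryViaBase64(byte[] data)
    {
        string text = Convert.ToBase64String(data);
        return Convert.FromBase64String(text);
    }

    // Arbitrary Unicode text -> bytes -> the same text.
    public static string TextViaUtf8(string s)
    {
        byte[] bytes = Encoding.UTF8.GetBytes(s);
        return Encoding.UTF8.GetString(bytes);
    }

    // Number of bytes UTF-8 needs for the given text; ASCII characters
    // take one byte each, while "§" (U+00A7) takes two.
    public static int Utf8Length(string s)
    {
        return Encoding.UTF8.GetByteCount(s);
    }
}
```

Neither direction relies on a one-byte-per-char assumption, which is what makes both round trips lossless.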
 
Herfried K. Wagner said:
Maybe the OP's version of Windows uses a different default Windows-ANSI
codepage.

But in that case, I'd have expected Visual Studio to use that default
encoding too - if it works in Studio and it's CP-1252, I can't think
why Studio would choose 1252 instead of the default code page.
 
Fabio wrote:
> Mmmm... what I want is a "double way" conversion.
> Each character in the ASCII table has a corresponding byte, so I want that
> a char converted to a byte can be reversed having back the original value.
>
> All the encoders exposed by System.Text I tried do some transformation on
> the value and the original information can be lost.
> Can I have this behavior directly via some .Net encoder?

The Windows ANSI encoding (code page 1252) usually works for me,
because (AFAIK) it doesn't apply any transformation to the individual
bytes, i.e., each byte value maps to one ANSI char, for a total of
256 possible chars (control chars included).

Dim E As System.Text.Encoding = _
    System.Text.Encoding.GetEncoding(1252)

Many encodings use two or four bytes to represent a char;
others use a multibyte system where some specific byte values indicate
that the following sequence is a multibyte char.

This is not the case with the ANSI encoding. In ANSI, each byte value
matches a corresponding char. Of course, if the string you're encoding
contains chars outside the ANSI range, such chars will be
misrepresented. Also, if you read a non-ANSI sequence of bytes and
convert it to a string using ANSI, you'll probably get some strange
results.

HTH.

Regards,

Branco.
 