System.Text.Encoding oddities

  • Thread starter: Guest
Sorry about the last... Anyway, here's the question:

I've been working on some C# routines to convert strings to and from various encodings. The hope is that I can just let the user type in the encoding they want and I'll do a pretty good job of converting. Basically, I take a string as input, write it to a MemoryStream through a StreamWriter, and then read the converted bytes back out of the stream's buffer.

The oddity is that when I pass a System.Text.UTF8Encoding to my StreamWriter, I don't get the byte order mark in the output, but when I pass System.Text.Encoding.GetEncoding("utf-8"), I do. Shouldn't these be the same, or am I missing something basic? Can someone explain why?

Thanks
-mark


Example code:
using System.IO;
using System.Text;

public byte[] Convert(string input, Encoding enc, out int length)
{
    // Allow room for expansion when switching encodings
    MemoryStream outStream = new MemoryStream(input.Length * 3);
    StreamWriter writer = new StreamWriter(outStream, enc);
    writer.Write(input);
    writer.Flush(); // Flush but don't close, so we can get the MemoryStream used count

    // GetBuffer returns the whole backing array; only the first `length` bytes are valid
    byte[] output = outStream.GetBuffer();
    length = (int)outStream.Length;
    return output;
}

int length;
byte[] test = Convert("test", new UTF8Encoding(), out length); // no byte order mark
test = Convert("test", Encoding.GetEncoding("utf-8"), out length); // byte order mark
 
Mark said:
The oddity is that when I pass a System.Text.UTF8Encoding to my
StreamWriter, I don't get the byte order mark in the output, but when
I pass System.Text.Encoding.GetEncoding("utf-8"), I do. Shouldn't
these be the same, or am I missing something basic? Can someone
explain why?

Well, it's basically not specified whether Encoding.UTF8 or
Encoding.GetEncoding("utf-8") gives you an encoding that emits a byte
order mark (BOM), so you shouldn't rely on either behaviour.

If you want to make absolutely sure, you need to construct the
UTF8Encoding yourself, specifying whether or not you want a BOM as a
parameter.
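
As a rough sketch of that suggestion: the UTF8Encoding constructor takes a bool that controls whether the encoding reports a BOM as its preamble, and StreamWriter writes that preamble at the start of the stream. The class and method names here (BomDemo, WriteWith) are made up for illustration and are not from the original post.

```csharp
using System;
using System.IO;
using System.Text;

public static class BomDemo
{
    // Hypothetical helper: encodes text through a StreamWriter and
    // returns exactly the bytes that were written to the stream.
    public static byte[] WriteWith(Encoding enc, string text)
    {
        using (MemoryStream stream = new MemoryStream())
        {
            using (StreamWriter writer = new StreamWriter(stream, enc))
            {
                writer.Write(text);
            } // disposing the writer flushes it and closes the stream

            // ToArray copies only the bytes written, and still works
            // after the MemoryStream has been closed.
            return stream.ToArray();
        }
    }

    public static void Main()
    {
        // true: the encoding emits the UTF-8 identifier (BOM) as its preamble
        byte[] withBom = WriteWith(new UTF8Encoding(true), "test");
        // false: no preamble is emitted
        byte[] withoutBom = WriteWith(new UTF8Encoding(false), "test");

        Console.WriteLine(BitConverter.ToString(withBom));    // EF-BB-BF-74-65-73-74
        Console.WriteLine(BitConverter.ToString(withoutBom)); // 74-65-73-74
    }
}
```

Using ToArray rather than GetBuffer also avoids having to pass the used length back separately, at the cost of one extra copy.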

However, there's a much easier way of doing conversion than creating a
StreamWriter - just call Encoding.GetBytes(string). That will never
contain a BOM.
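
A minimal sketch of that direct approach, assuming the encoding name comes from user input as in the original question. The names DirectConversion, Encode and Decode are invented for this example; Encoding.GetEncoding, GetBytes, GetString and GetPreamble are the real framework calls.

```csharp
using System;
using System.Text;

public static class DirectConversion
{
    // Hypothetical helper: look up the encoding by the name the user
    // typed and encode the string directly, with no stream involved.
    public static byte[] Encode(string text, string encodingName)
    {
        // GetEncoding throws ArgumentException for an unknown name
        Encoding enc = Encoding.GetEncoding(encodingName);
        return enc.GetBytes(text);
    }

    // Hypothetical helper: the reverse conversion.
    public static string Decode(byte[] bytes, string encodingName)
    {
        return Encoding.GetEncoding(encodingName).GetString(bytes);
    }

    public static void Main()
    {
        byte[] utf8 = Encode("test", "utf-8");
        byte[] utf16 = Encode("test", "utf-16");

        Console.WriteLine(utf8.Length);           // 4
        Console.WriteLine(utf16.Length);          // 8
        Console.WriteLine(Decode(utf8, "utf-8")); // test
    }
}
```

If a BOM is wanted in the output, one option is to write the bytes from enc.GetPreamble() yourself before the encoded data, which keeps the decision explicit instead of depending on which Encoding instance you were handed.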
 
Thanks..

Encoding.GetBytes() and Encoding.GetString() worked much better than my clunky StreamWriter approach, and the BOM behaviour (or lack of a BOM) is now consistent.

That's a great help

-mar
 