System.Text.Encoding oddities

Guest · Apr 6, 2004

Sorry about the last... Anyway, here's the question:

I've been working on some C# routines to process strings in and out of various encodings. The hope is that I can just let the user type in the encoding they want and I'll do a pretty good job of converting. Basically, I take a string as input, write it to a byte array MemoryStream and then get the bytes of the conversion out.

The oddity is that when I pass a System.Text.UTF8Encoding to my StreamWriter, I don't get the byte order mark in the output, but when I use System.Text.Encoding.GetEncoding("utf-8"), I do. Shouldn't these be the same, or am I missing something basic? Can someone explain why?

Thanks
-mark

Example code:
// Requires: using System.IO; using System.Text;
public byte[] Convert(string input, Encoding enc, out long length)
{
    // Allow for expansion when switching encodings
    MemoryStream outStream = new MemoryStream(input.Length * 3);
    StreamWriter writer = new StreamWriter(outStream, enc);
    writer.Write(input);
    writer.Flush(); // Flush but don't close, so we can get the MemoryStream used count

    // GetBuffer returns the whole backing buffer; only the first 'length' bytes are valid
    byte[] output = outStream.GetBuffer();
    length = outStream.Length;
    return output;
}

long length;
byte[] test = Convert("test", new System.Text.UTF8Encoding(), out length); // no byte order mark
test = Convert("test", System.Text.Encoding.GetEncoding("utf-8"), out length); // byte order mark

Jon Skeet [C# MVP] · Apr 6, 2004

Mark said:
Sorry about the last... Anyway, here's the question:

I've been working on some C# routines to process strings in and out
of various encodings. The hope is that I can just let the user type
in the encoding they want and I'll do a pretty good job of
converting. Basically, I take a string as input, write it to a byte
array MemoryStream and then get the bytes of the conversion out.

The oddity in my question is that when I use System.Text.UTF8Encoding
as an argument to my StreamWriter, I don't get the byteorder mark in
the output, but when I use System.Text.Encoding.GetEncoding
("utf-8"); I do. Shouldn't these be the same, or am I missing
something basic? Seems odd. Can someone explain why?

Well, it's basically not specified whether Encoding.UTF8 gives an
encoding that emits a byte order mark (BOM), and the same goes for
whatever Encoding.GetEncoding returns.
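
Rather than guessing, you can ask a given Encoding instance what it will emit: StreamWriter writes whatever Encoding.GetPreamble() returns at the start of a fresh stream, so an empty preamble means no BOM. A minimal sketch; the class and method names (PreambleCheck, EmitsBom, PreambleHex) are made up for illustration.

```csharp
using System;
using System.Text;

public static class PreambleCheck
{
    // True if this Encoding instance reports a non-empty preamble (BOM).
    public static bool EmitsBom(Encoding enc)
    {
        return enc.GetPreamble().Length > 0;
    }

    // The preamble bytes as hex, for printing while debugging.
    public static string PreambleHex(Encoding enc)
    {
        return BitConverter.ToString(enc.GetPreamble());
    }
}
```

For example, PreambleCheck.EmitsBom(new UTF8Encoding()) is false, because the parameterless constructor produces an encoding with no preamble. That would explain the StreamWriter output in the question, assuming that is how the instance was created.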

If you want to make absolutely sure, you need to construct the
UTF8Encoding yourself, specifying whether or not you want a BOM as a
parameter.
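
A sketch of what constructing the encoding yourself might look like, keeping the StreamWriter approach from the question. ExplicitUtf8 and Write are hypothetical names; the UTF8Encoding(bool) constructor is the part that decides whether the BOM appears.

```csharp
using System.IO;
using System.Text;

public static class ExplicitUtf8
{
    // Writes text as UTF-8, with or without a leading BOM (EF BB BF).
    public static byte[] Write(string text, bool withBom)
    {
        // The constructor argument controls what GetPreamble() returns.
        Encoding enc = new UTF8Encoding(withBom);
        MemoryStream stream = new MemoryStream();
        using (StreamWriter writer = new StreamWriter(stream, enc))
        {
            writer.Write(text);
        }
        // ToArray copies only the bytes written and still works after the stream is closed.
        return stream.ToArray();
    }
}
```

Using ToArray() instead of GetBuffer() also avoids carrying a separate length around, at the cost of one copy.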

However, there's a much easier way of doing conversion than creating a
StreamWriter - just call Encoding.GetBytes(string). That will never
contain a BOM.
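
As a rough illustration of the GetBytes route: the preamble is something GetPreamble() reports separately, so GetBytes returns only the encoded characters even for an encoding constructed with the BOM flag set. DirectConvert and its methods are hypothetical names; the second helper shows one way to add the preamble back when a BOM is actually wanted.

```csharp
using System;
using System.Text;

public static class DirectConvert
{
    // Encoded characters only; no BOM regardless of how enc was constructed.
    public static byte[] ToBytes(string text, Encoding enc)
    {
        return enc.GetBytes(text);
    }

    // Same bytes, prefixed with whatever preamble the encoding reports (possibly none).
    public static byte[] ToBytesWithPreamble(string text, Encoding enc)
    {
        byte[] preamble = enc.GetPreamble();
        byte[] body = enc.GetBytes(text);
        byte[] result = new byte[preamble.Length + body.Length];
        Buffer.BlockCopy(preamble, 0, result, 0, preamble.Length);
        Buffer.BlockCopy(body, 0, result, preamble.Length, body.Length);
        return result;
    }
}
```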

Guest · Apr 6, 2004

Thanks..

Encoding.GetBytes() and Encoding.GetString () worked much better than my clunky approach - and the BOM (or lack thereof) is consistent...
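
For anyone finding this later, roughly what the simpler version looks like when the encoding comes from a user-typed name. This is a sketch, not the exact code; NamedEncodingConvert, Encode and Decode are names invented here.

```csharp
using System.Text;

public static class NamedEncodingConvert
{
    // Encode text using an encoding looked up by name, e.g. "utf-8" or "utf-16".
    // GetEncoding throws ArgumentException if the name is not recognised.
    public static byte[] Encode(string text, string encodingName)
    {
        return Encoding.GetEncoding(encodingName).GetBytes(text);
    }

    // Decode bytes back into a string using the named encoding.
    public static string Decode(byte[] bytes, string encodingName)
    {
        return Encoding.GetEncoding(encodingName).GetString(bytes);
    }
}
```

One thing to watch: GetString does not strip a BOM, so bytes that start with one decode to a string with a leading U+FEFF character.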

That's a great help

-mark
