Object serialization and NetworkStream - extraneous characters in output

jwallison
TcpClient client = new TcpClient(AddressFamily.InterNetwork);
client.SendTimeout = mSvcConfig.Data.SvcTimeout;   // 1000
client.Connect(mSvcConfig.Data.SvcAddress, mSvcConfig.Data.SvcPort);   // "localhost", 7024
NetworkStream stream = client.GetStream();

XmlSerializer outserializer = new XmlSerializer(typeof(LinkMessage));   // my data object, all string/int data
XmlTextWriter tw = new XmlTextWriter(stream, Encoding.UTF8);

outserializer.Serialize(tw, mMsg);   // ref to my LinkMessage data instance

stream.Flush();
client.Close();



Produces the following output when written via the TcpClient stream (note
extraneous "o;?" at beginning of message):

o;?<?xml version="1.0" encoding="utf-8"?><LinkMessage
xmlns:xsd="http://www.w3.org/2001/XMLSchema"
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
<MessageType>Anchor</MessageType><InnerText>Client Side ImageMap</InnerText>
<Href>http://www.he.net/~seidel/Map/clientmap.html</Href><ImageSrc /></LinkMessage>

but the same code produces the same output, sans garbage, when the XmlTextWriter is based
on a disk file rather than a NetworkStream (i.e., changing only the stream type seems to
result in the spurious "added" output).

If the encoding is changed to Encoding.Unicode, different garbage (¦~) appears prior
to the actual message.
If Encoding.ASCII is used, there is no garbage - but the emitted XML then declares the wrong encoding.

What can I do to eliminate this leading junk at the beginning of my
messages? The Java app that is the target of this socket communication can't
handle it...

TIA

--
Regards,

Jim Allison
(e-mail address removed)
(de-mung by removing '.1')
(e-mail address removed)
 
Hi Jim,

Thanks for your posting. From your description, you're using .NET's
XmlSerializer to serialize a class instance out to a NetworkStream,
and on the other side, when you retrieve the stream and try to read the
XML content out, you find there is an additional header "o;?" at the
beginning of the XML stream, yes?

As for the problem you mentioned, I think it is likely an encoding issue.
A Unicode text stream can begin with a header (a byte order mark) which
indicates the stream's encoding type, and "o;?" is the one for UTF-8; with
other encodings such as UTF-16 you will get a different value (an ASCII
stream has no such header). To verify this, you can open a UTF-8 text file
in UltraEdit and view it in hex format; you'll find the header is composed
of three bytes, 239, 187, 191. These bytes are outside the 7-bit ASCII
range, so if you decode them as a single-byte string they show up as
garbage such as "o;?". For example:

byte[] bytes = {239,187,191};
MessageBox.Show(System.Text.Encoding.ASCII.GetString(bytes));
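
You can also ask each encoding for its header bytes directly via Encoding.GetPreamble();
a small sketch (the hex values in the comments are what these calls return), which also
explains why you saw different garbage with Encoding.Unicode and none with Encoding.ASCII:

byte[] utf8Bom  = System.Text.Encoding.UTF8.GetPreamble();     // {239, 187, 191}
byte[] utf16Bom = System.Text.Encoding.Unicode.GetPreamble();  // {255, 254}
byte[] asciiBom = System.Text.Encoding.ASCII.GetPreamble();    // empty - ASCII has no BOM

Console.WriteLine(BitConverter.ToString(utf8Bom));   // EF-BB-BF
Console.WriteLine(BitConverter.ToString(utf16Bom));  // FF-FE
Console.WriteLine(asciiBom.Length);                  // 0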

So, when you use XmlSerializer to serialize an object into a stream with a
Unicode encoding (UTF-8, for instance), this header is added as the first
three bytes. However, if you read the XML back from the stream using the
correct UTF-8 decoding, you won't see these characters; the bytes are
recognized as the byte order mark rather than as ordinary content. Here is
a simple code snippet to show this:

===============================
byte[] buffer = null;

XmlSerializer serializer = new XmlSerializer(typeof(userInfo));

userInfo ui = new userInfo();
ui.userName = "steven cheng";
ui.age = 20;
ui.email = "(e-mail address removed)";

MemoryStream ms = new MemoryStream();

StreamWriter sw = new StreamWriter(ms, System.Text.Encoding.UTF8);

serializer.Serialize(sw, ui);
sw.Flush(); // make sure the StreamWriter's buffer reaches the MemoryStream

buffer = ms.ToArray(); // ToArray() returns only the written bytes (GetBuffer() would include unused capacity)

// shows the xml with "o;?" because we use ASCII to decode the bytes, which is incorrect
MessageBox.Show(System.Text.Encoding.ASCII.GetString(buffer));

// doesn't show the "o;?" since UTF-8 (the correct encoding) handles the header
MessageBox.Show(System.Text.Encoding.UTF8.GetString(buffer));
==================================
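
As a follow-up sketch (continuing from the snippet above, same assumed userInfo class):
reading the same MemoryStream back through a StreamReader with the matching encoding
consumes the header for us, which is what you'd want on the receiving side:

// Rewind the MemoryStream and read it back through a StreamReader,
// which detects and skips the BOM automatically.
ms.Position = 0;
StreamReader sr = new StreamReader(ms, System.Text.Encoding.UTF8);
string xml = sr.ReadToEnd();
MessageBox.Show(xml); // starts with "<?xml ..." - no leading header characters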

So, if the problem occurs in your Java client that receives this stream, I
suggest you check the Java code to see whether it is reading the stream and
converting the bytes to a string using the correct encoding type (UTF-8). I
suspect that it is using a default ASCII encoding to read the bytes, which
is why the "o;?" comes out.

Please have a look at the above, and if there is anything unclear,
please feel free to post here.
HTH.

Regards,

Steven Cheng
Microsoft Online Support

Get Secure! www.microsoft.com/security
(This posting is provided "AS IS", with no warranties, and confers no
rights.)
 
My .NET socket test client WAS erroneously using Encoding.ASCII (ah, the
joys of midnight testing!); changing that to UTF8 produces the same result
that the Java developer is reporting - a "?" is received at the beginning
of every deserialized message on the socket.

So the "o;" is the encoding information on the packet, but the "?" is
extraneous.

What is the source of the extraneous character, and can it/should it be
eliminated? I seem to recall something like this from the days of DOS - is
it just an artifact of Socket communications in general?
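
(For what it's worth, I suppose I could confirm what's actually arriving by dumping the
first few received bytes in hex before decoding them - a rough, untested sketch; if it's
the UTF-8 byte order mark, the dump should start with EF-BB-BF:)

// Untested sketch: peek at the raw bytes before decoding.
// 'stream' here would be the NetworkStream on the receiving side.
byte[] head = new byte[16];
int n = stream.Read(head, 0, head.Length);
Console.WriteLine(BitConverter.ToString(head, 0, n)); // e.g. "EF-BB-BF-3C-3F-78-6D-6C-..." ("<?xml" after the BOM)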

 
Hi Jwallison,

Thanks for your follow-up. I've just done some tests between .NET and Java:
writing a file in .NET and reading it in Java, and writing in Java and reading
it in .NET (via UTF-8). First, I reproduced the problem you mentioned: when
reading the stream in Java via UTF-8, an additional "?" appears. But if I use
Java to write out a UTF-8 encoded XML file, I can load it correctly in .NET.

I think there is something different about the output between .NET and Java.
I'll do some research and run some further tests to check this, and I'll
update you if I get more info. Also, if you find anything in the meantime,
please feel free to post here. Thanks.

Regards,

Steven Cheng
Microsoft Online Support

Get Secure! www.microsoft.com/security
(This posting is provided "AS IS", with no warranties, and confers no
rights.)
 
It doesn't just happen in Java, it ALSO happens with .NET -

private const int portNum = 7024;

public static int Main(String[] args)
{
    bool done = false;

    IPAddress localAddr = IPAddress.Parse("127.0.0.1");

    TcpListener listener = new TcpListener(localAddr, portNum);

    listener.Start();

    while (!done)
    {
        Console.Write("\nWaiting for connection...");
        TcpClient client = listener.AcceptTcpClient();

        Console.WriteLine("Connection accepted.");
        NetworkStream ns = client.GetStream();

        try
        {
            byte[] bytes = new byte[2048];
            int bytesRead;

            while ((bytesRead = ns.Read(bytes, 0, bytes.Length)) > 0)
                Console.WriteLine(Encoding.UTF8.GetString(bytes, 0, bytesRead));

            ns.Close();
            client.Close();
        }
        catch (Exception e)
        {
            Console.WriteLine(e.ToString());
        }
    }

    listener.Stop();

    return 0;
}

returns output identical to the Java client (with a leading "?").
 
Hi Jwallison,

Thanks for your follow-up. OK, I've just run some tests in a console
application and I see the "?" you mentioned, which I didn't see in a Windows
application (via MessageBox) when using Encoding.UTF8.GetString().

However, as I mentioned in the first message, this is still caused by the
byte order mark (BOM) which is inserted at the start of a stream containing
Unicode text. The "?" is just the BOM of UTF-8; it is a three-byte header,
{239,187,191}.

We can also verify this with:

// get the UTF-8 BOM
byte[] bom = System.Text.Encoding.UTF8.GetPreamble();
Console.WriteLine(System.Text.Encoding.UTF8.GetString(bom));

and you can see the "?" you mentioned.

In .NET, the StreamWriter will by default add such a BOM (for Unicode
encoding types) to the output byte stream, but when we use a StreamReader
(rather than the raw stream such as FileStream or NetworkStream) to read it
back, we won't be affected by the BOM; the StreamReader will automatically
detect it and process it for us. So when we want to retrieve Unicode text
from a stream, we should use a StreamReader (with the correct encoding type)
to wrap the raw stream, for example:
================================
while (!done)
{
    Console.Write("\nWaiting for connection...");
    TcpClient client = listener.AcceptTcpClient();

    Console.WriteLine("Connection accepted.");
    NetworkStream ns = client.GetStream();
    // wrap the raw NetworkStream in a StreamReader so the BOM is handled for us
    StreamReader sr = new StreamReader(ns, System.Text.Encoding.UTF8);
    try
    {
        Console.WriteLine(sr.ReadToEnd());

        sr.Close();   // also closes the underlying NetworkStream
        client.Close();
    }
    catch (Exception e)
    {
        Console.WriteLine(e.ToString());
    }
}
================================

This ensures that the raw stream is read back with the appropriate encoding
and that the BOM is handled for you.

In addition, based on my tests with Java IO, Java's Reader classes (even with
a specific encoding) won't detect such a BOM, and its Writer classes won't
output a BOM either. So the problem will still occur when you use a Java
Reader to read the Unicode text stream output by .NET. From my search, I see
some people manually check the first few bytes for a known BOM when reading a
Unicode text stream in Java, but I haven't found any built-in means like the
StreamReader in .NET.
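
Alternatively, since the Java Reader won't skip the BOM for you, another option is to not
emit it from .NET at all. The UTF8Encoding constructor takes a flag that controls whether
the preamble (BOM) is emitted, so passing an instance created with false to the
XmlTextWriter should produce a stream with no leading header. A rough sketch based on the
code in your first post (not tested here):

// Sketch only - mirrors the sending code from the first post, but uses a
// UTF8Encoding constructed with false so no BOM/preamble is written.
TcpClient client = new TcpClient(AddressFamily.InterNetwork);
client.Connect(mSvcConfig.Data.SvcAddress, mSvcConfig.Data.SvcPort);   // "localhost", 7024
NetworkStream stream = client.GetStream();

XmlSerializer outserializer = new XmlSerializer(typeof(LinkMessage));
XmlTextWriter tw = new XmlTextWriter(stream, new UTF8Encoding(false)); // no byte order mark
outserializer.Serialize(tw, mMsg);

tw.Flush();
client.Close();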

Please have a look at the above, and if there is anything unclear,
please feel free to post here. Thanks.

Regards,

Steven Cheng
Microsoft Online Support

Get Secure! www.microsoft.com/security
(This posting is provided "AS IS", with no warranties, and confers no
rights.)
 