XmlDocument.LoadXml(): problem with UTF-8 XML

Hi,
I'm using WebClient to download an XML file from a remote server. I then save
the XML into a string.
The problem occurs when I call XmlDocument.LoadXml() on that string. I get the
following exception:
"System.Xml.XmlException was unhandled
Message="Data at the root level is invalid. Line 1, position 1."

If I save the XML file (on the remote server) as an ASCII file, there is no
problem. For some reason, LoadXml() cannot handle the UTF-8 file
format! Of course, my XML is declared as encoding="utf-8".
This is my code:

string link = "http://myServer/Test.xml";
WebClient client = new WebClient();
client.Encoding = System.Text.Encoding.UTF8;
string test = client.DownloadString(link);
client.Dispose();
XmlDocument testXML = new XmlDocument();
testXML.LoadXml(test); // <-- Here i get the exception

I'd appreciate your help.
Thanks
 
barbutz said:
I'm using WebClient to download a XML file from a remote server. I then save
the xml into a string.
The problem is when i use XmlDocument LoadXml() on that string. I get the
following exception:
"System.Xml.XmlException was unhandled
Message="Data at the root level is invalid. Line 1, position 1."

If i save the xml file (on the remote server) as ASCII file then there is no
problem. For some reason, the LoadXml() function cannot handle the utf8 file
format! Of course my xml is declared as encoding="utf-8".

It may be *declared* as UTF-8, but is it *actually* UTF-8?

Could you post a short but complete program which demonstrates the
problem?

See http://www.pobox.com/~skeet/csharp/complete.html for details of
what I mean by that.

You shouldn't need to have any loading involved, by the way - just a
piece of code which uses a string literal should be sufficient. My
guess is that while trying to reproduce it, you'll find that the string
you've got from the WebClient isn't what you think it is.
 
Did you try writing out or logging the contents of "test" to see what it
looked like before passing it to LoadXml()? That should show you
why you got the XmlException.

Thi
 
OK, I debugged my program and found the following:
The XML file is saved in UTF-8 format with a byte order mark (BOM), so its
first 3 bytes are EF BB BF.
When I read it with WebClient into a string, the string contains the whole
XML data INCLUDING the BOM, and LoadXml() throws the exception because of it.
If I remove those bytes from the file with a hex editor, there is no problem,
but I don't think that is a good solution because, as far as I understand,
it turns my file into plain ASCII.
How can I solve this issue without touching the XML file?
 
I suggest you try as follows:

string link = "http://myServer/Test.xml";
WebClient client = new WebClient();
byte[] theBytes = client.DownloadData(link);
string test = Encoding.UTF8.GetString(theBytes);
client.Dispose();
XmlDocument testXML = new XmlDocument();
testXML.LoadXml(test);

I haven't tested it yet, but I hope it helps,
Thi
 
barbutz said:
Hi,
I'm using WebClient to download a XML file from a remote server. I
then save the xml into a string.
The problem is when i use XmlDocument LoadXml() on that string. I get
the following exception:
"System.Xml.XmlException was unhandled
Message="Data at the root level is invalid. Line 1, position 1."

If i save the xml file (on the remote server) as ASCII file then
there is no problem. For some reason, the LoadXml() function cannot
handle the utf8 file format! Of course my xml is declared as
encoding="utf-8". This is my code:

string link = "http://myServer/Test.xml";
WebClient client = new WebClient();
client.Encoding = System.Text.Encoding.UTF8;
string test = client.DownloadString(link);
client.Dispose();
XmlDocument testXML = new XmlDocument();
testXML.LoadXml(test); // <-- Here i get the exception

What happens is that Encoding.UTF8.GetString() doesn't strip away the
BOM if one exists. I'm not sure whether that's by design -- to me it's
rather a bug.

You have two choices: strip away the BOM yourself, or use
XmlDocument.Load() to read the XML content directly from a URL.
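
For the first option, here is a minimal sketch, assuming the only problem is a single leading U+FEFF character in the downloaded string. StripBom is a hypothetical helper name, not a framework method, and the string literal stands in for whatever WebClient returned.

```csharp
using System;
using System.Xml;

static class BomExample
{
    // Hypothetical helper: removes any leading U+FEFF (the BOM as a char).
    // TrimStart compares chars ordinally, which avoids culture-sensitive
    // string comparisons that treat U+FEFF as ignorable.
    public static string StripBom(string s)
    {
        return s.TrimStart('\uFEFF');
    }

    static void Main()
    {
        // Stand-in for the string returned by WebClient.DownloadString().
        string test = "\uFEFF<?xml version='1.0' encoding='utf-8'?><hello/>";

        XmlDocument testXML = new XmlDocument();
        testXML.LoadXml(StripBom(test));

        Console.WriteLine(testXML.DocumentElement.Name); // prints "hello"
    }
}
```

The XML file on the server stays untouched; only the in-memory string changes.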

Cheers,
 
Joerg Jooss said:
What happens is that Encoding.UTF8.GetString() doesn't strip away the
BOM if one exists. I'm not sure whether that's by design -- to me it's
rather a bug.

You have two choices: strip away the BOM yourself, or use
XmlDocument.Load() to read the XML content directly from a URL.

Hmm. It feels as much like a bug in the XmlDocument.LoadXml() call as
anywhere else. Certainly if this were presented as *binary* data it
should be okay - the XML specification mentions BOMs specifically.

For anyone who's interested, here's a short but complete program
demonstrating it:

using System;
using System.Xml;

class Test
{
    static void Main()
    {
        try
        {
            // The string starts with U+FEFF, the BOM as a character.
            string x = "\ufeff<?xml version='1.0'?><hello/>";

            XmlDocument doc = new XmlDocument();
            doc.LoadXml(x); // throws XmlException
        }
        catch (Exception e)
        {
            Console.WriteLine (e);
        }
    }
}
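
For comparison, here is a rough sketch of the binary case mentioned above: the same document as bytes with the UTF-8 BOM in front, handed to XmlDocument.Load(Stream). LoadFromBytes is a hypothetical helper name used only for this illustration.

```csharp
using System;
using System.IO;
using System.Text;
using System.Xml;

static class BinaryLoadExample
{
    // Hypothetical helper: parses XML from raw bytes via a stream,
    // so the XML reader (not string decoding) deals with the BOM.
    public static XmlDocument LoadFromBytes(byte[] data)
    {
        XmlDocument doc = new XmlDocument();
        using (MemoryStream stream = new MemoryStream(data))
        {
            doc.Load(stream);
        }
        return doc;
    }

    static void Main()
    {
        byte[] bom = Encoding.UTF8.GetPreamble(); // EF BB BF
        byte[] body = Encoding.UTF8.GetBytes(
            "<?xml version='1.0' encoding='utf-8'?><hello/>");

        // Concatenate BOM + document, as a BOM-prefixed file would be laid out.
        byte[] data = new byte[bom.Length + body.Length];
        bom.CopyTo(data, 0);
        body.CopyTo(data, bom.Length);

        XmlDocument doc = LoadFromBytes(data);
        Console.WriteLine(doc.DocumentElement.Name); // prints "hello"
    }
}
```

With WebClient, the bytes from DownloadData() could be passed to such a helper instead of being decoded to a string first.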
 
Thanks for all of your replies.
The second option sounds good, but how can I read the XML file using
XmlDocument.Load()? The XML file sits on a remote web server, not locally,
which is why I used WebClient in the first place. Is there a way to use
XmlDocument.Load() to load an XML file that is located on a remote HTTP
server?

Thanks!
 
barbutz said:
Thanks for all of your replies.
The second option sounds good, but how can i read the xml file using
XmlDocument.Load() ? The xml file is sitting in a remote web server
not local - that's why i used WebClient in the first place. Is there
a way to use XmlDocument.Load() to load an xml file that is located
in a remote http server?

Yes. Actually, the string parameter in Load(fileName) is documented as
follows:

"URL for the file containing the XML document to load."

That means you should be able to pass any valid URL. If it starts with
the HTTP scheme, the file is downloaded from the Web. Unless you need
control over the HTTP communication, that's the easiest way to load an
XML file.
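
A minimal sketch of that approach follows. LoadFromUrl is a hypothetical wrapper name; to keep the sample runnable without a server, it is exercised here with a file: URL pointing at a temporary BOM-prefixed file. For the case in this thread you would pass "http://myServer/Test.xml" instead.

```csharp
using System;
using System.IO;
using System.Text;
using System.Xml;

static class UrlLoadExample
{
    // Hypothetical wrapper: XmlDocument.Load(string) accepts a URL,
    // so no WebClient or intermediate string is needed.
    public static XmlDocument LoadFromUrl(string url)
    {
        XmlDocument doc = new XmlDocument();
        doc.Load(url);
        return doc;
    }

    static void Main()
    {
        string path = Path.GetTempFileName();
        try
        {
            // UTF8Encoding(true) writes the EF BB BF BOM at the start of the file.
            File.WriteAllText(path,
                "<?xml version='1.0' encoding='utf-8'?><hello/>",
                new UTF8Encoding(true));

            string url = new Uri(path).AbsoluteUri; // file: URL for the temp file
            XmlDocument doc = LoadFromUrl(url);
            Console.WriteLine(doc.DocumentElement.Name); // prints "hello"
        }
        finally
        {
            File.Delete(path);
        }
    }
}
```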

Cheers,
 