WebClient and Encoding

  • Thread starter: MaxMax

MaxMax

Is it possible to tell WebClient to use an "automatic" encoding when
doing DownloadString? The encoding of the response is written in the
header, so WebClient should be able to detect it, but I wasn't able to
find the option. I can only use a fixed Encoding (UTF8 for example) and hope
the site uses it.

--- bye
 
Hello MaxMax,

See HttpWebResponse.CharacterSet and HttpWebResponse.ContentEncoding

---
WBR, Michael Nemtsev [.NET/C# MVP].
My blog: http://spaces.live.com/laflour
Team blog: http://devkids.blogspot.com/

"The greatest danger for most of us is not that our aim is too high and we
miss it, but that it is too low and we reach it" (c) Michelangelo

 

In the "classic" DownloadString example from MSDN:

{
    WebClient client = new WebClient();
    string reply = client.DownloadString(address);

    Console.WriteLine(reply);
}

I can't inspect the response before I make the request... and if I inspect it
afterwards it's useless: DownloadString has already decoded the stream into a
string (possibly using the wrong code page).

--- bye
 

WebClient.DownloadString uses the encoding specified on the WebClient object when it converts the downloaded data to a string. If you know the encoding in advance, you can set WebClient.Encoding to the proper encoding; otherwise it will use Encoding.Default, which is the code page used by your operating system.

If you don't know the encoding in advance, you should probably take a closer look at the HttpWebRequest/HttpWebResponse classes. The trick is to download the data as a byte[] and then use the information provided by the headers to convert it to a string with the proper encoding.
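
A minimal sketch of the byte[] approach using WebClient itself, assuming the server sends a usable charset in its Content-Type header. The names CharsetAwareDownloader and DecodeWithCharset are made up for illustration, and UTF-8 as the fallback is an assumption rather than anything WebClient does on its own.

```csharp
using System;
using System.Net;
using System.Net.Mime;
using System.Text;

// Hypothetical helper, not part of the framework.
static class CharsetAwareDownloader
{
    // Decodes raw bytes with the charset named in a Content-Type header value.
    // Uses the supplied fallback when the header is missing, malformed,
    // or names an encoding this runtime does not know.
    public static string DecodeWithCharset(byte[] data, string contentType, Encoding fallback)
    {
        Encoding encoding = fallback;
        if (!string.IsNullOrEmpty(contentType))
        {
            try
            {
                string charset = new ContentType(contentType).CharSet;
                if (!string.IsNullOrEmpty(charset))
                    encoding = Encoding.GetEncoding(charset);
            }
            catch (FormatException) { /* malformed Content-Type: keep fallback */ }
            catch (ArgumentException) { /* unknown charset name: keep fallback */ }
        }
        return encoding.GetString(data);
    }

    // Downloads the raw bytes first, then decodes them using the response headers.
    public static string DownloadString(string address)
    {
        using (WebClient client = new WebClient())
        {
            byte[] data = client.DownloadData(address);
            string contentType = client.ResponseHeaders[HttpResponseHeader.ContentType];
            return DecodeWithCharset(data, contentType, Encoding.UTF8); // assumed fallback
        }
    }
}
```

Calling CharsetAwareDownloader.DownloadString(address) in place of client.DownloadString(address) keeps the decoding step under your control. It does not look inside the page for a charset declaration, so it only helps when the header carries one.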
 
WebClient internally uses a WebRequest to do the downloading, and it will
look in the response's ContentType for a "charset" value to use as the encoding.
If the ContentType/charset value doesn't exist or names an invalid
charset, WebClient.Encoding is used (which is Encoding.Default unless
you assign it beforehand). Be aware, however, that
WebClient.Encoding is only a fallback: if the response contains a valid
encoding, that one is always used to decode the returned data.

For an HttpWebRequest, the ContentType comes from the HttpWebResponse. You can
use Fiddler (http://www.fiddlertool.com/) to trace the HTTP headers and
see whether WebClient used the correct Encoding to return the string.


Regards,
Walter Wang ([email protected], remove 'online.')
Microsoft Online Community Support

==================================================
When responding to posts, please "Reply to Group" via your newsreader so
that others may learn and benefit from your issue.
==================================================

This posting is provided "AS IS" with no warranties, and confers no rights.
 
I'm pretty sure that isn't so. If I set Encoding to (for example) UTF32, the
WebClient throws an exception. And if I have a page with a UTF-8 character
(a page that the WebRequest correctly reports as UTF-8) and I don't
set the Encoding, I receive a wrong string.
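
The wrong string described above can be reproduced without any network access. The sketch below is only an illustration: the sample character and the choice of ISO-8859-1 as the mismatched code page are assumptions, not necessarily what WebClient picks on a given machine, and MojibakeDemo is a made-up name.

```csharp
using System;
using System.Text;

// Hypothetical demo class, for illustration only.
static class MojibakeDemo
{
    // Encodes text as UTF-8, then decodes those bytes with the named encoding.
    public static string DecodeUtf8BytesAs(string text, string encodingName)
    {
        byte[] utf8Bytes = Encoding.UTF8.GetBytes(text);
        return Encoding.GetEncoding(encodingName).GetString(utf8Bytes);
    }

    static void Main()
    {
        // U+00E8 (e with grave accent) is the two bytes 0xC3 0xA8 in UTF-8.
        string right = DecodeUtf8BytesAs("\u00E8", "utf-8");      // "\u00E8", one character
        string wrong = DecodeUtf8BytesAs("\u00E8", "iso-8859-1"); // "\u00C3\u00A8", two characters

        Console.WriteLine(right.Length); // 1
        Console.WriteLine(wrong.Length); // 2
    }
}
```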

--- bye
 

Try this code. It attempts to get the character set in various ways and falls back to UTF-8. Checking ContentEncoding may not be necessary, as I have yet to see it specified. The code is a bit of a cut-and-paste job, and you may have to tweak it to get it running.

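// Needs: using System; using System.IO; using System.Net; using System.Text;
// plus using System.Windows.Forms; for MessageBox.
// Status and CurLength are members of the class this was cut from
// (progress reporting); declare them yourself or drop those lines.

public string DownloadPage(string url)
{
    HttpWebRequest req = (HttpWebRequest)WebRequest.Create(url);

    using (HttpWebResponse resp = (HttpWebResponse)req.GetResponse())
    {
        byte[] buffer;
        using (Stream s = resp.GetResponseStream())
        {
            buffer = ReadStream(s);
        }

        // Look for a charset name in the headers. ContentEncoding is tried
        // last because it normally carries values such as "gzip".
        string pageEncoding = "";
        Encoding e = Encoding.UTF8;
        if (!string.IsNullOrEmpty(resp.CharacterSet))
            pageEncoding = resp.CharacterSet;
        else if (!string.IsNullOrEmpty(resp.ContentType))
            pageEncoding = GetCharacterSet(resp.ContentType);
        else if (!string.IsNullOrEmpty(resp.ContentEncoding))
            pageEncoding = resp.ContentEncoding;

        // Nothing usable in the headers: look for a charset declaration
        // (for example in a <meta> tag) in the page itself.
        if (pageEncoding == "")
            pageEncoding = GetCharacterSet(buffer);

        if (pageEncoding != "")
        {
            try
            {
                e = Encoding.GetEncoding(pageEncoding);
            }
            catch (ArgumentException)
            {
                // Unknown encoding name: keep the UTF-8 fallback.
                MessageBox.Show("Invalid encoding: " + pageEncoding);
            }
        }

        string data = e.GetString(buffer);

        Status = "";

        return data;
    }
}

// Returns the value following the last "charset=" in s, or "" if there is none.
private string GetCharacterSet(string s)
{
    s = s.ToUpperInvariant();
    int start = s.LastIndexOf("CHARSET");
    if (start == -1)
        return "";

    start = s.IndexOf("=", start);
    if (start == -1)
        return "";

    start++;
    // Skip an opening quote so that charset="utf-8" is handled as well.
    s = s.Substring(start).Trim().TrimStart('"', '\'');
    int end = s.Length;

    // The value ends at the first ';', quote, or '/'.
    int i = s.IndexOf(";");
    if (i != -1)
        end = i;
    i = s.IndexOf("\"");
    if (i != -1 && i < end)
        end = i;
    i = s.IndexOf("'");
    if (i != -1 && i < end)
        end = i;
    i = s.IndexOf("/");
    if (i != -1 && i < end)
        end = i;

    return s.Substring(0, end).Trim();
}

// Searches the raw page for a charset declaration. The declaration itself is
// ASCII, so decoding with the system code page is good enough to find it.
private string GetCharacterSet(byte[] data)
{
    string s = Encoding.Default.GetString(data);
    return GetCharacterSet(s);
}

// Reads the whole stream into memory. I/O errors propagate to the caller
// instead of being turned into a null buffer.
private byte[] ReadStream(Stream s)
{
    byte[] buffer = new byte[8096];
    using (MemoryStream ms = new MemoryStream())
    {
        while (true)
        {
            int read = s.Read(buffer, 0, buffer.Length);
            if (read <= 0)
            {
                CurLength = 0;
                return ms.ToArray();
            }
            ms.Write(buffer, 0, read);
            CurLength = ms.Length;
        }
    }
}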
 
Hi MaxMax,

I've done some tests, and it seems my previous comment isn't correct. Sorry
about that.

Please use Morten's posted code to detect the encoding and read the text
correctly.

I will raise this question on our internal discussion list to see if
this is a known issue.

Regards,
Walter Wang ([email protected], remove 'online.')
Microsoft Online Community Support

 
We have confirmed this is an issue in WebClient. I've filed an internal bug
for it.

Thanks for the feedback!

Regards,
Walter Wang ([email protected], remove 'online.')
Microsoft Online Community Support

 