HttpWebRequest: foreign characters are missing from the response stream

  • Thread starter: Mangesh

Mangesh

Hi,
I am using an HttpWebRequest object to download Google results.
Some foreign characters are missing from the response stream.
For example, if I search for "signo de pregunta", all the Spanish characters
are missing from the response stream.
The same search in Internet Explorer shows all the characters.

I am sending all the required headers with the HttpWebRequest.

The following code is used to get the Google results:
-------------------------------------------------------------------
' Set up our web request
objrequest = CType(WebRequest.Create(URL), HttpWebRequest)
objrequest.Accept = "*/*"
objrequest.Headers.Add("Accept-Encoding", "gzip, deflate")
objrequest.Headers.Add("Accept-Language", "en-us")
objrequest.ContentType = "text/html; charset=UTF-8"
objrequest.Timeout = TimeoutSeconds * 1000

' Retrieve data from request
objResponse = CType(objrequest.GetResponse, HttpWebResponse)
'objStreamReceive = objResponse.GetResponseStream

objEncoding = System.Text.Encoding.GetEncoding("utf-8")
objStreamRead = New System.IO.StreamReader(objResponse.GetResponseStream, Text.Encoding.UTF7)

' Set function return value
PageHTML = objStreamRead.ReadToEnd()
-------------------------------------------------------------------
 
To be quite frank, there's a lot that's wrong with your code. But my main
concern is that you're using UTF-7 for decoding the web response. That's
just plain wrong. Use UTF-8 (you're constructing an instance without using
it) or ISO-8859-1.

Cheers,
 
Hi,
Yes, you may find some things wrong in the code, but it is the result
of trying different things to get the correct output.
Some of the headers may not be required, but I haven't removed them.
Please see the code below, which works fine now.
I am using the windows-1252 encoding instead of UTF-8, and that's working
fine.

-----------------------------------------------
' Set up our web request
objrequest = CType(WebRequest.Create(URL), HttpWebRequest)

'headers
objrequest.Accept = "*/*"
objrequest.Headers.Add("Accept-Encoding", "gzip, deflate")
objrequest.Headers.Add("Accept-Language", "en-us")
objrequest.Headers.Add("HTTP_USER_AGENT", "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.0; .NET CLR 1.0.3705; .NET CLR 1.1.4322)")
objrequest.ContentType = "text/html; charset=UTF-8"

'timeout
objrequest.Timeout = TimeoutSeconds * 1000

' Retrieve data from request
objResponse = CType(objrequest.GetResponse, HttpWebResponse)

'use windows encoding
objEncoding = System.Text.Encoding.GetEncoding(1252)

objStreamRead = New System.IO.StreamReader(objResponse.GetResponseStream, objEncoding)

' Set function return value
getPageHTML = objStreamRead.ReadToEnd()
 
Well, you're assuming the response is in UTF-7, which it almost
certainly isn't. You need to find out what the response character set
actually *is*, and use that.
 
Mangesh Paranjape said:
Yes, you may find some things wrong in the code, but it is the result
of trying different things to get the correct output.
Some of the headers may not be required, but I haven't removed them.
Please see the code below, which works fine now.
I am using the windows-1252 encoding instead of UTF-8, and that's working
fine.

I think it's unlikely that that's the correct way to do things though -
web servers really shouldn't be using code page 1252. It's more likely
it's sending back ISO-8859-1. You should use
HttpWebResponse.CharacterSet to find out what the server has told you
the response is in.
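
A minimal sketch of that advice, assuming a hypothetical helper named ResolveEncoding (it is not part of the posted code). It turns whatever charset name the server reported into an Encoding and falls back to ISO-8859-1 when the name is missing or unrecognised. The fallback choice is an assumption, not something the framework does for you.

```vbnet
Imports System
Imports System.Text

Module CharsetHelper
    ' Hypothetical helper: map a charset name (for example the value of
    ' HttpWebResponse.CharacterSet) to an Encoding.
    Function ResolveEncoding(ByVal charsetName As String) As Encoding
        ' Assumed fallback for a missing or unknown charset name.
        Dim fallback As Encoding = Encoding.GetEncoding("ISO-8859-1")

        If charsetName Is Nothing Then Return fallback

        ' Strip whitespace and optional quotes around the name.
        Dim cleaned As String = charsetName.Trim().Trim(""""c)
        If cleaned.Length = 0 Then Return fallback

        Try
            Return Encoding.GetEncoding(cleaned)
        Catch ex As ArgumentException
            ' Name not recognised by this runtime.
            Return fallback
        Catch ex As NotSupportedException
            Return fallback
        End Try
    End Function

    ' Possible use with the variables from the posted code:
    ' objStreamRead = New System.IO.StreamReader( _
    '     objResponse.GetResponseStream(), _
    '     ResolveEncoding(objResponse.CharacterSet))
End Module
```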
 
Hi,
I have already tried this.
The HttpWebResponse.CharacterSet property has a null value.
I also tried HttpWebResponse.ContentEncoding, which is also empty. Any
idea?

But the Google HTML response comes with a header which says
"charset=ISO-8859-1".
I think I should change my encoding from windows-1252 to ISO-8859-1.

Thanks for that,
-Mangesh
 
Mangesh Paranjape said:
I have already tried this.
The HttpWebResponse.CharacterSet property has a null value.
I also tried HttpWebResponse.ContentEncoding, which is also empty. Any
idea?

That sounds very odd.

Mangesh Paranjape said:
But the Google HTML response comes with a header which says
"charset=ISO-8859-1".

Hang on - how are you seeing that? Just from a browser, or what? It
should be present in the response from the web client too.

Mangesh Paranjape said:
I think I should change my encoding from windows-1252 to ISO-8859-1.

That would certainly be a start.
 
> Hang on - how are you seeing that? Just from a browser, or what? It
> should be present in the response from the web client too.

I am sorry, I meant that the response HTML stream contains a <meta> tag
with a charset attribute, shown below:

<meta HTTP-EQUIV="content-type" CONTENT="text/html; charset=ISO-8859-1">

If you do the same search in the browser, the <meta> tag is different,
shown below:

<meta HTTP-EQUIV="content-type" CONTENT="text/html; charset=UTF-8">

Overall, using ISO-8859-1 is the best bet.

Thanks,
-Mangesh.
 
Mangesh Paranjape said:
I am sorry, I meant that the response HTML stream contains a <meta> tag
with a charset attribute, shown below:

<meta HTTP-EQUIV="content-type" CONTENT="text/html; charset=ISO-8859-1">

Ah, right. Shame it doesn't put it in the headers appropriately :(

Mangesh Paranjape said:
If you do the same search in the browser, the <meta> tag is different,
shown below:

<meta HTTP-EQUIV="content-type" CONTENT="text/html; charset=UTF-8">

Overall, using ISO-8859-1 is the best bet.

Well, arguably making the same kind of request that the browser does,
and using UTF-8, would be better than using ISO-8859-1 as it wouldn't
be as restrictive.
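
Whichever charset the server ends up choosing, the <meta> tag quoted in this thread can be used to pick the decoder when the Content-Type header carries no charset. The following is only a sketch under assumptions: FindMetaCharset and DecodePage are hypothetical helpers, the raw response bytes are assumed to have been read into a byte array already, and the regular expression simply takes the first "charset=" it finds, which is cruder than what a browser does.

```vbnet
Imports System
Imports System.Text
Imports System.Text.RegularExpressions

Module MetaCharsetSniffer
    ' Hypothetical helper: return the first charset name found in the
    ' markup, or Nothing if there is none.
    Function FindMetaCharset(ByVal html As String) As String
        Dim m As Match = Regex.Match(html, _
            "charset\s*=\s*[""']?([A-Za-z0-9_\-]+)", RegexOptions.IgnoreCase)
        If m.Success Then Return m.Groups(1).Value
        Return Nothing
    End Function

    ' Hypothetical helper: decode raw response bytes using the charset
    ' declared in the page itself.
    Function DecodePage(ByVal raw As Byte()) As String
        ' First pass as ISO-8859-1: every byte maps to exactly one char,
        ' so the ASCII markup (including the <meta> tag) stays readable.
        Dim latin1 As Encoding = Encoding.GetEncoding("ISO-8859-1")
        Dim firstPass As String = latin1.GetString(raw)

        Dim name As String = FindMetaCharset(firstPass)
        If name Is Nothing Then Return firstPass

        Try
            ' Second pass with the declared charset.
            Return Encoding.GetEncoding(name).GetString(raw)
        Catch ex As ArgumentException
            ' Unknown charset name: keep the first-pass result.
            Return firstPass
        End Try
    End Function
End Module
```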
 
Mangesh said:
Hi,
Yes, you may find some things wrong in the code, but it is the result
of trying different things to get the correct output.
Some of the headers may not be required, but I haven't removed them.
Please see the code below, which works fine now.
I am using the windows-1252 encoding instead of UTF-8, and that's working
fine.

As Jon pointed out, Windows-1252 isn't really a common encoding for HTML
content. HttpWebResponse.ContentEncoding gives you the Content-Encoding
header (gzip, deflate, etc.). HttpWebResponse.CharacterSet parses
the "charset" from the Content-Type header, but it doesn't work for me
either...

Mangesh said:
objrequest.Headers.Add("HTTP_USER_AGENT", "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.0; .NET CLR 1.0.3705; .NET CLR 1.1.4322)")

This header is called "User-Agent".
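
A minimal sketch of setting it, assuming a hypothetical helper named CreateBrowserLikeRequest. It uses the HttpWebRequest.UserAgent property rather than Headers.Add, because as far as I know the .NET Framework treats User-Agent as a restricted header and rejects it when added through the Headers collection. The user-agent string is the one from the posted code.

```vbnet
Imports System
Imports System.Net

Module RequestSetup
    ' Hypothetical helper: build a request that identifies itself the
    ' way the browser in this thread does. No network call is made here.
    Function CreateBrowserLikeRequest(ByVal url As String) As HttpWebRequest
        Dim request As HttpWebRequest = CType(WebRequest.Create(url), HttpWebRequest)

        request.Accept = "*/*"
        request.Headers.Add("Accept-Language", "en-us")

        ' The real header name is "User-Agent"; set it via the property.
        request.UserAgent = "Mozilla/4.0 (compatible; MSIE 6.0; " & _
            "Windows NT 5.0; .NET CLR 1.0.3705; .NET CLR 1.1.4322)"

        Return request
    End Function
End Module
```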

Cheers,
 