HTMLEncode: low surrogate char Error

  • Thread starter: Alexander Higgins

Alexander Higgins

Thanks for the response....
But that's the point - using arbitrary binary data as if it were
real text data is *wrong*. The data is effectively "dodgy" - just as
if you'd tried to edit a jpeg as if it were a text file.

Point taken, but this is not the case here. If a person writes a text
file on her or his computer and does not save it as Unicode, the
current code page is used. If this file is given to someone with a
different current code page, the file is not displayed correctly. Simply
converting the file to Unicode will make the data display properly.
During encoding, the encoder will escape characters it considers
incorrect instead of attempting to interpret them. During the
encode/decode process you may see conversions like Ãœ = Ü, â„¢ =
™, à = Á, Ω = Ω, etc. Eventually, characters that are not valid UTF
but are part of the default Windows code page will throw an
error. By specifying the System.Text.Encoding as part of the
StreamWriter, you will avoid throwing the exception.
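
The Ãœ = Ü style of conversion described above can be reproduced in a few lines. This is only a sketch: the class name is made up for illustration, and it assumes code page 1252 is available (true on the .NET Framework; on .NET Core and later the code-page provider has to be registered first, as the comment notes).

```csharp
using System;
using System.Text;

class MojibakeDemo
{
    static void Main()
    {
        // Assumption: on .NET Core / .NET 5+ uncomment the next line first
        // (it needs the System.Text.Encoding.CodePages package on older SDKs).
        // Encoding.RegisterProvider(CodePagesEncodingProvider.Instance);
        Encoding cp1252 = Encoding.GetEncoding(1252);

        string original = "\u00DC";                           // U with diaeresis
        byte[] utf8Bytes = Encoding.UTF8.GetBytes(original);  // bytes C3 9C

        // Decoding UTF-8 bytes with the wrong code page yields two characters.
        string garbled = cp1252.GetString(utf8Bytes);

        // Decoding the same bytes with the right encoding restores the text.
        string restored = Encoding.UTF8.GetString(utf8Bytes);

        Console.WriteLine(garbled == "\u00C3\u0153");  // True: A-tilde, oe ligature
        Console.WriteLine(restored == original);       // True
    }
}
```

The bytes never change here; only the decoder does, which is why the same file can look fine on one machine and garbled on another.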

Additionally, that data could also be URL-encoded, e.g. %20 = space. The
percent sign indicates that the next two digits are the hexadecimal code
of the character, so %20 is Chr(32). Injection hackers will use %00 for
null injection attacks, or %0D%0A for Chr(13) & Chr(10), etc.
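
As a quick check of how percent escapes map to character codes, here is a small sketch using Uri.UnescapeDataString. The class name is hypothetical. Note that the two digits after the percent sign are hexadecimal, so a carriage return plus line feed is written %0D%0A.

```csharp
using System;

class PercentDecodeDemo
{
    static void Main()
    {
        // The two digits after '%' are hexadecimal: 0x20 is 32, a space.
        int spaceCode = Convert.ToInt32("20", 16);
        Console.WriteLine(spaceCode);                 // 32

        string space = Uri.UnescapeDataString("%20");
        Console.WriteLine(space == " ");              // True

        // CR LF is character 13 followed by character 10, i.e. 0D and 0A in hex.
        string crlf = Uri.UnescapeDataString("%0D%0A");
        Console.WriteLine(crlf == "\r\n");            // True
    }
}
```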

Considering all of the above, there are plenty of cases where you will
have data that is clean but is represented by different characters in
different encodings. Different operating systems have different new
line definitions. While Windows uses CRLF (Carriage Return plus Line
Feed), UNIX uses only LF. Additionally, you may see some encoders
convert <BR> to line feeds and vice versa.

To reproduce this issue....

Copy this into a text file in a Visual Studio Project and save it as
"Read_Me.txt."

==========Begin Read_Me.txt

1) Create New Web project and copy the entire contents of this folder
into the projects root folder. Select yes to all prompts.

2) Browse to the Cms Folder, Right click and choose Exclude from
Project. Right Click The solution and choose "Add existing Project".
Browse to the Cms Folder and Choose CMS.vbproj, then add a reference
to the CMS Project to your Web Project.

3) Add a reference to the freeTextBox.dll in the /framework1.1 folder.

4) Browse to /admin/install.aspx, right click and choose view in
browser. Follow the set up instructions.


============end Read_Me.txt

Now right click the file and choose Properties, then select Build
Action and choose Embedded Resource. Create a new class named
Resources.vb and add this code.

Imports System.IO
Imports System.Reflection

Public Class Resources

    Dim _textStreamReader As StreamReader
    Dim _assembly As [Assembly]

    Sub New()
    End Sub

    ' Returns the full text of an embedded resource such as "Read_Me.txt".
    Function GetResource(ByVal ResourceName As String) As String

        _assembly = [Assembly].GetExecutingAssembly()
        If _assembly Is Nothing Then
            Throw New Exception("assembly is nothing")
        End If

        ' "AssemblyName." is a placeholder for the project's root namespace.
        Dim stream As IO.Stream = _
            _assembly.GetManifestResourceStream("AssemblyName." & ResourceName)

        If stream Is Nothing Then
            Throw New Exception("stream is nothing")
        End If

        ' No encoding is given, so StreamReader decodes the bytes as UTF-8.
        _textStreamReader = New StreamReader(stream)
        Return Me._textStreamReader.ReadToEnd
    End Function

End Class

Now open a web page and, in the Page_Load sub, add the following code:

Dim resources As New Resources
Dim code As String

Try
    code = resources.GetResource(ResourceName)
Catch ex As Exception
    log("Resource : " & ResourceName & " is nothing", LogFile)
End Try

If Not code Is Nothing Then
    Dim Sw As New IO.StreamWriter(FileName, False)
    Sw.Write(code)
    Sw.Close()
End If

When you execute this code, the surrogate error is thrown. Why? Because
the text file was embedded using the Windows code page. The fix:

If Not code Is Nothing Then
    Dim Sw As New IO.StreamWriter(FileName, False, _
        System.Text.Encoding.GetEncoding(1252))
    Sw.Write(code)
    Sw.Close()
End If

Clearly you'll see the data is written to the text file in its
original format, with no funky characters and no data corruption.

Hope this helps give you a better understanding of the process.



Alex Higgins
http://alexanderhiggins.com




--------------------------------------------------------------------------------


Subject: Re: HTMLEncode: low surrogate char Error?
Date: Fri, 27 Jul 2007 19:03:52 +0100
alex higgins wrote:

Thanks for the response....
Right.


But that's the point - using arbitrary binary data as if it were
real text data is *wrong*. The data is effectively "dodgy" - just as
if you'd tried to edit a jpeg as if it were a text file.
Any time that you've read in text data with the wrong encoding, your
string has the wrong data in it, and therefore the data is dodgy.

Do you see what I mean?

Jon

Hello,

I'm using C# to write an html based report using keywords stored in a
database whose input I don't control. Before sending the strings to
HTML, I run them through the HttpUtility.HtmlEncode(strIn) function to
prevent my html from acting funny. Today the following error popped
up: " An unexpected exception occurred
System.ArgumentException: Found a low surrogate char without a
preceding high surrogate at index: 640. The input may not be in this
encoding, or may not contain valid Unicode (UTF-16) characters."


Any ideas? Is there any way to do an HtmlEncode with UTF-8?


Here is the affected code...


bResult = CommonUtil.EncodeForHTML(strKeywords, ref strConvert);
if (bResult) strKeywords = strConvert;

if (strKeywords.Length > 1)
{
    strDetail += "<TR><TH> <DIV class=HF> Keywords </DIV></TH>\r\n";
    strDetail += "<TD colspan = 7> <DIV class= DF>" + strKeywords +
                 "</DIV></TD> </TR>\r\n";
}

fReport.WriteLine(strDetail); // <<< WHERE ERROR OCCURS

public static bool EncodeForHTML(string strIn, ref string strOut)
{
    try
    {
        if (strIn.Length < 1) return false;
        strOut = HttpUtility.HtmlEncode(strIn);
        return true;
    }
    catch
    {
        // Any failure (including a null strIn) is reported as "not encoded".
        return false;
    }
}


Thank you,
Marta


Marta Pia
I'm using C# to write an html based report using keywords stored in a
database whose input I don't control. Before sending the strings to
HTML, I run them through the HttpUtility.HtmlEncode(strIn) function to
prevent my html from acting funny. Today the following error popped
up: " An unexpected exception occurred
System.ArgumentException: Found a low surrogate char without a
preceding high surrogate at index: 640. The input may not be in this
encoding, or may not contain valid Unicode (UTF-16) characters."


If you're getting an exception like that, it suggests you've got some
very dodgy data to start with. Have you examined it to look at the
character being complained about?


Oh yes, the characters are dodgy. I am trying to decode which one
actually tripped up the writeline/encode. I might need to strip all
non-printing characters out of the string before writing it to the
file (although, previous to this one, the presence of non-printing
characters didn't cause an exception). Is there a .NET function to
strip out non printing characters or should I write a function to go
through the string character by character?



Well, you could do that. I would think the first port of call should be
working out how you got dodgy data to start with though.
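
For the "go through the string character by character" option, a minimal sketch might look like the following. The class and method names are made up for illustration. It only locates or drops unpaired surrogates, it does not deal with other non-printing characters, and it does nothing about how the bad data got into the string in the first place.

```csharp
using System;
using System.Text;

static class SurrogateCheck
{
    // Returns the index of the first unpaired surrogate, or -1 if there is none.
    public static int IndexOfUnpairedSurrogate(string s)
    {
        for (int i = 0; i < s.Length; i++)
        {
            char c = s[i];
            if (char.IsHighSurrogate(c) && i + 1 < s.Length && char.IsLowSurrogate(s[i + 1]))
            {
                i++;            // valid pair: skip the low half as well
            }
            else if (char.IsSurrogate(c))
            {
                return i;       // high without low, or low without high
            }
        }
        return -1;
    }

    // Returns a copy of s with every unpaired surrogate removed.
    public static string RemoveUnpairedSurrogates(string s)
    {
        StringBuilder sb = new StringBuilder(s.Length);
        for (int i = 0; i < s.Length; i++)
        {
            char c = s[i];
            if (char.IsHighSurrogate(c) && i + 1 < s.Length && char.IsLowSurrogate(s[i + 1]))
            {
                sb.Append(c).Append(s[i + 1]);  // keep valid pairs intact
                i++;
            }
            else if (!char.IsSurrogate(c))
            {
                sb.Append(c);                   // ordinary character
            }
        }
        return sb.ToString();
    }

    static void Main()
    {
        string bad = "abc\uDC00def";            // lone low surrogate at index 3
        string good = "\uD83D\uDE00";           // a valid surrogate pair

        Console.WriteLine(IndexOfUnpairedSurrogate(bad));             // 3
        Console.WriteLine(RemoveUnpairedSurrogates(bad) == "abcdef"); // True
        Console.WriteLine(IndexOfUnpairedSurrogate(good));            // -1
        Console.WriteLine(RemoveUnpairedSurrogates(good) == good);    // True
    }
}
```

Logging the index and the numeric value of the offending char before removing it would help trace where the bad data comes from.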

That aside, why does the character save into a string and encode
without error, but when I try to write it, it fails... ?


Chars are just 16-bit numbers, and a lot of routines will just treat
them as such, whether they're surrogates or not. I suspect that when
the string is written out, it is the process of encoding it to a
byte array for transmission over the wire that notices the problem.
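
A small sketch of that behaviour: string operations accept a lone surrogate without complaint, and the problem only shows up when the string is encoded to bytes. The class name is hypothetical, and the strict UTF8Encoding (constructed with throwOnInvalidBytes set to true) is an assumption chosen to make the failure visible; the writer in the original report may have been configured differently.

```csharp
using System;
using System.Text;

class LoneSurrogateDemo
{
    static void Main()
    {
        string bad = "abc\uDC00";   // low surrogate with no preceding high surrogate

        // Plain string operations treat the char as just another 16-bit value.
        Console.WriteLine(bad.Length);                   // 4
        Console.WriteLine(bad.Substring(0, 3) == "abc"); // True

        // Assumption: a strict encoder that throws on invalid UTF-16 input.
        Encoding strictUtf8 = new UTF8Encoding(false, true);
        try
        {
            strictUtf8.GetBytes(bad);
            Console.WriteLine("encoded without error");
        }
        catch (EncoderFallbackException ex)
        {
            // The encoder is the first component to validate surrogate pairing.
            Console.WriteLine("encoding failed: " + ex.Message);
        }
    }
}
```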
 
Alexander Higgins said:
Point Taken but this is not the case. Thus, if a person writes a text
file on her or his computer and does not use UNICODE to save it, the
current code page is used. If this file is given to someone with some
other current codepage, the file is not displayed correctly. Simply
converting the file to Unicode will make the data display properly.

Yes - that means the *original* data is correct. That's fine - but the
data in the form loaded with the incorrect code page is invalid.

I can have a perfectly valid image file on disk, but if I load it and
throw away the high bit of every byte, the loaded version will be
"dodgy" will it not?

I believe that any string which contains only half of a surrogate pair
either comes from bad data to start with, or has been loaded
inappropriately, resulting in bad data in memory.
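
To illustrate "loaded inappropriately", here is a sketch that reads the same bytes twice, once with the default UTF-8 decoder and once with the code page the bytes were actually written in. The class and helper names are made up, and code page 1252 is assumed to be available (on .NET Core and later, register the code-page provider first, as the comment notes).

```csharp
using System;
using System.IO;
using System.Text;

class ReadWithEncodingDemo
{
    // Decodes the given bytes through a StreamReader using the given encoding.
    static string ReadAll(byte[] data, Encoding encoding)
    {
        using (StreamReader reader = new StreamReader(new MemoryStream(data), encoding))
        {
            return reader.ReadToEnd();
        }
    }

    static void Main()
    {
        // Assumption: on .NET Core / .NET 5+ uncomment the next line first.
        // Encoding.RegisterProvider(CodePagesEncodingProvider.Instance);
        Encoding cp1252 = Encoding.GetEncoding(1252);

        // "TM" followed by byte 0x99, the trademark sign in code page 1252.
        byte[] data = { 0x54, 0x4D, 0x99 };

        string wrong = ReadAll(data, Encoding.UTF8);  // 0x99 is not valid UTF-8 here
        string right = ReadAll(data, cp1252);

        Console.WriteLine(wrong == "TM\uFFFD");       // True: replacement character
        Console.WriteLine(right == "TM\u2122");       // True: trademark sign
    }
}
```

Once the string has been decoded with the right encoding, it can be written out in any encoding that can represent its characters.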
 