HTMLEncode: low surrogate char Error?

  • Thread starter: Marta Pia

Marta Pia

Hello,

I'm using C# to write an HTML-based report using keywords stored in a
database whose input I don't control. Before sending the strings to
HTML, I run them through the HttpUtility.HtmlEncode(strIn) function to
prevent my HTML from acting funny. Today the following error popped
up:

"An unexpected exception occurred
System.ArgumentException: Found a low surrogate char without a
preceding high surrogate at index: 640. The input may not be in this
encoding, or may not contain valid Unicode (UTF-16) characters."

Any ideas? Is there any way to do an HtmlEncode with UTF-8?

Here is the affected code...

bResult = CommonUtil.EncodeForHTML(strKeywords, ref strConvert);
if (bResult) strKeywords = strConvert;

if (strKeywords.Length > 1)
{
    strDetail += "<TR><TH> <DIV class=HF> Keywords </DIV></TH>\r\n";
    strDetail += "<TD colspan = 7> <DIV class= DF>" + strKeywords +
                 "</DIV></TD> </TR>\r\n";
}
fReport.WriteLine(strDetail); // <<< WHERE ERROR OCCURS

// Returns false for empty input or if encoding throws; otherwise
// stores the HTML-encoded text in strOut and returns true.
public static bool EncodeForHTML(string strIn, ref string strOut)
{
    try
    {
        if (strIn.Length < 1) return false;
        strOut = HttpUtility.HtmlEncode(strIn);
        return true;
    }
    catch
    {
        return false;
    }
}

Thank you,
Marta
 
Marta Pia said:
I'm using C# to write an HTML-based report using keywords stored in a
database whose input I don't control. Before sending the strings to
HTML, I run them through the HttpUtility.HtmlEncode(strIn) function to
prevent my HTML from acting funny. Today the following error popped
up: "An unexpected exception occurred
System.ArgumentException: Found a low surrogate char without a
preceding high surrogate at index: 640. The input may not be in this
encoding, or may not contain valid Unicode (UTF-16) characters."

If you're getting an exception like that, it suggests you've got some
very dodgy data to start with. Have you examined it to look at the
character being complained about?
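
One way to look at the offending characters is to scan the string for surrogates that have no valid partner. This is only a rough sketch: the class and method names (SurrogateInspector, FindLoneSurrogates) are made up for illustration, and it assumes a .NET version that has char.IsHighSurrogate and char.IsLowSurrogate (2.0 or later).

```csharp
using System;
using System.Collections.Generic;

// Hypothetical helper, not part of the original poster's code.
public static class SurrogateInspector
{
    // Returns the index of every UTF-16 code unit in text that is a
    // surrogate without a valid partner (a "lone" surrogate).
    public static List<int> FindLoneSurrogates(string text)
    {
        var indexes = new List<int>();
        for (int i = 0; i < text.Length; i++)
        {
            char c = text[i];
            if (char.IsHighSurrogate(c))
            {
                if (i + 1 < text.Length && char.IsLowSurrogate(text[i + 1]))
                {
                    i++; // valid pair: skip the low half as well
                }
                else
                {
                    indexes.Add(i); // high surrogate with no low half after it
                }
            }
            else if (char.IsLowSurrogate(c))
            {
                indexes.Add(i); // low surrogate with no high half before it
            }
        }
        return indexes;
    }
}
```

Something like the following would then print each suspect position and its code unit in hex, which should line up with the index reported in the exception if the same string is being checked:

foreach (int i in SurrogateInspector.FindLoneSurrogates(strKeywords))
    Console.WriteLine("{0}: U+{1:X4}", i, (int)strKeywords[i]);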
 
Jon Skeet said:
If you're getting an exception like that, it suggests you've got some
very dodgy data to start with. Have you examined it to look at the
character being complained about?

Oh yes, the characters are dodgy. I am trying to work out which one
actually tripped up the WriteLine/encode. I might need to strip all
non-printing characters out of the string before writing it to the
file (although, before this one, the presence of non-printing
characters didn't cause an exception). Is there a .NET function to
strip out non-printing characters, or should I write a function to go
through the string character by character?

That aside, why does the character save into a string and encode
without error, but fail when I try to write it?

Take Care,
Marta
 
Marta Pia said:
Oh yes, the characters are dodgy. I am trying to work out which one
actually tripped up the WriteLine/encode. I might need to strip all
non-printing characters out of the string before writing it to the
file (although, before this one, the presence of non-printing
characters didn't cause an exception). Is there a .NET function to
strip out non-printing characters, or should I write a function to go
through the string character by character?

Well, you could do that. I would think the first port of call should be
working out how you got dodgy data to start with though.
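
If you do go the character-by-character route, a minimal sketch might look like the one below. The names (KeywordSanitizer, Sanitize) are hypothetical, and the policy is an assumption rather than a recommendation: lone surrogates are replaced with U+FFFD, control characters other than tab, CR and LF are dropped, and valid surrogate pairs are kept. It assumes .NET 2.0 or later for the char.IsHighSurrogate family of methods.

```csharp
using System.Text;

// Hypothetical helper, not part of the original poster's code.
public static class KeywordSanitizer
{
    // Returns a copy of text that is safe to encode: lone surrogates become
    // U+FFFD and control characters (except tab, CR, LF) are removed.
    public static string Sanitize(string text)
    {
        if (string.IsNullOrEmpty(text)) return string.Empty;

        var sb = new StringBuilder(text.Length);
        for (int i = 0; i < text.Length; i++)
        {
            char c = text[i];
            if (char.IsHighSurrogate(c) && i + 1 < text.Length
                && char.IsLowSurrogate(text[i + 1]))
            {
                // Valid surrogate pair: keep both halves.
                sb.Append(c).Append(text[i + 1]);
                i++;
            }
            else if (char.IsSurrogate(c))
            {
                // Lone high or low surrogate: substitute the replacement character.
                sb.Append('\uFFFD');
            }
            else if (char.IsControl(c) && c != '\t' && c != '\r' && c != '\n')
            {
                // Non-printing control character: drop it.
            }
            else
            {
                sb.Append(c);
            }
        }
        return sb.ToString();
    }
}
```

In the posted EncodeForHTML it could be applied just before encoding, for example strOut = HttpUtility.HtmlEncode(KeywordSanitizer.Sanitize(strIn)). Note that this hides the bad data rather than explaining where it came from.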

Marta Pia said:
That aside, why does the character save into a string and encode
without error, but fail when I try to write it?

Chars are just 16-bit numbers, and a lot of routines will just treat
them as such, whether they're surrogates or not. I suspect the problem
is only noticed when the string is written out, during the process of
encoding it to a byte array for transmission over the wire.
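
The difference can be seen in a small sketch. A string happily holds a lone low surrogate, but a strict UTF-8 encoder refuses to turn it into bytes. The class and method names here (EncodingDemo, CanEncodeAsUtf8) are made up for illustration, and whether your particular writer uses a strict encoder is an assumption you would need to check.

```csharp
using System;
using System.Text;

// Hypothetical helper, not part of the original poster's code.
public static class EncodingDemo
{
    // Returns true if text can be converted to UTF-8 bytes by an encoder
    // that throws on invalid data (throwOnInvalidBytes = true).
    public static bool CanEncodeAsUtf8(string text)
    {
        var strict = new UTF8Encoding(false, true);
        try
        {
            strict.GetBytes(text);
            return true;
        }
        catch (ArgumentException)
        {
            // Invalid surrogate data surfaces as an ArgumentException
            // (or a subclass of it) at encoding time.
            return false;
        }
    }
}
```

For example, with string s = "ok" + '\uDC00'; the string is created without complaint and s.Length is 3, yet EncodingDemo.CanEncodeAsUtf8(s) returns false. Nothing checks the surrogate until the chars have to become bytes.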
 