HTML/XML character encoding getting changed

Jon Davis · Nov 2, 2003

I have written a software application called PowerBlog (PowerBlog.net). It
uses the editing capability of the Internet Explorer WebBrowser control
(essentially a DHTMLTextBox), extracts the user-typed HTML, and assigns it
to an XML node's InnerText property (in C#: System.Xml.XmlDocument obj;
obj.InnerText = myHTML). Later I read InnerText back as a string and write
it to disk.

When this text is displayed in a web browser, special characters outside
the standard ASCII range are not rendered correctly. Frequently, I have
copied text from a web site, pasted it into the DHTMLTextBox, saved, and
published it, and my published output has corrupt characters. However, when
I preview the document before publishing, it looks fine. The corruption
occurs only after it is published (extracted, written to disk, uploaded to
the server via FTP, downloaded via HTTP).

There are several places where this problem could be occurring, and I don't
know how to figure it out.

- A "design feature" in the XmlNode's InnerText property that converts the
&###; encoding into an actual character.
- An encoding flaw when written to disk (currently I'm using the default,
UTF-8 I guess).
- A flaw in the FTP client class where the file is being corrupted during
upload (I think I'm using binary upload format but perhaps I should
double-check).
- A flaw in IIS (no known strange settings exist)

I still need to do some homework on this but I was wondering if anyone has
any bright ideas before I continue searching this out?

Thanks,
Jon

Tobin Harris · Nov 2, 2003

Jon Davis said:
- A "design feature" in the XmlNode's InnerText property that converts the
&###; encoding into an actual character.
- An encoding flaw when written to disk (currently I'm using the default,
UTF-8 I guess).
- A flaw in the FTP client class where the file is being corrupted during
upload (I think I'm using binary upload format but perhaps I should
double-check).
- A flaw in IIS (no known strange settings exist)

For starters I'd rule out the last two options - I think it's almost got to
be in character encoding or the way you're writing it to disk.

As you've noticed, if the source text in your DHTML component is stored in
a different encoding from the one you use to write to disk, you'll lose
information, or it will be written incorrectly. Most encodings store the
ASCII characters (code points below 128) identically, so errors only become
obvious for characters at 128 and above.
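
To make that concrete, here is a small sketch. Python is used only for brevity, and the sample string is made up. It compares three ASCII-compatible encodings (UTF-8, ISO-8859-1, Windows-1252); a two-byte encoding such as UCS-2 would of course differ even for the ASCII part.

```python
# Made-up sample: three ASCII letters followed by U+00E9 (e with acute accent).
sample = "caf\u00e9"

utf8 = sample.encode("utf-8")
latin1 = sample.encode("iso-8859-1")
cp1252 = sample.encode("cp1252")

# The ASCII part is byte-for-byte identical in all three encodings.
ascii_part_same = utf8[:3] == latin1[:3] == cp1252[:3] == b"caf"

# The non-ASCII character is where the encodings diverge.
print(ascii_part_same)  # True
print(utf8)             # b'caf\xc3\xa9'
print(latin1)           # b'caf\xe9'
```

So a file can look perfectly healthy until the first accented letter or curly quote shows up.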

I'd be interested to find out what encoding the DHTML control uses to
store its source. UCS-2 is, as far as I'm aware, the standard Windows
encoding, so you might want to try writing to disk using that encoding
rather than UTF-8. The StreamWriter class lets you set the encoding before
writing. Hopefully you'll not get any loss of information, which is what is
happening when you try to write UCS-2 as UTF-8! Just a guess, but worth a
try!
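
As a rough illustration of choosing the encoding at write time, here is a sketch in Python rather than C#, purely for brevity. The sample text and the temporary file are made up, and Python's "utf-16" codec stands in for UCS-2 (an assumption: the two agree for characters in the Basic Multilingual Plane).

```python
import os
import tempfile

# Made-up sample: markup containing curly quotes and an accented letter.
text = "<p>\u201ccaf\u00e9\u201d</p>"

# Create a throwaway file so the sketch does not touch real data.
fd, path = tempfile.mkstemp(suffix=".html")
os.close(fd)
try:
    # Choose the encoding explicitly instead of relying on a default.
    with open(path, "w", encoding="utf-16", newline="") as f:
        f.write(text)

    # Read the raw bytes back to see what actually landed on disk.
    with open(path, "rb") as f:
        raw = f.read()
finally:
    os.remove(path)

# The "utf-16" codec writes a byte order mark first.
has_bom = raw[:2] in (b"\xff\xfe", b"\xfe\xff")
round_trip = raw.decode("utf-16")

print(has_bom)             # True
print(round_trip == text)  # True
```

Reading the bytes back and comparing is the quickest way to tell whether the write step is where characters go missing.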

HTH

Tobin

Jon Davis · Nov 4, 2003

Thanks Tobin. I'll check out UCS-2, et al.

Jon

Mihai N. · Nov 4, 2003

Tobin Harris said:
Hopefully you'll not get any loss of information, which is what is
happening when you try to write UCS-2 as UTF-8!

Unlikely. UCS-2 and UTF-8 are two different representations of the same
character set (Unicode). No information is lost when you convert from one
to the other (provided the conversion is done correctly).
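
A quick way to convince yourself, sketched in Python for brevity. The sample string is made up, and "utf-16-le" is used as a stand-in for UCS-2, which is an assumption that holds for the Basic Multilingual Plane characters used here.

```python
# Made-up sample: Latin with an accent, the euro sign, and two CJK characters.
text = "caf\u00e9 \u20ac \u4e2d\u6587"

utf16 = text.encode("utf-16-le")
utf8 = text.encode("utf-8")

# Convert each byte form into the other by decoding to text in between.
utf16_to_utf8 = utf16.decode("utf-16-le").encode("utf-8")
utf8_to_utf16 = utf8.decode("utf-8").encode("utf-16-le")

# Both directions reproduce the other representation exactly.
lossless = (utf16_to_utf8 == utf8) and (utf8_to_utf16 == utf16)
print(lossless)  # True
```

The byte sequences differ, but both decode back to exactly the same characters.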

Unlikely. Even if binary mode is not set on the FTP upload, the only
characters likely to be damaged are the control characters (below 0x20).

Most probable. As a test, add this to the HTML file, as the first element
in the <head> section:

<META HTTP-EQUIV="Content-Type" CONTENT="text/html; charset=utf-8">

Without it, the browser will assume its default encoding, typically
iso-8859-1.
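
The effect is easy to reproduce. This is a sketch in Python, chosen for brevity, with a made-up sample string; the two decode calls simulate a browser that guesses iso-8859-1 versus one that is told charset=utf-8.

```python
# Made-up sample: three ASCII letters followed by U+00E9.
text = "caf\u00e9"

# Bytes as stored on the server when the file is saved as UTF-8.
on_disk = text.encode("utf-8")  # b'caf\xc3\xa9'

# No charset declared: the two UTF-8 bytes are read as two separate characters.
wrong = on_disk.decode("iso-8859-1")  # "caf" + U+00C3 + U+00A9, shown as cafÃ©

# charset=utf-8 declared: the same bytes decode to the original text.
right = on_disk.decode("utf-8")

print(wrong == "caf\u00c3\u00a9")  # True
print(right == text)               # True
```

Note that the file itself is identical in both cases; only the interpretation changes, which matches the symptom of a correct preview and a garbled published page.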

Mihai
