HTML/XML character encoding getting changed

Jon Davis

I have written a software application called PowerBlog (PowerBlog.net). It
uses the editing capability of the Internet Explorer WebBrowser control
(essentially a DHTMLTextBox), extracts the user-typed HTML, and assigns it
to an XML node's InnerText property (in C#: System.Xml.XmlDocument obj;
obj.InnerText = myHTML). Later I read InnerText back as a string and write
it to disk.

When this text is displayed in a web browser, special characters outside the
standard ASCII charset are not rendered correctly. Frequently, I have copied
text from a web site, pasted it into the DHTMLTextBox, saved, and published
it, and my published output has corrupt characters. However, when I preview
the document before publishing, it looks fine; the corruption occurs only
after it is published (extracted, written to disk, uploaded to the server
via FTP, downloaded via HTTP).

There are several places where this problem could be occurring, and I don't
know how to figure it out.

- A "design feature" in the XmlNode's InnerText property that converts the
&###; encoding into an actual character.
- An encoding flaw when written to disk (currently I'm using the default,
UTF-8 I guess).
- A flaw in the FTP client class where the file is being corrupted during
upload (I think I'm using binary upload format but perhaps I should
double-check).
- A flaw in IIS (no known strange settings exist)

I still need to do some homework on this but I was wondering if anyone has
any bright ideas before I continue searching this out?

Thanks,
Jon
 
Tobin Harris

Jon Davis said:
- A "design feature" in the XmlNode's InnerText property that converts the
&###; encoding into an actual character.
- An encoding flaw when written to disk (currently I'm using the default,
UTF-8 I guess).
- A flaw in the FTP client class where the file is being corrupted during
upload (I think I'm using binary upload format but perhaps I should
double-check).
- A flaw in IIS (no known strange settings exist)

For starters I'd rule out the last two options - I think it's almost got to
be in character encoding or the way you're writing it to disk.

As you note, if the source text in your DHTML component is stored in a
different encoding from the one you use to write to disk, then you'll lose
information, or it will be written incorrectly. Most encodings store the
ASCII characters (code points below 128) identically, so errors only become
obvious for characters at 128 and above.
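
To make the 128 boundary concrete, here is a minimal sketch. It is in Python rather than the application's C#, purely for illustration, and the two sample characters are assumptions chosen for the demo, not taken from Jon's data. It encodes one ASCII letter and one accented letter in three common encodings and compares the resulting bytes.

```python
# Compare how three common encodings store an ASCII and a non-ASCII character.
ENCODINGS = ["utf-8", "iso-8859-1", "cp1252"]

def encoded_forms(ch):
    """Return a dict mapping encoding name -> encoded bytes for one character."""
    return {name: ch.encode(name) for name in ENCODINGS}

ascii_forms = encoded_forms("A")         # code point 65, inside ASCII
accent_forms = encoded_forms("\u00e9")   # e-acute, code point 233, outside ASCII

for name in ENCODINGS:
    print(name, ascii_forms[name], accent_forms[name])
```

The ASCII letter comes out as the same single byte in all three, while the accented letter is two bytes in UTF-8 and one byte in the other two. That difference is where a mismatch between writer and reader starts to show.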

I'd be interested to find out what encoding the DHTML control uses to
store its source code. UCS-2 is, as far as I'm aware, the standard Windows
encoding, so you might want to try writing to disk using this encoding
rather than UTF-8. The StreamWriter class lets you set the encoding before
writing. Hopefully you'll not get any loss of information, which is what is
happening when you try to write UCS-2 as UTF-8! Just a guess, but worth a
try!

HTH

Tobin
 
Jon Davis

Thanks Tobin. I'll check out UCS-2, et al.

Jon


 
Mihai N.

Tobin Harris said:
Hopefully you'll not get any loss of information, which is what is
happening when you try to write UCS-2 as UTF-8!

Unlikely. UCS-2 and UTF-8 are two different encodings of the same
character set (Unicode). No information is lost when you convert from one
to the other, provided the conversion is done correctly.
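
A quick way to check the lossless claim is a round trip. The sketch below is Python, used only for illustration, and rests on two assumptions: the sample string is made up, and the "utf-16-le" codec stands in for UCS-2, which it matches for characters in the Basic Multilingual Plane.

```python
# Round-trip sample text through UTF-16 and UTF-8 to check that nothing is lost.
# Assumption: "utf-16-le" stands in for UCS-2; the two agree for every
# character in the Basic Multilingual Plane, which covers this sample.
original = "\u201cna\u00efve\u201d costs \u20ac5"  # curly quotes, i-diaeresis, euro sign

utf16_bytes = original.encode("utf-16-le")                  # in-memory style form
via_utf8 = utf16_bytes.decode("utf-16-le").encode("utf-8")  # converted for disk
restored = via_utf8.decode("utf-8")                         # read back

print(restored == original)              # True: same characters
print(len(utf16_bytes), len(via_utf8))   # byte counts differ between the forms
```

The byte sequences differ, but the decoded text is identical. Corruption only appears when bytes written in one encoding are read as if they were in another.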

An FTP flaw is unlikely. Even if binary mode is not set, the only damaged
characters will be the control characters (below 0x20).

The most probable cause is a missing encoding declaration. As a test, add
this to the HTML file, as the first element in the <head> section:

<META HTTP-EQUIV="Content-Type" CONTENT="text/html; charset=utf-8">

Without it the browser will assume the default is iso-8859-1.
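
The effect of that wrong default can be reproduced in a few lines. This is a Python sketch for illustration only, and the sample word is an assumption. It writes text as UTF-8 bytes, then decodes those bytes both ways.

```python
# Show what happens when UTF-8 bytes are decoded with the wrong charset.
text = "r\u00e9sum\u00e9"                     # sample word with two e-acute letters
utf8_bytes = text.encode("utf-8")             # what ends up in the file

as_declared = utf8_bytes.decode("utf-8")      # reader told charset=utf-8
as_assumed = utf8_bytes.decode("iso-8859-1")  # reader falls back to iso-8859-1

print(as_declared)   # the original word
print(as_assumed)    # each e-acute shows up as two wrong characters
```

Each accented letter is two bytes in UTF-8, and an iso-8859-1 reader treats every byte as its own character, so the word grows from six characters to eight. The file itself is undamaged, which would be consistent with the local preview looking fine.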

Mihai
 
