.NET 2.0 - Sending Emails - Subject and Attachment Name Encoding Issues

  • Thread starter: Shan Plourde

Shan Plourde

Hi there,
I have an e-commerce website that sends automated emails that contain
an automatically generated PDF attachment. It's similar to this email
sample:

------
Email Sample:
Subject: <Name> just completed their assessment
Attachment: <Name> assessment results.PDF
Body: This is a friendly alert that <Name> just completed their
assessment...
------

The system correctly stores <Name> in a SQL Server nvarchar database
column. Some names stored use Latin characters, while others use
Chinese characters. Notably the issue that I've just recently found is
that the automatically generated emails are not showing <Name>
correctly if the person's name happens to have Chinese characters in
it.

Here is the code that is used to create and send the emails:

------
SmtpClient client = new SmtpClient();
MailAddress from = new MailAddress(this.emailFrom);

MailAddress to = new MailAddress(this.emailTo,
    String.Format("{0} {1}", this.firstName, this.lastName));

MailMessage msg = new MailMessage(from, to);

if (this.attachment != null)
{
    this.attachment.NameEncoding = System.Text.Encoding.UTF8;
    msg.Attachments.Add(this.attachment);
}

msg.SubjectEncoding = System.Text.Encoding.UTF8;

// The line below sets the subject to something like:
// "<Name> just completed their assessment"
msg.Subject = templateManager.ProcessFileTemplate(
    "AssessmentComplete.Subject.vm", context);

msg.Body = templateManager.ProcessFileTemplate(
    "AssessmentComplete.vm", context);

msg.BodyEncoding = System.Text.Encoding.UTF8;
client.Send(msg);
------

I built this a while back assuming that since I was setting all email
encodings to UTF8, the email system would work for various languages.
Unfortunately that's not the case. When an automatically generated
email is sent where the person's name contains Chinese characters, the
subject will typically read:

"??? ?? ??? just completed their assessment"

The email body however correctly displays the Chinese name. The PDF
file attachment name will be named:
"??? ?? ??? assessment results.PDF"

msg.Subject does write to console and shows in the debugger correctly
with the Chinese characters, as does the body and file attachment
name. So the storage of the Chinese characters is correct, as is the
retrieval into .NET String objects.

The error seems to be happening during the transport of the email I
suspect. When I hard-code the subject and attachment encodings to
something such as System.Text.Encoding.GetEncoding("gb2312"), which is
Simplified Chinese, the Chinese characters display correctly.

Why wouldn't UTF8 encoding work though? Also, I'm not able to simply
hardcode an encoding of "gb2312" as the subject and file attachment
encoding - what happens if names are stored in other languages? This
would clearly fail. Maybe there's a way to guess at the encoding of
a .NET string, but I'm not aware of one. Shouldn't UTF8 work in the
first place?

What do people normally do to handle this? Should I create email
subject lines with Unicode escape codes for all characters? If so, is
there an out of the box approach to do this?

Confused!

Thanks for your help,
Shan
 
The error seems to be happening during the transport of the email I
suspect.

Can you first verify that the text is in fact corrupt (by, for
example, sending mail elsewhere) and that it's not just your e-mail
reader that's breaking things?
 
Hi UL-Tomten - I verified this issue with the following email clients:
Gmail, Outlook 2007, Outlook 2003. I also tried forwarding emails back
and forth to these clients and the same issue was occurring. Each
client demonstrates the same issue - the Chinese characters display as
question marks, and the rest of the email's Latin characters display
without any issues. And again, it's the email subject and file
attachment name that demonstrate this issue. Chinese characters in the
email body are fine.
 
Here are some sample Chinese Characters that can cause this issue to
occur (just random characters that I'm using for testing purposes):
 
The Chinese characters I tried to show just now are not appearing here
on this newsgroup when I post through Google Groups. Anyhow, it doesn't
seem to matter which characters are used; if there are any Chinese
characters, the issue occurs.
 
Your sample code works as it should when I run it. Could you try the
following:

1. Instead of using the templateManager, try hard-coding a string with
Chinese characters into your code
2. Send the mail to gmail, and click "options" and then "show
original" and tell us what the "subject" line says (it should be
something along the lines of "=?utf-8?B?....==?=");
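
For anyone unsure what that raw header should look like, here is a
minimal sketch of the "=?charset?B?payload?=" form (an RFC 2047
encoded-word using Base64). The class and method names are made up for
illustration, and it only handles a single short "B" encoded-word, not
the splitting and folding that a real mailer applies to long subjects.

```csharp
using System;
using System.Text;

// Hypothetical helper, not part of System.Net.Mail.
public static class EncodedWordSketch
{
    // Builds one Base64 ("B") encoded-word from the UTF-8 bytes of the text.
    public static string Encode(string text)
    {
        byte[] bytes = Encoding.UTF8.GetBytes(text);
        return "=?utf-8?B?" + Convert.ToBase64String(bytes) + "?=";
    }

    // Decodes one "B" encoded-word of the shape =?charset?B?payload?=
    public static string Decode(string encodedWord)
    {
        // Base64 never contains '?', so splitting on it yields exactly
        // five parts: "=", charset, "B", payload, "=".
        string[] parts = encodedWord.Split('?');
        bool wellFormed = parts.Length == 5
            && parts[0] == "="
            && parts[4] == "="
            && string.Equals(parts[2], "B", StringComparison.OrdinalIgnoreCase);
        if (!wellFormed)
        {
            throw new FormatException("Not a single Base64 encoded-word.");
        }

        Encoding charset = Encoding.GetEncoding(parts[1]);
        return charset.GetString(Convert.FromBase64String(parts[3]));
    }
}
```

Pasting the Subject value from "show original" into Decode should give
back the Chinese text if the header left the sender intact. If the
charset says utf-8 but the decoded text already contains question
marks, the damage happened before the message reached that mailbox.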
 
Hi UL-Tomten, thanks for following up. Actually I lied! I only tested
this with Outlook 2007 at first. After testing --- and I did try
everything that you suggested to isolate stuff, and the subject did
indeed only contain Chinese characters during a debugging session
where I was using the debugger to change message properties and send
to various email addresses --- I have found the following with emails
with Chinese characters in the subject, body and file attachment name:

1. When the email is sent to Gmail and Yahoo email addresses and
viewed with their web viewers, all Chinese characters show correctly
2. When the email is sent to a Microsoft Exchange Server email address
at my company and viewed in my Outlook 2007 client, the subject and
file attachment name show question marks, but the body shows Chinese
Characters
3. When the email is sent to a Gmail email address and viewed with an
Outlook 2007 configured email client that retrieves Gmail mail using
pop3, all Chinese characters show correctly!
4. When the email is sent to a Microsoft Exchange Server email address
at my company and viewed in my company's Microsoft webmail interface,
the same issue - the question marks - again happens
5. Here's an interesting one - when the email is sent to my wife's
work email, which is also Microsoft Exchange Server, her Outlook 2007
correctly shows all Chinese characters!
6. A colleague from another company, which also uses Microsoft
Exchange Server, received the email but saw the same issue that I
have

I'm not sure what that means then. To summarize though, it's only my
work email that the issue is happening with. Could it be possible that
perhaps then there's some sort of Exchange Server problem? I am not
really sure right now, but interested to know if you may have any
ideas.

Thanks
Shan
 
Could it be possible that perhaps then there's some sort of Exchange
Server problem? I am not really sure right now, but interested to
know if you may have any ideas.

I've had these problems myself, in which case I think Outlook 2000 was
the problem. My guess at the time was that the UI control that
rendered the subject didn't support Unicode and/or Uniscribe. To
render the Chinese characters, a different font has to be used, so it
may even boil down to a client setting problem (if you choose a weird
font in Outlook, it could potentially break things, I'm not sure). But
since it works for you on the same computer using the same Outlook
2007 with POP3, but not Exchange (right?), that's not likely to be the
problem anymore.

Either way, the e-mails produced by your code are perfectly valid, and
they should work. Perhaps changing some settings in regard to e.g. the
transport encoding might help, but that sounds like an unreliable
workaround.

I haven't been able to reproduce these problems on XPSP2 using any
combination of Gmail, IE7, Firefox 2, OWA2007 and Thunderbird 2, and I
don't have an Outlook handy, so I'm not sure how to help you further.
But what I would do is try to track down the exact location of the
problem, both in the Web case and the Outlook case.

So, in the Web case: can you verify it's not an encoding problem in
the browser? Could you try a different browser? If you use Fiddler 2
to inspect the traffic from the web server, can you verify that the
Chinese characters are question marks before they reach your browser?
Or can you perhaps "View source" using a reliable source viewer?

In the Outlook case: If you open a corrupt message and go to View ->
Options -> Internet headers (I think that's where it was in Outlook
2000 at least), you should be able to see the un-decoded headers,
including the Subject field. Can you see what it says there?
 
I haven't been able to reproduce these problems [...]

Regarding the Web case; I've tested one OWA 2003 installation, and the
simple client renders its pages as Western European, which means
Chinese characters are lost. The rich client works as expected, and
renders its pages as Unicode. I've also tested an OWA 2007
installation, where both the simple and rich clients worked. This
might be a server-side setting; I have a feeling a Chinese Windows
Server installation would not render as Western European by default. I
also have a feeling the simple client in OWA 2003 doesn't render as
Unicode because UTF-8 support in browsers wasn't as good five years
ago as it is now. Either way, not related to the Framework or your
code.
 
Hi UL - You're absolutely right. I was testing yesterday with our IT
director and he was analyzing the raw incoming data into our Microsoft
Exchange Server mail server - the raw message subject coming in was
indeed UTF-8 encoded, but something in the email server processing
pipeline was stripping out some of the stream's characters, and
leaving the encoding as UTF-8. The net result is that some server-side
process in the pipeline is destroying the subject, specifically
if the encoding is UTF-8. If I set the encoding to a specific
encoding, i.e. such as "gb2312" to handle Chinese characters within a
subject line, then the server side process keeps the subject stream
intact. As of now he wasn't sure what in the processing pipeline was
causing the issue but he was still investigating.
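
As a rough sketch of what "setting the encoding specifically" looks
like in code: the header encoding is passed in and applied to the
subject and the attachment name, while the body stays UTF-8 since the
body was never affected. The class and method names here are
hypothetical; only the MailMessage and Attachment properties are the
real System.Net.Mail ones already used in the original code.

```csharp
using System.Net.Mail;
using System.Text;

// Hypothetical helper; names are for illustration only.
public static class HeaderEncodingSketch
{
    public static MailMessage Build(string from, string to, string subject,
        string body, Encoding headerEncoding, Attachment attachment)
    {
        MailMessage msg = new MailMessage(from, to);

        // Subject and attachment name use the caller-chosen charset,
        // for example Encoding.GetEncoding("gb2312").
        msg.SubjectEncoding = headerEncoding;
        msg.Subject = subject;

        if (attachment != null)
        {
            attachment.NameEncoding = headerEncoding;
            msg.Attachments.Add(attachment);
        }

        // The body displayed correctly everywhere, so it stays UTF-8.
        msg.BodyEncoding = Encoding.UTF8;
        msg.Body = body;
        return msg;
    }
}
```

Note that on newer runtimes (.NET Core and later), code-page encodings
such as "gb2312" are only available after registering the code pages
encoding provider; on the .NET Framework they are built in.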

I'm guessing that if I set the encoding specifically like this, it
will also help to decrease the chances of other users using this
service experiencing the same issue - some of their mail servers might
also be destroying the subject in their processing pipelines.

Unfortunately that means that I have to do a bit of refactoring to the
message sending code to not simply make it UTF-8. Since the website
operates in a finite number of languages, it should be somewhat safe
to set the encoding based on the language in which a given user was
using the website at the time that they completed a self-assessment. Of
course it won't work if someone enters their name containing say
Chinese and Japanese characters, but in reality this should never be
the case.

I wish I could just use UTF-8 for everything, it is my preference, but
unfortunately it won't work as reliably as setting the encoding.

Interested to hear if you may have found the same issue with Exchange
servers or other mail servers.

Thanks again,
Shan
 
intact. As of now he wasn't sure what in the processing pipeline was
causing the issue but he was still investigating.

Please post the findings here eventually. It will bring some nice
closure to my bad experiences of yore.

message sending code to not simply make it UTF-8. Since the website
operates in a finite number of languages, it should be somewhat safe
to set the encoding based on the language that the website was used by

I think you'll have two challenges here:

1. Finding a suitable encoding for each language that is compatible
with .NET as well as users' e-mail clients and web browsers.

2. Coping with only having access to the encoding-specific characters
and ASCII.

You might want to make a note of which, if any, characters are lost
when you encode your messages as non-Unicode encodings: decode the
encoded string and compare it with the original string, or write your
own encoder fallback class that informs you of character fallbacks.
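
Both checks can be sketched roughly as follows. Instead of a custom
fallback class, this uses the framework's built-in exception fallback,
which throws on the first character the target encoding cannot
represent. The class and method names are invented for the example.

```csharp
using System;
using System.Text;

// Hypothetical helper; names are for illustration only.
public static class LossCheckSketch
{
    // True when the text comes back unchanged after encode + decode.
    // This also catches "best fit" substitutions, not only '?'.
    public static bool SurvivesRoundTrip(string text, Encoding encoding)
    {
        string decoded = encoding.GetString(encoding.GetBytes(text));
        return string.Equals(text, decoded, StringComparison.Ordinal);
    }

    // Index of the first character that cannot be encoded, or -1 if
    // the whole string is representable in the named encoding.
    public static int FirstUnencodableIndex(string text, string encodingName)
    {
        Encoding strict = Encoding.GetEncoding(
            encodingName,
            EncoderFallback.ExceptionFallback,
            DecoderFallback.ExceptionFallback);
        try
        {
            strict.GetBytes(text);
            return -1;
        }
        catch (EncoderFallbackException ex)
        {
            return ex.Index;
        }
    }
}
```

Running either check on each name before sending would tell you
whether a given subject can safely go out in the legacy encoding, and
the index makes it easy to log exactly which character was the
problem.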
 
indeed UTF-8 encoded, but something in the email server processing
pipeline was stripping out some of the stream's characters, and
leaving the encoding as UTF-8. The net result is whatever that server

I'm guessing that Exchange is configured to somehow alter and/or
inspect subject lines, and does not support all the characters in
UTF-8, thus silently mangling subject lines upon re-encoding. Perhaps
there is a missing service pack somewhere, or there is a requirement
on "asian text support" being installed which is not satisfied.

I'm guessing there is a better newsgroup for this discussion now than
this one...
 
Thanks UL-Tomten. I have started a new thread at
http://groups.google.ca/group/micro...8febc4d7815/3f948dbbb4e23f87#3f948dbbb4e23f87
in case you're curious to follow it. The IT director of my company has
found nothing yet so I figured I'd just query other admins on the
Exchange Server groups to see what they may have found.

As far as your app dev concerns go, I don't see them being a
challenge. They should help to increase the likelihood of success. For
example, the website that hosts the automated email features a test
that can be taken in many languages. If a user opts to take the test
in say Simplified Chinese, then I can fairly safely make assumptions
that the test taker will either enter their name with standard Latin
characters, or perhaps enter their name with Simplified Chinese
characters, although the probability of them taking the test in
Chinese, yet having say a different multi-byte name entered, perhaps
Korean or Japanese, is extremely low.

So in the case of Chinese, an encoding such as "gb2312" should be a
somewhat safe solution. Should users input Latin characters, those
Latin characters will be fine, as will the Chinese characters - or at
least the Chinese characters will be viewable on more email clients
versus going with a UTF-8 encoding.

It is a pain, but a very worthwhile discovery, especially considering
all of the email examples out there today in the .NET world that
simply say "go with UTF-8" - even the examples from Microsoft!
 
that can be taken in many languages. If a user opts to take the test
in say Simplified Chinese, then I can fairly safely make assumptions
that the test taker will either enter their name with standard Latin
characters, or perhaps enter their name with Simplified Chinese

Yeah, I agree. I see only two non-crucial concrete issues:

1. You have no way to recover from encoding failures. The best you can
do is include a warning to the recipient that the text may have (or
has, if you check) lost some characters.

2. While "latin" characters are present in all local encodings,
Unicode does not restrict latin to ASCII. Wérnêr vòn Schïnkélknüß is a
name written in latin characters, but which part of it can be
represented in gb2312? Maybe all, I don't know.

Thanks for taking such a thorough approach in tracking down the
problem!
 
True UL-Tomten about your point with extended Latin characters, but in
this particular case with this website, I expect that type of scenario
to occur much less regularly than the issue that is currently
happening where someone who reads Simplified Chinese also writes their
name in it, or writes their name with basic Latin characters. Database
monitoring will reveal what the percentages look like.

I have implemented a mapping logic on the website now which defaults
to the gb2312 encoding within automated emails if the assessment is
taken in Simplified Chinese. For the sake of testing, I tried
specifying a first name containing Simplified Chinese characters, and
a last name of Schïnkélknüß. Wouldn't you know that the extended Latin
characters didn't render successfully in the email? Hehe, oh well,
I'll play the game of percentages here though - it's safer in my
opinion to opt for a gb2312 encoding as the default if the test is
taken in Simplified Chinese. Of course, I'm very frustrated that
certain email servers seem to not be able to handle UTF-8 very well;
I wouldn't be doing this if they could!
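
A mapping along those lines might look something like the sketch
below. This is not the actual site code: the class name, the language
keys and the single table entry are assumptions for illustration. It
defaults to UTF-8 for unmapped languages and reports whether the
chosen charset would lose characters, so the caller can add a
legibility warning to the email.

```csharp
using System;
using System.Collections.Generic;
using System.Text;

// Hypothetical helper; the mapping table is an assumption.
public static class MailEncodingChooser
{
    // Site language -> legacy charset name. Only Simplified Chinese is
    // listed here; other languages would be added as needed.
    private static readonly Dictionary<string, string> CharsetByLanguage =
        new Dictionary<string, string>(StringComparer.OrdinalIgnoreCase)
        {
            { "zh-CN", "gb2312" }
        };

    // Picks the header encoding for a site language. 'lossy' is set to
    // true when the text does not survive that encoding unchanged.
    public static Encoding Choose(string siteLanguage, string text, out bool lossy)
    {
        string charsetName;
        if (!CharsetByLanguage.TryGetValue(siteLanguage, out charsetName))
        {
            // No legacy charset mapped for this language: use UTF-8.
            lossy = false;
            return Encoding.UTF8;
        }

        Encoding chosen = Encoding.GetEncoding(charsetName);
        string roundTripped = chosen.GetString(chosen.GetBytes(text));
        lossy = !string.Equals(text, roundTripped, StringComparison.Ordinal);
        return chosen;
    }
}
```

With this shape, a mixed name like the test case above would still go
out in gb2312, but with the lossy flag set, which is the cue to include
the warning in the message.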

I like your suggestion about placing a warning in these emails about
character legibility - thanks for that. I'll incorporate that
generically into the emails.

If I do find anything else I'll post here. Hopefully this information
comes in handy for others that may run into similar issues.

Thanks
Shan
 