Creating ANSI text files with international characters

  • Thread starter: Lars-Erik Aabech

Lars-Erik Aabech

Hi!

I haven't started this yet, and I want to make sure of a few things first.
Hope someone can help :)

What I want to do is to mail vCalendar files as attachments to users of our
program. I've got the vCalendar part right, except for the file encoding. It
needs to be ANSI, but System.IO.File.CreateText only creates UTF-8 encoded
files.

Would you give me your opinions on these points?
1. I presume I have to use System.IO.BinaryWriter and some encoding method
from System.Text.
2. I need to include line breaks and Norwegian characters like æøå in the
description of the vCalendar - Will try replacing \r\n with \n, but am not
sure if æøå will be written correctly. (They aren't if I save a UTF-8
encoded vCalendar file with ANSI encoding in Notepad.)

Lars-Erik
 
Lars-Erik Aabech said:
I haven't started this yet, and I want to make sure of a few things first.
Hope someone can help :)

What I want to do is to mail vCalendar files as attachments to users of our
program. I've got the vCalendar part right, except for the file encoding. It
needs to be ANSI, but System.IO.File.CreateText only creates UTF-8 encoded
files.

What exactly do you mean by "ANSI" here? There are several ANSI code
pages. You need to know which one you should be using.
Would you give me your opinions on these points?
1. I presume I have to use System.IO.BinaryWriter and some encoding method
from System.Text.

No. Create a FileStream, and then a StreamWriter on top of that,
specifying the encoding. BinaryWriters are there to write mostly binary
files, not text files.

In fact, you can get StreamWriter to create the FileStream for you, if
you use the appropriate constructor, e.g.

StreamWriter writer = new StreamWriter("myfile.txt", false,
                                       Encoding.Whatever);
2. I need to include line breaks and Norwegian characters like æøå in the
description of the vCalendar - Will try replacing \r\n with \n, but am not
sure if æøå will be written correctly. (They aren't if I save a UTF-8
encoded vCalendar file with ANSI encoding in Notepad.)

If you specify the correct encoding, it'll be fine.
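
A minimal sketch of the StreamWriter approach described above. It assumes ISO-8859-1 is what the receiver expects, which the thread has not settled at this point. The class name, the WriteText helper and the temp-file name are made up for illustration, and the strings use \u escapes so the example does not depend on how the source file itself is saved.

```csharp
using System;
using System.IO;
using System.Text;

static class EncodedFileDemo
{
    // Writes text to path using the given encoding, overwriting any existing file.
    public static void WriteText(string path, string text, Encoding encoding)
    {
        // StreamWriter(string, bool, Encoding) opens the file itself;
        // 'false' means overwrite rather than append.
        using (StreamWriter writer = new StreamWriter(path, false, encoding))
        {
            writer.Write(text);
        }
    }

    static void Main()
    {
        // Hypothetical file name, placed in the temp directory.
        string path = Path.Combine(Path.GetTempPath(), "encoded-demo.txt");

        // "\u00E6\u00F8\u00E5" is æøå.
        WriteText(path, "\u00E6\u00F8\u00E5", Encoding.GetEncoding("iso-8859-1"));

        // ISO-8859-1 uses one byte per character and writes no signature.
        Console.WriteLine(BitConverter.ToString(File.ReadAllBytes(path))); // E6-F8-E5
    }
}
```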
 
First of all... tnx a lot :) I'm really happy I asked..

I mean ANSI as in any ANSI - it works with any codepage ;)
But it is very relevant for my second point - and I believe the codepage for
Norwegian is 865.

So far, I solved the problem by replacing
System.IO.StreamWriter sw = System.IO.File.CreateText(sFileName);
with
System.Text.Encoding enc = System.Text.Encoding.ASCII;
System.IO.StreamWriter sw = new System.IO.StreamWriter(sFileName, false,
enc);

Thus I'm extremely satisfied I didn't have to mess everything up by writing
byte arrays :)

I fixed the line break issue as I mentioned, but I'm at a loss on how to
specify the codepage. Do I have to make a derived class from
System.Text.Encoding? All the properties are get only, and the constructor
is protected :/

Lars-Erik

 
Lars-Erik Aabech said:
First of all... tnx a lot :) I'm really happy I asked..

I mean ANSI as in any ANSI - it works with any codepage ;)

In that case you might as well just use ASCII, as I believe that's the
only part which is common to all ANSI code pages.

What exactly do you mean by "it works with any codepage"? *What* works
with any codepage?
But it is very relevant for my second point - and I believe the codepage for
Norwegian is 865.

But what does the receiving program expect? If it doesn't know to use
that codepage, you'll still have problems.
So far, I solved the problem by replacing
System.IO.StreamWriter sw = System.IO.File.CreateText(sFileName);
with
System.Text.Encoding enc = System.Text.Encoding.ASCII;
System.IO.StreamWriter sw = new System.IO.StreamWriter(sFileName, false,
enc);

Thus I'm extremely satisfied I didn't have to mess everything up by writing
byte arrays :)

Good - but that'll only work for ASCII, not the extra characters you
wanted.
I fixed the line break issue as I mentioned, but I'm at a loss on how to
specify the codepage. Do I have to make a derived class from
System.Text.Encoding? All the properties are get only, and the constructor
is protected :/

If the receiving application is *really* expecting code page 865, then
use Encoding.GetEncoding(865).
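
To illustrate the point about ASCII only working for ASCII, here is a small sketch comparing the bytes that Encoding.ASCII and ISO-8859-1 produce for æøå. The class and helper names are made up; ISO-8859-1 is used only because it is available on every .NET version without extra setup, not because it is known to be what the receiver expects.

```csharp
using System;
using System.Text;

static class AsciiLossDemo
{
    // Returns the encoded bytes of text as a hex string such as "E6-F8-E5".
    public static string Hex(string text, Encoding encoding)
    {
        return BitConverter.ToString(encoding.GetBytes(text));
    }

    static void Main()
    {
        string text = "\u00E6\u00F8\u00E5"; // æøå

        // ASCII has no mapping for these characters, so each becomes '?' (0x3F).
        Console.WriteLine(Hex(text, Encoding.ASCII)); // 3F-3F-3F

        // ISO-8859-1 maps each of them to a single byte above 127.
        Console.WriteLine(Hex(text, Encoding.GetEncoding("iso-8859-1"))); // E6-F8-E5
    }
}
```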
 
In that case you might as well just use ASCII, as I believe that's the
only part which is common to all ANSI code pages.

What exactly do you mean by "it works with any codepage"? *What* works
with any codepage?

There's obviously a lot I don't know about encoding (although I'd like to).
Do you know some place on the net (not too academic) where I can learn more?
I thought ANSI and ASCII were more or less the same 256-byte set, except the
last x bytes represent different special characters depending on the
codepage specified.

Anyway - I'm creating a vCalendar file (http://www.imc.org/pdi/) which will
be mailed as an attachment to Outlook users (hopefully it will work with
other apps too). Outlook complains if the file isn't encoded correctly. So I
opened one of the generated files in Notepad, saved it as ANSI instead of
UTF-8, and then it worked. These are the facts I based my
statements on ;) (works, doesn't work, ANSI etc)
But what does the receiving program expect? If it doesn't know to use
that codepage, you'll still have problems.

Good - but that'll only work for ASCII, not the extra characters you
wanted.

If the receiving application is *really* expecting code page 865, then
use Encoding.GetEncoding(865).

I'm getting closer at least..
I've tried the following, and all of them were accepted by Outlook 2003,
with assorted presentations of the Norwegian characters: (?, +, empty, etc
:) )

System.Text.Encoding enc = System.Text.Encoding.GetEncoding(865);
System.Text.Encoding enc = System.Text.Encoding.GetEncoding(1252);
System.Text.Encoding enc = System.Text.Encoding.GetEncoding(20127);
System.Text.Encoding enc = System.Text.Encoding.GetEncoding("iso-8859-1");

etc. etc.

Which means that Outlook doesn't give a **** what codepage I use.

I've exported a calendar element from outlook with special characters (while
writing this post) and it appears I have to replace the special characters
with '=E6' etc. and insert some more parameters in the vCalendar file.
Example:
SUMMARY;ENCODING=QUOTED-PRINTABLE:V=E6ret er r=F8tent i =E5r =C6=D8=C5
instead of
SUMMARY:Været er røtent i år ÆØÅ

So, the last question I have would be... Anyone got a magic way to do this
or do I have to do s.Replace("æ", "=E6").Replace("ø", "=xx").... ???
(Maybe a loop over the character codes or such, but still....)

Lars-Erik
 
Lars-Erik Aabech said:
There's obviously a lot I don't know about encoding (although I'd like to).
Do you know some place on the net (not too academic) where I can learn more?

See http://www.pobox.com/~skeet/csharp/unicode.html - that's my best
explanation, and it's got some other things in as well.
I thought ANSI and ASCII were more or less the same 256-byte set, except the
last x bytes represent different special characters depending on the
codepage specified.

ASCII is only 7-bit to start with.

Different ANSI code pages tend to share the first 128 values with
ASCII, and then have different values for the last 128 values. That's
what I mean when I say there's no such thing as "the ANSI encoding".
Anyway - I'm creating a vCalendar file (http://www.imc.org/pdi/) which will
be mailed as an attachment to Outlook users (hopefully it will work with
other apps too). Outlook complains if the file isn't encoded correctly. So I
tried to open one of the generated files with notepad, saved it as ANSI
instead of UTF-8, and then it works. These are the facts I based my
statements on ;) (works, doesn't work, ansi etc)

Looking at the specification, it seems Outlook is being a little too
generous, but that there's a way you can get round it anyway. From the
spec, section 2.1.5:

<quote>
The default character set is ASCII. The default character set can be
overridden for an individual property value by using the "CHARSET"
property parameter. This property parameter may be used on any
property. However, the use of this parameter on some properties may not
make sense.
Any character set registered with the Internet Assigned Numbers
Authority (IANA) can be specified by this property parameter. For
example, ISO 8859-8 or the Latin/Hebrew character set is specified by:
DESCRIPTION;CHARSET=ISO-8859-8:...
Some transports (e.g., MIME based electronic mail) may also provide a
character set property at the transport wrapper level. This property
can be used in these cases for transporting a vCalendar data stream
that has been defined using a default character set other than ASCII
(e.g., UTF-8).
</quote>

I would suggest that you should output ASCII without any CHARSET= tag
where there are no non-ASCII characters, and use UTF-8 otherwise,
specifying CHARSET=UTF-8.

I would certainly *hope* that would work.

Note section 2.1.4, however, which specifies the encoding for the whole
object - it defaults to only 7 bit.
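
A rough sketch of the rule suggested above: leave the property untagged when the value is pure ASCII, and add CHARSET=UTF-8 otherwise. The class and method names are hypothetical, and the sketch only builds the property line; it does not deal with the 7-bit default from section 2.1.4, and whether a given reader such as Outlook honours the CHARSET parameter is a separate question.

```csharp
using System;

static class VCalendarProperty
{
    // True when every character is in the 7-bit ASCII range.
    public static bool IsAscii(string value)
    {
        foreach (char c in value)
        {
            if (c > 127) return false;
        }
        return true;
    }

    // Hypothetical helper: adds CHARSET=UTF-8 only when the value needs it.
    public static string Format(string name, string value)
    {
        return IsAscii(value)
            ? name + ":" + value
            : name + ";CHARSET=UTF-8:" + value;
    }

    static void Main()
    {
        // Pure ASCII value: no CHARSET parameter.
        Console.WriteLine(Format("SUMMARY", "Status meeting"));

        // Value containing å (\u00E5): tagged with CHARSET=UTF-8.
        Console.WriteLine(Format("SUMMARY", "M\u00E5l- og resultatsamtale"));
    }
}
```

The file itself would then have to be written as UTF-8 for the tag to be truthful.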
I'm getting closer at least..
I've tried the following, and all of them were accepted by Outlook 2003,
with assorted presentations of the Norwegian characters: (?, +, empty, etc
:) )

System.Text.Encoding enc = System.Text.Encoding.GetEncoding(865);
System.Text.Encoding enc = System.Text.Encoding.GetEncoding(1252);
System.Text.Encoding enc = System.Text.Encoding.GetEncoding(20127);
System.Text.Encoding enc = System.Text.Encoding.GetEncoding("iso-8859-1");

etc. etc.

Which means that Outlook doesn't give a **** what codepage I use.

It must, because you're *potentially* creating different data. What you
might have seen is either Outlook guessing (which means it might guess
it wrong) or you picking encodings which use the same mappings for
those particular characters.
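
The "potentially different data" point can be checked by dumping the bytes each of the four encodings from the list produces for æøå. This is a sketch, assuming .NET Core or .NET 5+, where code pages 865 and 1252 only become available after registering CodePagesEncodingProvider (on the .NET Framework of the time they were built in). The class and helper names are made up. For context: 865 is the DOS (OEM) Nordic code page, 1252 is the Windows Western European one, and 20127 is US-ASCII.

```csharp
using System;
using System.Text;

static class CodePageComparison
{
    // Looks up an encoding by code page number.
    public static Encoding Get(int codePage)
    {
        // Assumption: .NET Core / .NET 5+, where 865 and 1252 need this provider.
        Encoding.RegisterProvider(CodePagesEncodingProvider.Instance);
        return Encoding.GetEncoding(codePage);
    }

    // Returns the encoded bytes of text as a hex string.
    public static string Hex(string text, Encoding encoding)
    {
        return BitConverter.ToString(encoding.GetBytes(text));
    }

    static void Main()
    {
        string text = "\u00E6\u00F8\u00E5"; // æøå
        Encoding[] candidates =
        {
            Get(865), Get(1252), Get(20127), Encoding.GetEncoding("iso-8859-1")
        };

        // Print one line per encoding: its name and the bytes it produces.
        foreach (Encoding candidate in candidates)
        {
            Console.WriteLine(candidate.WebName + ": " + Hex(text, candidate));
        }
    }
}
```

1252 and iso-8859-1 give identical bytes for these three characters, 865 gives different ones, and 20127 replaces all three with '?', so the files written with those encodings are not interchangeable.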
I've exported a calendar element from outlook with special characters (while
writing this post) and it appears I have to replace the special characters
with '=E6' etc. and insert some more parameters in the vCalendar file.
Example:
SUMMARY;ENCODING=QUOTED-PRINTABLE:V=E6ret er r=F8tent i =E5r =C6=D8=C5
instead of
SUMMARY:Været er røtent i år ÆØÅ

That would be due to using quoted printable, as specified in section
2.1.4.
So, the last question I have would be... Anyone got a magic way to do this
or do I have to do s.Replace("æ", "=E6").Replace("ø", "=xx").... ???
(Maybe a loop over the character codes or such, but still....)

Basically you'd want to look through the created byte array, and any
byte greater than 127 should be quoted - along with '=' presumably (I
haven't checked the quoted printable spec for a while).
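
A minimal sketch of that byte-array loop, assuming the text is first encoded as ISO-8859-1 (which matches the =E6/=F8/=E5 values in the Outlook export quoted above). The class and method names are made up. It is deliberately simplified: besides '=' and bytes above 126 it also quotes control characters such as CR and LF, but it does not insert the soft line breaks that quoted-printable requires for lines longer than 76 characters, and it does not escape trailing spaces.

```csharp
using System;
using System.Text;

static class QuotedPrintable
{
    // Encodes text as quoted-printable using the bytes of the given charset.
    // Simplified: no soft line breaks, trailing spaces are left as they are.
    public static string Encode(string text, Encoding charset)
    {
        StringBuilder result = new StringBuilder();
        foreach (byte b in charset.GetBytes(text))
        {
            // Printable ASCII (including space) passes through, except '='.
            bool literal = b >= 32 && b <= 126 && b != (byte)'=';
            if (literal)
            {
                result.Append((char)b);
            }
            else
            {
                // Everything else becomes '=' followed by two uppercase hex digits.
                result.Append('=').Append(b.ToString("X2"));
            }
        }
        return result.ToString();
    }

    static void Main()
    {
        Encoding latin1 = Encoding.GetEncoding("iso-8859-1");

        // "Været er røtent i år ÆØÅ", written with escapes.
        string summary = "V\u00E6ret er r\u00F8tent i \u00E5r \u00C6\u00D8\u00C5";

        Console.WriteLine("SUMMARY;ENCODING=QUOTED-PRINTABLE:" + Encode(summary, latin1));
        // SUMMARY;ENCODING=QUOTED-PRINTABLE:V=E6ret er r=F8tent i =E5r =C6=D8=C5
    }
}
```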
 
OK, First of all, thnx for reading the spec for me ;) *a little ashamed*

I'll just recap and give you my new status / interpretation of my
restrictions.

First of all, I've been using the wrong file extension for vCalendar files,
I've been using the extension for iCalendar files, which can be encoded with
UTF-8. Now that I've changed to the vCalendar format, only different ANSI
codepages are accepted, but are apparently read as ASCII. (am I at least
getting better at this? ;) )

Anyway - using iso-8859-1 encoding with codepage 1252 which is the
encoding/codepage my outlook uses when exporting to .vcf files, and set the
encoding parameter to quoted-printable for the summary/description property
of the vCalendar object - I'm able to use =0D, =3A etc. for special
characters, but =E5 (å) is stripped when I try to open the file in outlook.

There's no apparent relevant difference between a file outlook exports and
reads perfectly with æøå in it, and the files I generate. Both use
iso-8859-1 encoding with codepage 1252. This is how they look:

Outlook's (displayed correctly):
BEGIN:VCALENDAR
PRODID:-//Microsoft Corporation//Outlook 11.0 MIMEDIR//EN
VERSION:1.0
BEGIN:VEVENT
DTSTART:20040310T070000Z
DTEND:20040310T080000Z
UID:[email protected]
DESCRIPTION;ENCODING=QUOTED-PRINTABLE:M=E5l- og resultatsamtale mellom Lars=
-Erik Aabech og Lars-Erik Aabech=0D=0ATid: 10.03.2004 08:00=0D=0ASam=
taletype: Resultatsamtale=0D=0AKanskje dette funker..=0D=0A
SUMMARY;ENCODING=QUOTED-PRINTABLE:Invitasjon til m=E5l- og resultatsamtale
PRIORITY:3
END:VEVENT
END:VCALENDAR

Mine:
BEGIN:VCALENDAR
VERSION:2.0
METHOD:PUBLISH
BEGIN:VEVENT
UID:[email protected]
LOCATION:
DTSTART:20040310T070000Z
DTEND:20040310T080000Z
DTSTAMP:20040303T134702Z
SUMMARY;ENCODING=QUOTED-PRINTABLE:Invitasjon til m=E5l- og resultatsamtale
DESCRIPTION;ENCODING=QUOTED-PRINTABLE:M=E5l- og resultatsamtale mellom
Lars-Erik Aabech og Lars-Erik Aabech=0DTid: 10.03.2004 08:00=0DSamtaletype:
Resultatsamtale=0D=0DKanskje dette funker..
CLASS:PUBLIC
END:VEVENT
END:VCALENDAR

So, I'm at a complete loss as far as vCalendar files go.

But I found out that iCalendar and vCalendar files use approx. the same syntax
(although I have to admit I haven't read the specs well enough to describe
the differences), and if I export an iCalendar file from Outlook it is
encoded with UTF-8 using codepage 65001 - æøå is saved as plain text, and
line breaks are saved as \n in plain text :)
So, for now I'm gonna change encoding, codepage, syntax & file extension and
pray iCalendar is easier than vCalendar :)

Thanks a lot for your help, Jon! I've learnt a lot today, although I'm
partly giving up :)

Lars-Erik

-

BTW.. here's the file from outlook in iCalendar format :D

BEGIN:VCALENDAR
PRODID:-//Microsoft Corporation//Outlook 11.0 MIMEDIR//EN
VERSION:2.0
METHOD:PUBLISH
BEGIN:VEVENT
DTSTART:20040310T070000Z
DTEND:20040310T080000Z
TRANSP:OPAQUE
SEQUENCE:0
UID:[email protected]
DTSTAMP:20040303T134702Z
DESCRIPTION:Mål- og resultatsamtale mellom Lars-Erik Aabech og Lars-Erik
Aabech\nTid: 10.03.2004 08:00\nSamtaletype: Resultatsamtale\n\nKanskje
dette funker..\n
SUMMARY:Invitasjon til mål- og resultatsamtale
PRIORITY:5
X-MICROSOFT-CDO-IMPORTANCE:1
CLASS:PUBLIC
END:VEVENT
END:VCALENDAR



 
Of course, it didn't work - I'm gonna compare the files with a hex editor
and see what the difference is, and move the thread to pub.outlook something
instead. (since I got the .net specific part all right :) )

Again - tnx for the help :)

L-E
 
It works!!!!

I ended up NOT using any encoding. iCalendar files are unsigned UTF-8 files,
so I'm back at the start :)
What a ride...

Lars-Erik
 
Lars-Erik Aabech said:
OK, First of all, thnx for reading the spec for me ;) *a little ashamed*

I'll just recap and give you my new status / interpretation of my
restrictions.

First of all, I've been using the wrong file extension for vCalendar files,
I've been using the extension for iCalendar files, which can be encoded with
UTF-8. Now that I've changed to the vCalendar format, only different ANSI
codepages are accepted, but are apparently read as ASCII. (am I at least
getting better at this? ;) )

Not really sure what you mean by "read as ASCII"...
Anyway - using iso-8859-1 encoding with codepage 1252

Hang on - ISO-8859-1 is one encoding, and codepage 1252 is a different
one. (They only differ in the 32 positions from 128 to 159, but they *are*
different things.) You don't use an encoding "with" a codepage - a codepage
*is* an encoding.
which is the
encoding/codepage my outlook uses when exporting to .vcf files, and set the
encoding parameter to quoted-printable for the summary/description property
of the vCalendar object - I'm able to use =0D, =3A etc. for special
characters, but =E5 (å) is stripped when I try to open the file in outlook.

I suggest you try specifying the CHARSET as per the spec - it looks
like the generated ones don't, but if you do it's likely to help.

<snip>
 
Lars-Erik Aabech said:
It works!!!!

I ended up NOT using any encoding. iCalendar files are unsigned UTF-8 files,
so I'm back at the start :)
What a ride...

Hang on though - again, you're not quite making sense: UTF-8 isn't
unsigned or signed, but it *is* an encoding itself...
 
Hehe.. I'm getting confused by all the issues with text :)
Not really sure what you mean by "read as ASCII"...

Standard characters (below 128) were rendered, the rest were not..
Hang on - ISO-8859-1 is one encoding, and codepage 1252 is a different
one. (They only differ in the 32 positions from 128 to 159, but they *are*
different things.) You don't use an encoding "with" a codepage - a codepage
*is* an encoding.

I was obviously assuming too much again. I opened the files in Visual Studio
and looked at the selected encoding/codepage for each file in the
File->Advanced Save Options dialog, it said Western European codepage 1252,
and I assumed that was the same as iso-8859-1 with codepage 1252.. I used
GetEncoding("iso-8859-1") to generate the files, so I assumed they were the
same. :)

Umm.. since you cleared up that very influential encoding/codepage
point for me, yet another light appeared over the text mystery.. (I hope) Would
that mean the encoding UTF-8 IS the codepage 65001!?!?

I still haven't read that article of yours, but I will!
I suggest you try specifying the CHARSET as per the spec - it looks
like the generated ones don't, but if you do it's likely to help.

I did, it didn't :/

And to answer your question from my other post, I applied the same method to
look at the encoding of the files that worked, and VS displayed either
Unicode (UTF-8 with signature) codepage 65001, or Unicode (UTF-8 without
signature) codepage 65001. I didn't notice the signature difference at
first, so I checked the files with a hex editor, and noticed the first three
bytes of the file that didn't work - a signature :)
The resulting code was
StreamWriter sw = new StreamWriter(sFileName, false);
et voilà :) no signature, UTF-8, and all special chars are displayed. (after
changing extension to iCalendar of course)
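
For anyone wanting to reproduce the signature check without a hex editor, here is a small sketch. The "signature" is the three-byte UTF-8 byte order mark EF BB BF; passing true or false to the UTF8Encoding constructor controls whether StreamWriter emits it. The class and method names are made up for illustration.

```csharp
using System;
using System.IO;
using System.Text;

static class Utf8SignatureDemo
{
    // True when the file starts with the UTF-8 signature EF BB BF.
    public static bool HasUtf8Signature(string path)
    {
        byte[] bytes = File.ReadAllBytes(path);
        return bytes.Length >= 3
            && bytes[0] == 0xEF && bytes[1] == 0xBB && bytes[2] == 0xBF;
    }

    // Writes text as UTF-8, with or without the signature.
    public static void Write(string path, string text, bool withSignature)
    {
        // UTF8Encoding(bool) decides whether the byte order mark is emitted.
        Encoding utf8 = new UTF8Encoding(withSignature);
        using (StreamWriter writer = new StreamWriter(path, false, utf8))
        {
            writer.Write(text);
        }
    }

    static void Main()
    {
        string path = Path.GetTempFileName();

        Write(path, "SUMMARY:M\u00E5l", true);
        Console.WriteLine(HasUtf8Signature(path)); // True

        Write(path, "SUMMARY:M\u00E5l", false);
        Console.WriteLine(HasUtf8Signature(path)); // False

        File.Delete(path);
    }
}
```

The two-argument StreamWriter(sFileName, false) constructor used above behaves like the second call: UTF-8 with no signature.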

I might dig into the vCalendar stuff again, but the deadlines on my project
say I can't right now.. Barely made it by the time I had planned ;)

And again, tnx for the help :D

L-E
 
Lars-Erik Aabech said:
Hehe.. I'm getting confused by all the issues with text :)


Standard characters (below 128) were rendered, the rest were not..

Right.
I was obviously assuming too much again. I opened the files in Visual Studio
and looked at the selected encoding/codepage for each file in the
File->Advanced Save Options dialog, it said Western European codepage 1252,
and I assumed that was the same as iso-8859-1 with codepage 1252.. I used
GetEncoding("iso-8859-1") to generate the files, so I assumed they were the
same. :)

Nope. 8859-1 is the same as 1252 apart from between 128 and 159,
where they differ.
Umm.. since you cleared up that very influential encoding/codepage
point for me, yet another light appeared over the text mystery.. (I hope) Would
that mean the encoding UTF-8 IS the codepage 65001!?!?

According to http://www.sharmahd.com/tm/codepages.html, you're right.
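
The code page numbers in this exchange can also be checked from code. A sketch, assuming .NET Core or .NET 5+ (where code page 1252 needs CodePagesEncodingProvider registered); the class and helper names are made up. It also shows one of the positions where ISO-8859-1 and 1252 disagree.

```csharp
using System;
using System.Text;

static class CodePageNumbers
{
    // Decodes a single byte and returns the Unicode code point it maps to.
    public static int Decode(byte value, Encoding encoding)
    {
        return encoding.GetString(new[] { value })[0];
    }

    static void Main()
    {
        // Assumption: .NET Core / .NET 5+, where 1252 needs this provider.
        Encoding.RegisterProvider(CodePagesEncodingProvider.Instance);
        Encoding latin1 = Encoding.GetEncoding("iso-8859-1");
        Encoding cp1252 = Encoding.GetEncoding(1252);

        Console.WriteLine(Encoding.UTF8.CodePage); // 65001
        Console.WriteLine(latin1.CodePage);        // 28591
        Console.WriteLine(cp1252.CodePage);        // 1252

        // Byte 0x80: a control character in ISO-8859-1, the euro sign in 1252.
        Console.WriteLine(Decode(0x80, latin1).ToString("X4")); // 0080
        Console.WriteLine(Decode(0x80, cp1252).ToString("X4")); // 20AC

        // Byte 0xE5 (å) means the same thing in both.
        Console.WriteLine(Decode(0xE5, latin1) == Decode(0xE5, cp1252)); // True
    }
}
```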
I still haven't read that article of yours, but I will!

Hope it helps.
I did, it didn't :/

And to answer your question from my other post, I applied the same method to
look at the encoding of the files that worked, and VS displayed either
Unicode (UTF-8 with signature) codepage 65001, or Unicode (UTF-8 without
signature) codepage 65001. I didn't notice the signature difference at
first, so I checked the files with a hex editor, and noticed the first three
bytes of the file that didn't work - a signature :)
The resulting code was
StreamWriter sw = new StreamWriter(sFileName, false);
et voilà :) no signature, UTF-8, and all special chars are displayed. (after
changing extension to iCalendar of course)
Great!

I might dig into the vCalendar stuff again, but the deadlines on my project
say I can't right now.. Barely made it by the time I had planned ;)

:)
 