Filename Encoding Help

  • Thread starter: Adhal
Hello,
On Vista & XP, I want to store filenames in a text file. What encoding
should I use?

UTF16 (Encoding.Unicode)
OR
UTF32 (Encoding.UTF32).

I think UTF16 is enough? Anyway that is what I am currently using.

One other question. If I have a Japanese system and again I want to store
the filenames in a text file, but this time not Unicode, should I use:

A) ANSI (Encoding.Default)
B) ASCII (Encoding.ASCII)

My understanding is that I should use Encoding.Default as it is set
according to the system. I really don't know the difference between the two
as they seem alike.

Thanks for taking the time to read this.
Adhal
 
Adhal said:
Hello,
On Vista & XP, I want to store filenames in a text file. What encoding
should I use?

UTF16 (Encoding.Unicode)
OR
UTF32 (Encoding.UTF32).

I think UTF16 is enough? Anyway that is what I am currently using.

You can use any Unicode encoding, like UTF-7, UTF-8, UTF-16 or UTF-32.

I suggest UTF-8: it's the most efficient for regular text, and it's the
default for all methods reading and writing text files in .NET.
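As a minimal sketch of that default (the output file name is just an example), writing file names with no explicit encoding produces UTF-8:

```csharp
using System.IO;

class WriteNames
{
    static void Main()
    {
        // Full paths come back as UTF-16 strings in memory.
        string[] names = Directory.GetFiles(".");

        // A StreamWriter constructed without an Encoding writes UTF-8,
        // so any Unicode file name is preserved on disk.
        using (StreamWriter w = new StreamWriter("filenames.txt"))
        {
            foreach (string name in names)
                w.WriteLine(name);
        }
    }
}
```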
One other question. If I have a Japanese system and again I want to
store the filenames in a text file, but this time not Unicode, should I use:

A) ANSI (Encoding.Default)
B) ASCII (Encoding.ASCII)

My understanding is that I should use Encoding.Default as it is set
according to the system. I really don't know the difference between the
two as they seem alike.

That depends on what characters the file names contain. The ASCII
encoding only handles characters with character codes from 0 to 127.
An ANSI code page handles all the characters in the ASCII character
set, plus more.

It also depends on what you are going to use the file for. Is there any
other program that will read the file?
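To make the difference concrete, here is a hedged sketch (the file name is invented). Note that Encoding.Default was the system ANSI code page on the .NET Framework of this thread's era; on modern .NET it is UTF-8 instead:

```csharp
using System;
using System.Text;

class AsciiVsAnsi
{
    static void Main()
    {
        string name = "résumé.txt";

        // ASCII covers only code points 0-127, so é is replaced with '?'.
        string viaAscii = Encoding.ASCII.GetString(Encoding.ASCII.GetBytes(name));
        Console.WriteLine(viaAscii);  // r?sum?.txt

        // The ANSI code page depends on the system: on a Western-European
        // machine (cp1252) é round-trips intact; on a Japanese machine
        // (cp932) it would not.
        string viaAnsi = Encoding.Default.GetString(Encoding.Default.GetBytes(name));
        Console.WriteLine(viaAnsi);
    }
}
```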
 
Thanks Göran,
You can use any Unicode encoding, like UTF-7, UTF-8, UTF-16 or UTF-32.

I suggest UTF-8: it's the most efficient for regular text, and it's the
default for all methods reading and writing text files in .NET.

Basically this program stores filenames and other file details, and it is
going to be used only on Windows XP and Vista. I want to support all
languages that the filenames are capable of.

The problem is I am almost certain that Windows XP stores filenames in
UTF-16, but I am not sure what Windows Vista does. I don't want to use UTF-32
if I don't need it, as it increases the file size unnecessarily. :-?
That depends on what characters the file names contain. The ASCII
encoding only handles characters with character codes from 0 to 127. An
ANSI code page handles all the characters in the ASCII character set, plus more.

It also depends on what you are going to use the file for. Is there any
other program that will read the file?

OK, this one is a bit more puzzling for me. Again it is a program that stores
file names on Windows XP and Windows Vista into a text file; however, the
file can be opened on Windows 9x/Me.

Giving you an example is the best approach. I have Windows XP (Japanese) and I
store the filenames in an ANSI text file. Now I take the file and open it up
in Windows 98 (Japanese). I would expect the text file to open fine and all
the characters to appear fine. Am I right in my thinking?

My understanding is this works if I store it as ANSI but not as ASCII. I
haven't got Japanese Windows 98 to test this out. :(

Appreciate the help :)
 
Thanks Pete,
Great advice.
The difference between ANSI and ASCII should be negligible with respect
to dealing with MBCS or Unicode, since neither of the latter can be
encoded in the former.

Pete

The second is not Unicode but the system-dependent ANSI/ASCII. Japanese Windows 9x Notepad should be able
to open all text files written in ANSI. My thinking is that if I save it as ANSI it should be fine, but
ASCII would fail. I do not really know; I am just guessing here.
 
UTF16 (Encoding.Unicode)
OR
UTF32 (Encoding.UTF32).

UTF-8, UTF-16, and UTF-32 are all equivalent; they all cover the same range.
UTF-32 is very unusual for storage.
UTF-8 is in general recommended for storage, and it is a good option for
cross-platform use. But if you only have to read/write it on Windows, UTF-16
is the best fit. All Windows Unicode APIs and the .NET API use UTF-16LE
(Little Endian).


I think UTF16 is enough? Anyway that is what I am currently using.

It's enough.


One other question. If I have a Japanese system and again I want to store
the filenames in a text file, but this time not Unicode, should I use:

A) ANSI (Encoding.Default)
B) ASCII (Encoding.ASCII)

You can use UTF-16 for Japanese without any problem, unless you
have to run on Win 95/98/Me.

Even then, if the file is only read/written by your application,
you can convert to/from Shift-JIS (cp932), the code page used
for Japanese Windows on Win 9x.

.NET always uses UTF-16 for processing, no matter if it runs on Japanese,
Chinese, Russian or English systems.
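A small sketch of that fit (file name invented): Encoding.Unicode in .NET is exactly UTF-16LE, and the BOM it writes lets readers detect the encoding automatically:

```csharp
using System;
using System.IO;
using System.Text;

class Utf16RoundTrip
{
    static void Main()
    {
        string name = "日本語ファイル.txt";

        // Encoding.Unicode is UTF-16LE with a BOM, the native Windows form.
        File.WriteAllText("names.txt", name, Encoding.Unicode);

        // ReadAllText detects the BOM and decodes back to the same string.
        string back = File.ReadAllText("names.txt");
        Console.WriteLine(back == name);  // True
    }
}
```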
 
You can use UTF-16 for Japanese without any problem, unless you
have to run on Win 95/98/Me.

Even then, if the file is only read/written by your application,
you can convert to/from Shift-JIS (cp932), the code page used
for Japanese Windows on Win 9x.

.NET always uses UTF-16 for processing, no matter if it runs on Japanese,
Chinese, Russian or English systems.


Thanks for the info. I think I caused a bit of confusion. There are two output formats that are
needed: one in Unicode, the other in either ANSI or ASCII.

The ANSI file output is not going to be opened with my program if it goes on Windows 9x, but with
Notepad. So it can't be Unicode.

My understanding is that when saving in ANSI, it stores the text according to the system regional
settings, which seems like the best option. So if I have Windows XP Japanese and store the file as
ANSI, I would think I would not have any issue when I open it up in Windows 98 Japanese. The
characters should all be correct (I think).

:)
 
Mihai said:
All Windows Unicode API and .NET API use UTF-16LE
(Little Endian)

.NET uses UTF-16 for strings in memory, but all methods handling text
files use UTF-8 by default.
It's enough.

That's somewhat misleading, as all Unicode encodings (UTF-7, UTF-8,
UTF-16 and UTF-32) support all Unicode characters, so they are all "enough".
 
That's somewhat misleading, as all Unicode encodings (UTF-7, UTF-8,
UTF-16 and UTF-32) support all Unicode characters, so they are all
"enough".

Wasn't that my initial statement?
"UTF-8, UTF-16, and UTF-32 are all equivalent, they all cover
the same range."

I did not feel like repeating the same thing 3 times in the same post :-)
 
Thanks for the info. I think I caused a bit of confusion. There are two
output formats that are needed: one in Unicode, the other in either ANSI or ASCII.

I will clarify the lingo a bit, so that we make sure we talk about the same
thing:

ASCII = the characters from 0 to 127
There are no accented characters there, so no support for Japanese, Chinese,
Russian or anything else.

ANSI = "the default system code page" or, as it is called in the Windows XP UI,
"language for non-Unicode programs"
That is not a fixed code page; it changes depending on the OS.
On a Japanese Win95 the ANSI code page is 932, on a Russian Win95 it will
be 1251, and so on.

So, it is possible to save Japanese in ANSI code page on a Japanese system.

The ANSI file output is not going to be opened with my program if it goes
on windows 9x but with notepad. So it can't be Unicode.
Right.


My understanding is when saving in ANSI it stores it according to system
regional settings. Which seems like best option.
So if I have Windows XP Japanese and store the file as ANSI, I would
think I would not have any issue when I open it up in Windows 98 Japanese.

Yes, you can have problems. XP is Unicode. Nobody prevents me from
using Thai file names (for instance), which are lost when converted to 932.
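That loss is easy to demonstrate; a sketch with invented file names (on modern .NET, code page 932 additionally requires Encoding.RegisterProvider(CodePagesEncodingProvider.Instance) first):

```csharp
using System;
using System.Text;

class Cp932Loss
{
    static void Main()
    {
        // cp932 (Shift-JIS) is the "ANSI" code page of Japanese Windows.
        Encoding sjis = Encoding.GetEncoding(932);

        string japanese = "ファイル.txt";  // representable in cp932
        string thai = "ไฟล์.txt";          // Thai letters are not in cp932

        Console.WriteLine(sjis.GetString(sjis.GetBytes(japanese)) == japanese);  // True
        Console.WriteLine(sjis.GetString(sjis.GetBytes(thai)) == thai);          // False
    }
}
```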

Going beyond that, the Japanese language (and Chinese, and others) uses
characters that are not present in the ANSI code page.
People moved from the (crappy) OEM code pages (used in DOS) to ISO-based
code pages (Windows) to Unicode. Going back will lose info.
No way around that.

You should try to balance the trouble you have to go through
in order to support Win 9x against the benefits. Staying Unicode all the
way is definitely easier.

Sure, it might be worth the trouble. Your call.
 
Neither ASCII nor ANSI will support Japanese characters, but other
pre-Unicode character encodings may. If you have a specific character
encoding (presumably one of the common Japanese-enabled MBCS encodings)
in mind, then that's the encoding you need to emit as your alternative
to a Unicode encoding.

Pete

Thanks Pete.

I think I may need to read about this more.
 
That's somewhat misleading, as all unicode encodings (UTF-7, UTF-8,
UTF-16 and UTF-32) support all unicode characters, so they are all
"enough".


Thanks Göran,

You are right. It is just that I feel kind of more assured, as if bigger is better. It is inane, and I
think I might change the encoding output to UTF-8.
 
Thanks Mihai, you really cleared up a lot of the confusion that I had.
Appreciate it.

ANSI = "the default system code page" or, as it is called in the Windows XP UI,
"language for non-Unicode programs"
That is not a fixed code page; it changes depending on the OS.
On a Japanese Win95 the ANSI code page is 932, on a Russian Win95 it will
be 1251, and so on.

Code page... I guess .NET (C#, to be more specific) doesn't have the ability to let me open a file
in a specific code page(?)

I have to use the API to get around it.
You should try to balance the trouble you have to go through
in order to support Win 9x against the benefits. Staying Unicode all the
way is definitely easier.

Sure, it might be worth the trouble. Your call.

I know. I am for 100% Unicode but can't help it, as I have to support an older output format.
 
You may be confusing the issue with your term "ANSI". I've ignored the
usual caveat, because I made the assumption that you understood that
there's not really any such thing as "ANSI" encoding.

It is not my term.
It is what Microsoft and MSDN call the "ANSI code page".
If you want to understand the documentation, that's the meaning.
Look for stuff like GetACP: "Retrieves the current Windows ANSI code page
identifier for the system."

True, it is not "a code page approved by ANSI
(the American National Standards Institute)".


On the other hand, from your most recent reply, it seems as though perhaps
you actually are referring to some particular MBCS encoding (in this case,
presumably one that supports Japanese characters).

In some cases what MS/MSDN calls the ANSI code page might indeed be an MBCS.


Neither ASCII nor ANSI will support Japanese characters,

In the above "definition" of ANSI it will support Japanese.
And since there is no official ANSI code page...
I prefer to use "default system code page" instead of ANSI.
A Japanese system (or any other system with "language for
non-Unicode programs" set to Japanese) will report 932 (an MBCS)
as the "ANSI" code page.
(http://www.mihai-nita.net/article.php?artID=20050611a)


If you have a specific character
encoding (presumably one of the common Japanese-enabled MBCS encodings) in
mind, then that's the encoding you need to emit as your alternative to a
Unicode encoding.

No.

If you want to emit something that the Win 9x Notepad will understand,
then the only option is what the MS documentation calls the "ANSI code page".
It does not matter what you have in mind, and it cannot be
"one of the common Japanese-enabled MBCS encodings".
It can *only* be Shift-JIS (cp932), as returned by GetACP on a Japanese Win 9x.

I was just warning that this so-called code page differs between systems
(Notepad on a Japanese system will not be able to read Korean documents,
but Notepad on a Korean system will).
 
Codepage, I guess .NET (C# to be more specific) doesn't have the ability to
allow me to open a file in a specific codepage(?)

You get an Encoding with Encoding.GetEncoding for an arbitrary encoding.
But for the ANSI code page (understood as "current system code page")
you can use Encoding.Default directly (the MSDN description is
"An encoding for the system's current ANSI code page.").

Then you can use it with a StreamReader/StreamWriter
(create a FileStream, then a BufferedStream based on that, then, finally,
a StreamReader/StreamWriter)

Something like this:

FileStream fso = new FileStream(fileName, FileMode.Create, FileAccess.Write);
BufferedStream bso = new BufferedStream(fso);
StreamWriter swo = new StreamWriter(bso, Encoding.Default);

Warning: I did not spend the time to compile this, but that's the idea
(of course, add error/exception handling, close the writer when done, etc.)
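The reading side is symmetric; a sketch along the same lines (the file name and code page 932 are examples), using Encoding.GetEncoding when the code page is known rather than Encoding.Default:

```csharp
using System;
using System.IO;
using System.Text;

class ReadAnsi
{
    static void Main()
    {
        // Open a file written in a known ANSI code page, e.g. 932 (Shift-JIS).
        // The StreamReader converts each line to a UTF-16 string in memory.
        using (FileStream fsi = new FileStream("names.txt", FileMode.Open, FileAccess.Read))
        using (StreamReader sri = new StreamReader(fsi, Encoding.GetEncoding(932)))
        {
            string line;
            while ((line = sri.ReadLine()) != null)
                Console.WriteLine(line);
        }
    }
}
```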

I know. I am for 100% Unicode but can't help it, as I have to
support an older output format.

Well, it happens.
Idea: if acceptable for the user, write at the top of the file some
info on the code page used for the file.
(the way xml (optionally) does, with <?xml version="1.0" encoding="blah"?>)
You can even add something user friendly, but also easy to parse
(For instance:
Required code page: 932
If this file is opened on a Windows system with a mismatching system code page,
its content might look corrupted and is most likely unintelligible.
)
This way, you can parse the first line and get the code page.
You can then use that (and Encoding.GetEncoding) to convert to Unicode.
So you will be able to use a Japanese-encoded file even on a Korean
(Russian, Greek, English, etc.) system, even if it looks bad in Notepad.
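A sketch of that header idea (the helper name and the exact header text are invented for illustration): read the first line, extract the code page number, and hand it to Encoding.GetEncoding:

```csharp
using System;
using System.IO;
using System.Text;

class CodePageHeader
{
    // Hypothetical helper: detect the code page declared on the first line,
    // falling back to the system default when no header is present.
    public static Encoding DetectEncoding(string path)
    {
        const string prefix = "Required code page: ";

        // The header line uses only ASCII, which every ANSI code page shares,
        // so it can be read safely before the real encoding is known.
        using (StreamReader probe = new StreamReader(path, Encoding.ASCII))
        {
            string first = probe.ReadLine();
            if (first != null && first.StartsWith(prefix))
                return Encoding.GetEncoding(int.Parse(first.Substring(prefix.Length)));
        }
        return Encoding.Default;
    }

    static void Main()
    {
        File.WriteAllLines("names.txt", new[] { "Required code page: 932" });
        Console.WriteLine(DetectEncoding("names.txt").CodePage);  // 932
    }
}
```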
 
Well, it happens.
Idea: if acceptable for the user, write at the top of the file some
info on the code page used for the file.
(the way xml (optionally) does, with <?xml version="1.0" encoding="blah"?>)
You can even add something user friendly, but also easy to parse
(For instance:
Required code page: 932
If this file is opened on a Windows system with a mismatching system code page,
its content might look corrupted and is most likely unintelligible.
)
This way, you can parse the first line and get the code page.
You can then use that (and Encoding.GetEncoding) to convert to Unicode.
So you will be able to use a Japanese-encoded file even on a Korean
(Russian, Greek, English, etc.) system, even if it looks bad in Notepad.


Would have never crossed my mind. Simple solution.
Thanks for that. :)
 