Unicode character in non-unicode text file

  • Thread starter: Guest

Guest

(this is a follow-on message to one posted yesterday)

I'm trying to reproduce the capabilities in both Notepad and Excel, whereby
a Unicode text file with Unicode characters can be converted to ANSI, while
still preserving the unicode characters within.

Specifically, I'm using the Unicode character U+2022, which is a largish
bullet.

I've tried this:

_writer = new System.IO.StreamWriter( file, false, new UnicodeEncoding());

which creates the text file just fine with regards to the bullet character,
but in Unicode format with BOM and all. When I save this file to ANSI with
Excel or Notepad, life is good.

BUT, when I try this:

_writer = new System.IO.StreamWriter( file, false, new UnicodeEncoding(false, false));

The BOM is gone (good) but the bullet character gets converted into another
character.

What magic does Notepad/Excel use to preserve the character but lose the BOM?
 
Yes, of course I'm very familiar with code pages. The question really isn't
about Notepad or Excel... I'm simply using them as examples to show that this
can be done (create a non-Unicode file with a Unicode character preserved)
*somehow*.

I'm just trying to understand what Notepad does (assuming the English locale,
1033) when it saves the Unicode file as ANSI while still managing to preserve
the bullet character.
 
dbaldi said:
(this is a follow-on message to one posted yesterday)

I'm trying to reproduce the capabilities in both Notepad and Excel, whereby
a Unicode text file with Unicode characters can be converted to ANSI, while
still preserving the unicode characters within.

You can't. Excel and Notepad may happen to cope with the example you've
given, but I strongly suspect they don't do so reliably. They can't
possibly make every character in the appropriate ANSI encoding
available and also cope with other characters unless they use escaping or
the like, which would confuse other applications.
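
Editor's note: the limit described above can be checked in code. The following is a minimal sketch, assuming the target ANSI code page is Windows-1252; the class and method names are hypothetical. It asks .NET to throw instead of silently substituting a character when the text does not fit the code page. On .NET Core and .NET 5+ the code page provider has to be registered first; on the classic .NET Framework code page 1252 is available without that call.

```csharp
using System.Text;

public static class AnsiCheck
{
    // Returns true when every character of 'text' has a byte in Windows-1252.
    // Assumption: 1252 is the ANSI code page you care about.
    public static bool FitsInWindows1252(string text)
    {
        // Code pages are opt-in on .NET Core / .NET 5+.
        Encoding.RegisterProvider(CodePagesEncodingProvider.Instance);

        // Exception fallbacks make unencodable characters fail loudly
        // instead of being replaced with '?' or a best-fit character.
        Encoding strict = Encoding.GetEncoding(
            1252,
            EncoderFallback.ExceptionFallback,
            DecoderFallback.ExceptionFallback);
        try
        {
            strict.GetBytes(text);
            return true;
        }
        catch (EncoderFallbackException)
        {
            return false;
        }
    }
}
```

The bullet U+2022 passes this check because Windows-1252 has a slot for it; a character such as U+4E2D does not, which is why the approach cannot work for arbitrary Unicode text.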
 
Yes, I understand it's an impossible generic solution. But my problem scope
does not go beyond this one single character. This bullet is the only
non-ASCII character the system needs to support. I'm trying to explain to my
client why Excel can do it but I can't using System.IO (or find a way to
make it work, of course).

I'm trying to understand how Excel does this:

1) Open a Unicode file with this bullet character
2) Save As... ANSI file in Excel - the bullet is still there, but Excel still
recognizes the file as ANSI
 
dbaldi said:
Yes, I understand it's an impossible generic solution. But my problem scope
does not go beyond this one single character. This bullet is the only
non-ASCII character the system needs to support. I'm trying to explain to my
client why Excel can do it but I can't using System.IO (or find a way to
make it work, of course).

I'm trying to understand how Excel does this:

1) Open a Unicode file with this bullet character
2) Save As... ANSI file in Excel - the bullet is still there, but Excel still
recognizes the file as ANSI

I suggest you examine the file with a hex editor, and see what byte
it's put there, and what character it should be in whichever ANSI code
page you're using.
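
Editor's note: the same inspection can be done without a hex editor. This is a small sketch with hypothetical names, assuming Windows-1252 is the ANSI code page in question. It prints the bytes each encoding produces for the bullet. `UnicodeEncoding(false, false)` is still UTF-16, only without the BOM, so the bullet becomes the two bytes 0x22 0x20, which an ANSI reader shows as a double quote followed by a space. In Windows-1252 the bullet is the single byte 0x95.

```csharp
using System;
using System.Text;

public static class BulletBytes
{
    // Returns the encoded bytes as hyphen-separated hex pairs.
    public static string Hex(string text, Encoding encoding)
    {
        return BitConverter.ToString(encoding.GetBytes(text));
    }

    // Assumption: Windows-1252 is the ANSI code page being targeted.
    public static Encoding Windows1252()
    {
        // Code pages are opt-in on .NET Core / .NET 5+.
        Encoding.RegisterProvider(CodePagesEncodingProvider.Instance);
        return Encoding.GetEncoding(1252);
    }
}
```

For example, `BulletBytes.Hex("\u2022", new UnicodeEncoding(false, false))` gives `22-20`, while `BulletBytes.Hex("\u2022", BulletBytes.Windows1252())` gives `95`.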
 
dbaldi said:
I'm trying to reproduce the capabilities in both Notepad and Excel, whereby
a Unicode text file with Unicode characters can be converted to ANSI, while
still preserving the unicode characters within.

Question 1: which "ANSI"?
On Windows, "ANSI" means the default system code page: 932 for
Japanese, 1251 for Russian, and so on.

If by ANSI you mean the Western European code page (Windows-1252, often
loosely called Latin 1), then you should have no problem: U+2022 maps to 0x95.

dbaldi said:
_writer = new System.IO.StreamWriter( file, false, new UnicodeEncoding());
which creates the text file just fine with regards to the bullet character,
but in Unicode format with BOM and all.

That is expected. You ask for UnicodeEncoding, and you get Unicode (UTF-16)
encoding. Try passing System.Text.Encoding.Default to the StreamWriter instead.

dbaldi said:
What magic does Notepad/Excel use to preserve the character
but lose the BOM?

There is no magic. ANSI does not have a BOM, and the U+2022 bullet maps to
a character that exists in 1252 (which I guess is your default system locale).
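
Editor's note: putting the advice above into code, here is a minimal sketch with hypothetical class and method names. It names code page 1252 explicitly instead of using `Encoding.Default`, on the assumption that Windows-1252 is the wanted target; `Encoding.Default` follows the system ANSI code page on the .NET Framework but is UTF-8 on .NET Core and .NET 5+, so the explicit form behaves the same everywhere.

```csharp
using System.IO;
using System.Text;

public static class AnsiWriter
{
    // Writes 'text' to 'path' as Windows-1252 bytes, with no BOM.
    // Assumption: every character in 'text' exists in code page 1252.
    public static void WriteAnsi(string path, string text)
    {
        // Code pages are opt-in on .NET Core / .NET 5+.
        Encoding.RegisterProvider(CodePagesEncodingProvider.Instance);
        Encoding ansi = Encoding.GetEncoding(1252);

        // A single-byte code page has no preamble, so no BOM is written.
        using (var writer = new StreamWriter(path, false, ansi))
        {
            writer.Write(text);
        }
    }
}
```

Calling `AnsiWriter.WriteAnsi(path, "\u2022 item")` produces a six-byte file whose first byte is 0x95, which Notepad and Excel treat as an ANSI file containing a bullet on a system whose ANSI code page is 1252.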
 