I've been bitten by this before. Encoding.Default uses the operating system's current ANSI code page, so the same bytes can decode differently on different machines.
So which code points look right in both CP-1252 and UTF-8 but
not in CP-12xx with xx != 52?
I don't know without doing more research than I'm willing to do right now.
But what you could do (what I would do) is regenerate the file twice: first,
read it with your original Encoding.Default, write it out in the new
encoding, and keep a copy. Then do it again, but read it as UTF-8 and write
it out in the new encoding. Compare the two output files at the binary level
and see which bytes differ. Then you'll know which characters were
different (if it wasn't blatantly obvious already).
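The two-pass re-encode-and-compare idea can be sketched roughly like this (Python standing in for the .NET code under discussion; file names and the choice of UTF-16 as the target encoding are just illustrative):

```python
def reencode(src_path, dst_path, src_encoding, dst_encoding="utf-16"):
    """Read src_path assuming src_encoding and write it out in dst_encoding."""
    with open(src_path, "r", encoding=src_encoding) as f:
        text = f.read()
    with open(dst_path, "w", encoding=dst_encoding, newline="") as f:
        f.write(text)

def first_differences(path_a, path_b, limit=10):
    """Return (offset, byte_a, byte_b) for the first differing bytes."""
    with open(path_a, "rb") as fa, open(path_b, "rb") as fb:
        a, b = fa.read(), fb.read()
    diffs = [(i, a[i], b[i])
             for i in range(min(len(a), len(b))) if a[i] != b[i]]
    return diffs[:limit]

# Usage sketch: convert once per candidate source encoding, then diff.
# reencode("input.txt", "via_cp1252.txt", "cp1252")
# reencode("input.txt", "via_utf8.txt", "utf-8")
# print(first_differences("via_cp1252.txt", "via_utf8.txt"))
```

Any offsets that come back from the diff point straight at the characters the two interpretations disagree on.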
In my case, I have no idea what his default OS-level encoding was, but the
only characters that were encoded incorrectly when I converted to UTF-8 were
the angled single and double quotes, as well as © (copyright) and ™
(trademark). No other characters were different. The © (copyright
circle-C) came out looking like an A with a curve over it and two dots. I
fixed it by reading the file in as UTF-8 and writing it out as UTF-16. But if
I took his binary and ran it on my machine, it produced the correct CP-1252
(windows-1252) and I didn't have that problem.
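That "A with a curve" artifact is the classic sign of UTF-8 bytes being decoded as CP-1252: © is the two-byte sequence 0xC2 0xA9 in UTF-8, and CP-1252 renders those as two separate characters (Â followed by ©). A quick check, in Python just to illustrate:

```python
# "©" encoded as UTF-8 is two bytes; decoding them as CP-1252
# splits them into two visible characters.
raw = "©".encode("utf-8")       # b'\xc2\xa9'
print(raw.decode("cp1252"))     # Â© — the mojibake described above
print(raw.decode("utf-8"))      # © — decoded correctly
```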
All I'm saying is the differences are subtle, and may only manifest when
using characters above 127 (such as the angled quotes and other special
non-ASCII characters). To see what those would be on your workstation, I'd
do a binary compare of the two outputs, identify the incorrect characters,
and then try to determine which code page they belong to, and thus which
encoding.
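Once the binary compare has flagged a suspect byte, one way to narrow down the code page is to see what character each candidate Windows-125x code page assigns to that byte value (the code-page list here is just a sample, not exhaustive):

```python
def candidates(byte_value):
    """Show what each candidate code page makes of a single byte value."""
    results = {}
    for cp in ("cp1250", "cp1251", "cp1252", "cp1253", "cp1254"):
        try:
            results[cp] = bytes([byte_value]).decode(cp)
        except UnicodeDecodeError:
            results[cp] = None  # byte is undefined in that code page
    return results

# Usage sketch: 0xA9 is the suspect byte from the binary compare.
print(candidates(0xA9))
```

A byte that decodes to the character you expected under exactly one code page is strong evidence for that encoding; a byte that's undefined in some code pages rules those out.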