Read UTF8 (mixed byte) file & convert to Unicode

Guest · Apr 7, 2005

I have a file which has no BOM and contains mostly single byte chars. There
are numerous double byte chars (Japanese) which appear throughout. I need to
take the resulting Unicode and store it in a DB and display it onscreen. No
matter which way I open the file, convert it to Unicode/leave it as is or
what ever, I see all single bytes ok, but double bytes become 2 seperate
single bytes. Surely there is an easy way to convert these mixed bytes to
Unicode? Below is 2 (of many) attempts at doing the conversion. I was
expecting that Encoding.Convert would be able to do this. My HTML charset,
session codepage, locale, thread culture are all set correctly for Japanese.
(reading Japanese from a unicode file works).

Attempt 1:
Fs = New FileStream(Page.MapPath("/mixed_byte-jp.html"), FileMode.Open,
FileAccess.Read, FileShare.None)
Dim bytUTF8(Fs.Length) As Byte
Fs.Read(bytUTF8, 0, bytUTF8.Length)
bytUni = Encoding.Convert(Encoding.UTF8, Encoding.Unicode, bytUTF8)
Response.Write(Encoding.Unicode.GetString(bytUni))

Attempt 2:
reader = New System.IO.StreamReader(Page.MapPath("/mixed_byte-jp.html"),
System.Text.Encoding.UTF8, True)
bytUTF8 = System.Text.Encoding.UTF8.GetBytes(reader.ReadToEnd())
bytUni = Encoding.Convert(Encoding.UTF8, Encoding.Unicode, bytUTF8)
lblMessage.Text = Encoding.Unicode.GetString(bytUni)

In ASP3 I had to pass the text through ADO to do the conversion which was
very ugly to do - surely that is not required now?
Thanks very much,
Hunter

Jon Skeet [C# MVP] · Apr 7, 2005

<"=?Utf-8?B?aHVudGVyYg==?=" <Hunter

I have a file which has no BOM and contains mostly single byte chars. There
are numerous double byte chars (Japanese) which appear throughout. I need to
take the resulting Unicode and store it in a DB and display it onscreen. No
matter which way I open the file, convert it to Unicode/leave it as is or
what ever, I see all single bytes ok, but double bytes become 2 seperate
single bytes. Surely there is an easy way to convert these mixed bytes to
Unicode? Below is 2 (of many) attempts at doing the conversion. I was
expecting that Encoding.Convert would be able to do this. My HTML charset,
session codepage, locale, thread culture are all set correctly for Japanese.
(reading Japanese from a unicode file works).

Attempt 1:
Fs = New FileStream(Page.MapPath("/mixed_byte-jp.html"), FileMode.Open,
FileAccess.Read, FileShare.None)
Dim bytUTF8(Fs.Length) As Byte
Fs.Read(bytUTF8, 0, bytUTF8.Length)
bytUni = Encoding.Convert(Encoding.UTF8, Encoding.Unicode, bytUTF8)
Response.Write(Encoding.Unicode.GetString(bytUni))

Attempt 2:
reader = New System.IO.StreamReader(Page.MapPath("/mixed_byte-jp.html"),
System.Text.Encoding.UTF8, True)
bytUTF8 = System.Text.Encoding.UTF8.GetBytes(reader.ReadToEnd())
bytUni = Encoding.Convert(Encoding.UTF8, Encoding.Unicode, bytUTF8)
lblMessage.Text = Encoding.Unicode.GetString(bytUni)

In ASP3 I had to pass the text through ADO to do the conversion which was
very ugly to do - surely that is not required now?

No. Your first problem is that you're reading the text in assuming it's
UTF-8, then converting it *back* to UTF-8 bytes, then treating those
bytes as if they were UTF-16 (Unicode) bytes. There's no need to
convert them into bytes again - reader.ReadToEnd() is giving you a
string, so just use that string!

Now, that assumes that the file is *actually* in UTF-8. In my
experience Japanese characters come out as 3 bytes in UTF-8, so you may
actually have a Shift-JIS file instead.

You should not that your first attempt doesn't guarantee to read the
whole file, by the way - see
http://www.pobox.com/~skeet/csharp/readbinary.html

For more information about Unicode issues, see
http://www.pobox.com/~skeet/csharp/unicode.html
http://www.pobox.com/~skeet/csharp/debuggingunicode.html

MasterGaurav · Apr 7, 2005

Is the HTML file really a Unicode file? Open the HTML file in notepad
and do a "Save As". What do you see in the "Encoding" option at the end
of the "Save As" dialog? Is it Unicode?

--
CHeers,
Gaurav Vaish
http://mastergaurav.org
http://mastergaurav.blogspot.com
--------------------------

Guest · Apr 8, 2005

I got it working after a bit. I big thankyou Jon.
Reconising the file encoding was most of the battle. It wasn't UTF8 at all -
the japanese chars were always 2 bytes, not 3. When i started thinking about
shift_jis the solution was obvious - in hindsight.

Instead of using: Encoding.UTF8
I used: Encoding.GetEncoding("shift_jis")

Thanks!

Encoding UTF8 to / from Unicode	2	Jul 19, 2005
MultiByteToWideChar in .NET - Multibyte to Unicode conversion	2	Nov 16, 2005
Convert OEM to Unicode	4	Dec 18, 2006
Convert large XML file to UTF-8	5	Aug 5, 2006
Converting byte array to Unicode string in C#	1	Feb 13, 2006
FWT Newsletter - Weekly - November 10, 2003	3	Nov 10, 2003

Read UTF8 (mixed byte) file & convert to Unicode

Guest

Jon Skeet [C# MVP]

MasterGaurav

Guest

Ask a Question

Similar Threads