Writing UTF-8 html files to disk (is this a bug?)


Guest

Hi,
I'm developing a web application that writes some "cache HTML files" to the filesystem.
But I ran into some problems when I used TextWriter txw = File.CreateText(thefile)... txw.Write(stringtowrite)... txw.Flush() and Close();
sometimes I got funny chars in my HTML...
After some testing I found that if I opened the file in Notepad and saved it, the funny chars disappeared...
More testing... a binary compare of the file before and after it was opened and saved in Notepad... there was a 3-byte "magic number" in the file that had been opened and saved in Notepad...
It was 0xef 0xbb 0xbf ???? Why not by default??

The workaround (this cannot be the right way):

BinaryWriter bw = new System.IO.BinaryWriter(System.IO.File.Create(filenameAndPath));
bw.Write(new byte[]{0xef,0xbb,0xbf});
bw.Write(sb.ToString().ToCharArray());
bw.Flush();
bw.Close();

This works, but IMO it's crappy code....
Anybody know the right way of doing just that???

Thanks in advance
Danny Hille
 
Danny Hille said:
I'm developing a web application that writes some "cache HTML files" to
the filesystem.
But I ran into some problems when I used TextWriter txw =
File.CreateText(thefile)... txw.Write(stringtowrite)... txw.Flush() and
Close();
sometimes I got funny chars in my HTML...

That suggests that your HTML isn't specifying the encoding of the file
properly.

After some testing I found that if I opened the file in Notepad and
saved it, the funny chars disappeared...
More testing... a binary compare of the file before and after it was
opened and saved in Notepad... there was a 3-byte "magic number" in the
file that had been opened and saved in Notepad...
It was 0xef 0xbb 0xbf ???? Why not by default??

This is called a byte order mark (BOM), and is only written in certain
circumstances. The BinaryWriter doesn't know that you want to write it
(because you didn't ask it to), so it doesn't include it.

Have a look at Encoding.GetPreamble() for a bit more information.
StreamWriter emits the preamble automatically if it's at the start of a
stream, provided the encoding it was given has one.
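
To see what Encoding.GetPreamble() returns, a small sketch along these lines can help. The class name PreambleDemo and the helper ToHex are made up for illustration; the encoding calls are the standard System.Text ones.

```csharp
using System;
using System.Text;

static class PreambleDemo
{
    // Hypothetical helper: formats a byte array as space-separated hex.
    public static string ToHex(byte[] bytes)
    {
        return BitConverter.ToString(bytes).Replace("-", " ");
    }

    static void Main()
    {
        // Encoding.UTF8 is configured to emit a byte order mark.
        Console.WriteLine(ToHex(Encoding.UTF8.GetPreamble()));            // EF BB BF

        // A UTF8Encoding constructed with 'false' has an empty preamble.
        Console.WriteLine(new UTF8Encoding(false).GetPreamble().Length);  // 0
    }
}
```

The three bytes printed first are the same "magic number" Notepad added to the file.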

The workaround (this cannot be the right way):

BinaryWriter bw = new System.IO.BinaryWriter(System.IO.File.Create(filenameAndPath));
bw.Write(new byte[]{0xef,0xbb,0xbf});
bw.Write(sb.ToString().ToCharArray());
bw.Flush();
bw.Close();

This works, but IMO it's crappy code....
Anybody know the right way of doing just that???

I'm not sure why you're using BinaryWriter at all, given that you're
never writing any binary. Why not just use:

// StreamWriter(path, append, encoding): false means overwrite, not append.
using (StreamWriter writer = new StreamWriter
(filenameAndPath, false, Encoding.UTF8))
{
    writer.Write(sb.ToString());
}

(Note the using statement, which automatically closes the writer
whether or not an exception is thrown.)

Note the explicit use of Encoding.UTF8 rather than just relying on
StreamWriter's default encoding, which is UTF-8 but without the preamble.
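
The difference between the two can be checked with a rough sketch like the one below. The class name BomCheck, the helper StartsWithUtf8Bom, the sample HTML string, and the temp files are all assumptions for illustration, not part of the original code.

```csharp
using System;
using System.IO;
using System.Text;

static class BomCheck
{
    // Hypothetical helper: true if the file starts with the bytes EF BB BF.
    public static bool StartsWithUtf8Bom(string path)
    {
        byte[] bytes = File.ReadAllBytes(path);
        return bytes.Length >= 3
            && bytes[0] == 0xEF && bytes[1] == 0xBB && bytes[2] == 0xBF;
    }

    static void Main()
    {
        // Sample content with non-ASCII characters, written as escapes.
        string html = "<html><body>\u00e6\u00f8\u00e5</body></html>";
        string plain = Path.GetTempFileName();
        string withBom = Path.GetTempFileName();

        // File.CreateText writes UTF-8 without a preamble.
        using (StreamWriter writer = File.CreateText(plain))
        {
            writer.Write(html);
        }

        // Passing Encoding.UTF8 explicitly makes StreamWriter emit the preamble.
        using (StreamWriter writer = new StreamWriter(withBom, false, Encoding.UTF8))
        {
            writer.Write(html);
        }

        Console.WriteLine(StartsWithUtf8Bom(plain));    // False
        Console.WriteLine(StartsWithUtf8Bom(withBom));  // True

        File.Delete(plain);
        File.Delete(withBom);
    }
}
```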
 
Hey Danny,

The extra bytes make up the Unicode preamble (the UTF-8 byte order mark). See the Encoding.GetPreamble method in the docs for more information. If you want to create a file using the UTF-8 encoding, you should create an instance of the UTF8Encoding class and pass it in when opening the file (for example to the StreamWriter constructor). One of the constructor overloads for this class has a boolean parameter specifying whether the encoder should emit the preamble:

http://msdn.microsoft.com/library/d...rlrfsystemtextutf8encodingclassctortopic2.asp
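
As a minimal sketch of that overload in use (the class name Utf8EncodingDemo, the helper WriteUtf8, and the temp file are hypothetical and only there for illustration):

```csharp
using System;
using System.IO;
using System.Text;

static class Utf8EncodingDemo
{
    // Hypothetical helper: writes 'text' to 'path' as UTF-8,
    // with or without the preamble, overwriting any existing file.
    public static void WriteUtf8(string path, string text, bool emitPreamble)
    {
        Encoding encoding = new UTF8Encoding(emitPreamble);
        using (StreamWriter writer = new StreamWriter(path, false, encoding))
        {
            writer.Write(text);
        }
    }

    static void Main()
    {
        string path = Path.GetTempFileName();

        WriteUtf8(path, "abc", true);
        Console.WriteLine(new FileInfo(path).Length);  // 6: three preamble bytes plus "abc"

        WriteUtf8(path, "abc", false);
        Console.WriteLine(new FileInfo(path).Length);  // 3: just "abc"

        File.Delete(path);
    }
}
```

Passing true here should give the same result as using Encoding.UTF8; passing false makes the choice to leave the preamble out explicit.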

I hope this answers your question.

Regards, Jakob.
 