Junk characters when using StreamReader and StreamWriter

Rob

Hi,
I have a VB.Net application that parses an HTML file. This file was an
MS Word document that was saved as web page.

My application removes all unnecessary code generated by MS Word and
does some custom formatting needed by my client.

I use a StreamReader to read the file, regular expressions to parse and
clean it up, and a StreamWriter to write the new file. On some of the
HTML files I parse, the character "á" shows up in a lot of places.
I've tried different encodings for the StreamReader and StreamWriter,
which sometimes works, but then the application fails when I parse a
French HTML file.

This is what I'm doing:

' Requires: Imports System.IO
Dim filename As String = myfileinfo.OriginalFileName
Dim sr As New StreamReader(filename, System.Text.Encoding.Default)

Dim textstream As String = sr.ReadToEnd()
' Close the stream; it isn't needed anymore
sr.Close()

' Send the text for cleaning
Dim newtext As String = CleanHTML(textstream)

Dim OutputFileName As String = myfileinfo.NewFolderPathToFile

' Kept on one line: a VB.Net statement cannot wrap without a line continuation
Dim fs As New FileStream(OutputFileName, FileMode.Create, FileAccess.Write)

Dim sw As New StreamWriter(fs, System.Text.Encoding.Default)

sw.WriteLine(newtext)
sw.Close()

These are the results:

<a href="#_Toc169072483"><span>99(15)0<span>á </span>INSTALMENT
PROGRAM</span></a></span></p>

<a href="#_Toc169072484"><span>99(15)1<span>á </span>OVERVIEW OF THE
INSTALMENT PROGRAM</span></a></span></p>

<p class="MsoNormal" ><span>áá </span>OP:<span>á
</span>BB<span>áááááááááááá </span>ACCT:<span>á </span>123456789
YR:<span>á </span>2004<span>áááááááááááááááááááááá </span>PG:<span>á
</span>1 of 1<span>áááááááááááááááááá </span>26FEB
2004 USER-ID</p>

I also have different characters representing quotes such as "ô" and "ö"
(not shown here).

When I do a "Save as Web Page" from MS Word 2000, the characters aren't
there, but they show up after running the file through my program.

I'm assuming they represent characters that it doesn't understand, but
does anyone know how to resolve this?


Thanks
Rob
 
Try skipping the call to CleanHTML to see if it still happens. If it does,
this is likely an encoding problem (Word may be saving the file with UTF-8
encoding). If not, it is something in your CleanHTML code...
 
á is just a single-byte character (extended ASCII, not 7-bit ASCII), and
Word just puts it into the HTML file without encoding.

Why would you want to use System.Text.Encoding.Default?

Then you do not know what you are getting. You might get something
to work - until you run the application somewhere else in the world.

If you write the data out without going through the CleanHTML
function, do you then get a file which byte for byte is identical to
the original?

Regards,

Joergen Bech
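
The byte-for-byte check suggested above could be sketched as follows. This is only an illustration: the module and function names are made up for the example, and File.ReadAllBytes assumes .NET Framework 2.0 or later.

```vbnet
Imports System
Imports System.IO

Module FileCompare
    ' Returns True when both files have the same length and identical bytes.
    Function FilesAreIdentical(ByVal pathA As String, ByVal pathB As String) As Boolean
        Dim a As Byte() = File.ReadAllBytes(pathA)
        Dim b As Byte() = File.ReadAllBytes(pathB)
        If a.Length <> b.Length Then Return False
        For i As Integer = 0 To a.Length - 1
            If a(i) <> b(i) Then Return False
        Next
        Return True
    End Function

    Sub Main(ByVal args As String())
        ' Hypothetical usage: FileCompare.exe original.htm rewritten.htm
        If args.Length < 2 Then
            Console.WriteLine("Usage: FileCompare <fileA> <fileB>")
            Return
        End If
        Console.WriteLine(FilesAreIdentical(args(0), args(1)))
    End Sub
End Module
```

Note that the posted code writes with sw.WriteLine, which appends a line terminator, so a straight pass-through would differ from the original by those trailing bytes even when the encoding round-trips cleanly. Using sw.Write for the test avoids that false difference.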
 
Joergen Bech said:
> á is just a single-byte ASCII character and Word just puts it into
> the html file without encoding.
>
> Why would you want to use System.Text.Encoding.Default?

It will return the system's default Windows ANSI codepage. However, the
encoding used to encode the file must be used to decode the file too.
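
To see why the decoding side matters, here is a small sketch that decodes the same three bytes under two code pages. It is a possible reading of the symptoms, not something confirmed in this thread: byte A0 is a non-breaking space and bytes 93/94 are curly quotes in Windows-1252, while in the DOS code page 437 the same bytes are "á", "ô" and "ö", which matches the characters reported. The module name is made up, and the code assumes .NET Framework, where code pages 1252 and 437 are available without extra setup.

```vbnet
Imports System
Imports System.Text

Module CodePageDemo
    Sub Main()
        ' Bytes Word typically emits under Windows-1252:
        ' &HA0 = non-breaking space, &H93 / &H94 = curly double quotes.
        Dim raw As Byte() = {&HA0, &H93, &H94}

        Dim ansi As Encoding = Encoding.GetEncoding(1252) ' Windows-1252 (ANSI)
        Dim oem As Encoding = Encoding.GetEncoding(437)   ' DOS/OEM US code page

        For Each b As Byte In raw
            Dim one As Byte() = {b}
            Dim asAnsi As Char = ansi.GetString(one).Chars(0)
            Dim asOem As Char = oem.GetString(one).Chars(0)
            ' Print code points rather than glyphs, since the console
            ' itself may use yet another code page.
            Console.WriteLine("0x{0:X2}: 1252 -> U+{1:X4}, 437 -> U+{2:X4}", _
                              b, AscW(asAnsi), AscW(asOem))
        Next
    End Sub
End Module
```

Under code page 437 the three bytes come out as U+00E1 (á), U+00F4 (ô) and U+00F6 (ö). If that is what is happening, the file contents may be fine and only the code page used to read or display them is wrong.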
 
> It will return the system's default Windows ANSI codepage. However, the
> encoding used to encode the file must be used to decode the file too.

Yes, that is what I meant. If you grab the system code page it
could be anything - not necessarily suitable for reading a particular
html file.

Perhaps the charset information found in any Word-generated
html file might be of more use here?

/Joergen Bech
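
Using the charset declared in the file could look roughly like this. It is a sketch under assumptions: the names DetectHtmlEncoding and ReadHtml are invented for the example, the declaration is assumed to appear within the first 4096 bytes, and Windows-1252 is an arbitrary fallback when nothing usable is found.

```vbnet
Imports System
Imports System.IO
Imports System.Text
Imports System.Text.RegularExpressions

Module CharsetSniffer
    ' Looks for charset=... near the top of the file and returns the matching
    ' Encoding, or the supplied fallback when no usable declaration is found.
    Function DetectHtmlEncoding(ByVal raw As Byte(), ByVal fallback As Encoding) As Encoding
        ' ISO-8859-1 maps every byte to exactly one character, so the
        ' ASCII-only meta tag survives this provisional decode intact.
        Dim latin1 As Encoding = Encoding.GetEncoding("iso-8859-1")
        Dim head As String = latin1.GetString(raw, 0, Math.Min(raw.Length, 4096))

        Dim m As Match = Regex.Match(head, "charset\s*=\s*[""']?([A-Za-z0-9_\-]+)", _
                                     RegexOptions.IgnoreCase)
        If Not m.Success Then Return fallback

        Try
            Return Encoding.GetEncoding(m.Groups(1).Value)
        Catch ex As ArgumentException
            Return fallback ' unknown charset name
        Catch ex As NotSupportedException
            Return fallback ' charset not supported on this platform
        End Try
    End Function

    ' Reads the whole file using the declared charset.
    Function ReadHtml(ByVal path As String) As String
        Dim raw As Byte() = File.ReadAllBytes(path)
        Dim enc As Encoding = DetectHtmlEncoding(raw, Encoding.GetEncoding(1252))
        Return enc.GetString(raw)
    End Function

    Sub Main(ByVal args As String())
        ' Hypothetical usage: CharsetSniffer.exe page.htm
        If args.Length < 1 Then Return
        Dim raw As Byte() = File.ReadAllBytes(args(0))
        Console.WriteLine(DetectHtmlEncoding(raw, Encoding.GetEncoding(1252)).WebName)
    End Sub
End Module
```

Compared with Encoding.Default, this ties the decoding to what the file says about itself rather than to the machine the program happens to run on.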
 
Thanks for your input, guys. I think I've got it.
I ran the program without calling the CleanHTML function and it
worked fine, so the encoding problem must arise when running this function.

The reason I used System.Text.Encoding.Default for the StreamReader and
StreamWriter is that no other encoding would work with both English and
French documents. When I used UTF8 for both, the French characters were
removed altogether, so I now use Default for the StreamReader and UTF8
for the StreamWriter, and it seems to be working fine for both English
and French documents.
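
A likely reason the French characters vanished with UTF8 on both sides is that the files are not UTF-8 to begin with. The sketch below assumes the input is Windows-1252, which is what Encoding.Default resolves to on Western European and US Windows installations; the module name is made up, and the code assumes .NET Framework.

```vbnet
Imports System
Imports System.Text

Module Utf8MisreadDemo
    Sub Main()
        Dim cp1252 As Encoding = Encoding.GetEncoding(1252)

        ' "café" as a Windows-1252 file stores it: é is the single byte &HE9.
        Dim raw As Byte() = cp1252.GetBytes("caf" & ChrW(&HE9))

        ' Decoded as UTF-8, a lone &HE9 is an invalid sequence, so the
        ' decoder substitutes U+FFFD and the accented letter is lost.
        Dim decodedBad As String = Encoding.UTF8.GetString(raw)

        ' Decoded with the encoding the bytes were written in, the text survives.
        Dim decodedOk As String = cp1252.GetString(raw)

        Console.WriteLine(AscW(decodedBad.Chars(3)).ToString("X4")) ' FFFD
        Console.WriteLine(AscW(decodedOk.Chars(3)).ToString("X4"))  ' 00E9

        ' Once decoded correctly, re-encoding as UTF-8 is lossless:
        ' é becomes two bytes, so four characters become five bytes.
        Dim utf8Bytes As Byte() = Encoding.UTF8.GetBytes(decodedOk)
        Console.WriteLine(utf8Bytes.Length) ' 5
    End Sub
End Module
```

This is consistent with the combination that ended up working: a reader that matches the file's actual encoding and a UTF8 writer. Naming the code page explicitly in the reader, instead of relying on Default, would keep the behaviour the same on a machine with a different system code page.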

I should mention that this application is only going to be used in-house
and only on English and French documents, so I think I should be
fine.

Thanks again
 