Rob
Hi,
I have a VB.Net application that parses an HTML file. This file was an
MS Word document that was saved as a web page.
My application removes all unnecessary code generated by MS Word and
does some custom formatting needed by my client.
I use a StreamReader to read in the file, regular expressions to parse
and clean it up, and a StreamWriter to write the new file. In some of
the HTML files that I parse, the character "á" shows up in a lot of
places in the output.
I've tried different encodings for my StreamReader and StreamWriter.
That works sometimes, but then my application doesn't work when I parse
a French HTML file.
This is what I'm doing:
Dim filename As String = myfileinfo.OriginalFileName
Dim sr As New StreamReader(filename, System.Text.Encoding.Default)
Dim textstream As String = sr.ReadToEnd()
'close the stream...don't need it anymore
sr.Close()
Dim newtext As String
'send the stream for cleaning
newtext = CleanHTML(textstream)
Dim OutputFileName As String = myfileinfo.NewFolderPathToFile
Dim fs As New FileStream(OutputFileName, FileMode.Create, FileAccess.Write)
Dim sw As New StreamWriter(fs, System.Text.Encoding.Default)
sw.WriteLine(newtext)
sw.Close()
These are the results:
PROGRAM</span></a></span></p><a href="#_Toc169072483"><span>99(15)0<span>á </span>INSTALMENT
INSTALMENT PROGRAM</span></a></span></p><a href="#_Toc169072484"><span>99(15)1<span>á </span>OVERVIEW OF THE
<p class="MsoNormal" ><span>áá </span>OP:<span>á
</span>BB<span>áááááááááááá </span>ACCT:<span>á </span>123456789
YR:<span>á </span>2004<span>áááááááááááááááááááááá </span>PG:<span>á
</span>1 of 1<span>áááááááááááááááááá </span>26FEB
2004 USER-ID</p>
I also have different characters representing quotes such as "ô" and "ö"
(not shown here).
When I do a "Save as Web Page" from MS Word 2000, the characters aren't
there, but they show up after running the file through my program.
I'm assuming they represent characters that it doesn't understand, but
does anyone know how to resolve this?
Thanks
Rob