StreamReader omits 0x93 and 0x94 when reading text file

  • Thread starter: Drew Berkemeyer

Drew Berkemeyer

Hello,

I'm using the following code to read a text file in VB.NET.

Dim sr As StreamReader = File.OpenText(strFilePath)
Dim input As String = sr.ReadLine()

While Not input Is Nothing
strReturn += input + vbCrLf
input = sr.Read
End While

sr.Close()

For most cases this works fine. However, we have found that the opening
(0x93 - “) and closing (0x94 - ”) quotation marks are being dropped without
warning or error.

E.g.
Original Text: This is some “quoted” text.
Text read in: This is some quoted text.

Does anyone have any clues as to what is going on here? Any advice is
appreciated.

Sincerely,
Drew Berkemeyer
 
* Drew Berkemeyer said:
I'm using the following code to read a text file in VB.NET.

Dim sr As StreamReader = File.OpenText(strFilePath)
Dim input As String = sr.ReadLine()

While Not input Is Nothing
strReturn += input + vbCrLf
input = sr.Read

'Read' or 'ReadLine'?
 
Hi Drew,

In addition to Herfried's suggestion, it seems that the " character is
represented by 0x22 in ASCII.
You may try the code below.
Sub Main()
For i As Integer = 0 To 255
Console.WriteLine("0x" + Hex(i) + ": " + Chr(i))
Next
End Sub

The output includes:
0x22: "

and
0x93:
0x94:
These two are control characters, not printable characters.
If I have misunderstood anything, please feel free to let me know.

Best regards,

Peter Huang
Microsoft Online Partner Support

 
Thank you for the reply, but...

ASCII 0x93 and 0x94 are *not* control characters. They are the open and
close quotes.

If I open Notepad and type Alt+0147 (0x93) I get “. If I type Alt+0148
(0x94) I get ”.

So, to answer my own question posted earlier... Here's the solution. I was
not using the proper Encoding. I had created a StreamReader like this:

Dim sr As StreamReader = New StreamReader(strFilePath)

The correct code for reading in a plain text file (which does not eat “
and ” chars) is:

Dim sr As StreamReader = New StreamReader(strFilePath, System.Text.Encoding.Default)
Dim input As String = sr.ReadLine()

While Not input Is Nothing
strReturn += input + vbCrLf
input = sr.ReadLine()
End While

sr.Close()

Thanks again for your help. I appreciate the effort.

- Drew
 
* Drew Berkemeyer said:
Thank you for the reply, but...

ASCII 0x93 and 0x94 are *not* control characters. They are the open and
close quotes.

They are not quotes by definition. ASCII is a 7-bit encoding that
doesn't include more than 128 characters.

If I open Notepad and type Alt+0147 (0x93) I get “. If I type Alt+0148
(0x94) I get ”.

Right, but that's not ASCII.
 
Drew,
With the high bit stripped, 0x93 becomes 0x13 and 0x94 becomes 0x14. As
Herfried stated, ASCII is a 7-bit character set; the high bit is ignored at
best and causes an exception at worst (these values are simply not valid
ASCII).

When you open Notepad you are working in ANSI, with a specific code page (the
code page is set in the Windows Control Panel). ANSI code pages use full 8-bit
characters. 0x93 & 0x94 are typographic quote characters in the US ANSI code
page, and I believe they are typographic quote characters in most European
ANSI code pages also.

Based on your original post, it appears you are using
System.IO.File.OpenText, which opens the file as UTF-8. 0x93 & 0x94 are NOT
typographic quote characters in UTF-8! UTF-8 is a variable-length encoding of
Unicode built from 8-bit units: code points 128 and above are written as
multi-byte sequences, so a lone 0x93 or 0x94 byte is not a valid character on
its own. The typographic quote characters are U+201C and U+201D in Unicode,
which UTF-8 encodes as the bytes E2 80 9C and E2 80 9D.
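
As a rough illustration of the point above (a sketch, not code from the thread): the same four bytes decode differently under the two encodings. It assumes .NET Framework, where code page 1252 is available by default; on .NET Core or .NET 5+ you would likely need to call Encoding.RegisterProvider(CodePagesEncodingProvider.Instance) first. The module name EncodingDemo is made up for the example.

```vbnet
Imports System
Imports System.Text

Module EncodingDemo
    Sub Main()
        ' The bytes for the word "hi" wrapped in Windows-1252 curly quotes (0x93 ... 0x94).
        Dim raw As Byte() = {&H93, &H68, &H69, &H94}

        ' Assumption: code page 1252 is available (true by default on .NET Framework).
        Dim cp1252 As Encoding = Encoding.GetEncoding(1252)
        Dim asAnsi As String = cp1252.GetString(raw)
        Dim asUtf8 As String = Encoding.UTF8.GetString(raw)

        ' Print code points instead of characters to sidestep console encoding issues.
        Console.Write("windows-1252: ")
        For Each c As Char In asAnsi
            Console.Write("U+" & AscW(c).ToString("X4") & " ")
        Next
        Console.WriteLine()

        Console.Write("UTF-8:        ")
        For Each c As Char In asUtf8
            Console.Write("U+" & AscW(c).ToString("X4") & " ")
        Next
        Console.WriteLine()
    End Sub
End Module
```

Under code page 1252 the first and last characters come out as U+201C and U+201D. Under UTF-8 the 0x93 and 0x94 bytes are invalid on their own, so depending on the runtime version they are either dropped silently (which matches the symptom in the original post) or replaced with U+FFFD.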

To see a full explanation of Unicode and Encodings see:

http://www.yoda.arachsys.com/csharp/unicode.html


To see the Unicode code point for a character in Character Map, look in the
lower left corner of the window. It gives the Unicode code point, while the
lower right gives the ANSI/keyboard shortcut. Note that the "character set"
combo box in Character Map corresponds to the Encoding in .NET.

Common Unicode typographic quote chars include, but are not limited to:

' what most people think of as quote chars
Const Apostrophe As Char = ChrW(&H27) ' single quotes
Const Quote As Char = ChrW(&H22) ' double quotes

' various typographic quote characters
Const LeftSingleQuote As Char = ChrW(&H2018)
Const RightSingleQuote As Char = ChrW(&H2019)
Const LeftDoubleQuote As Char = ChrW(&H201C)
Const RightDoubleQuote As Char = ChrW(&H201D)

' other typographic quote characters (international)
' Note: HP48 uses these for delimiters
Const LeftPointingDoubleAngleQuote As Char = ChrW(&HAB)
Const RightPointingDoubleAngleQuote As Char = ChrW(&HBB)

' other typographic quote characters (international)
Const SingleLow9Quote As Char = ChrW(&H201A)
Const SingleHighReversed9Quote As Char = ChrW(&H201B)
Const DoubleLow9Quote As Char = ChrW(&H201E)

The above are valid for Unicode encodings (such as UTF-8).

Hope this helps
Jay
 
Drew,
I should add, to read a text file (Notepad file) in your default Windows
encoding you can use the following:

Imports System.Text

Dim sr As New StreamReader(strFilePath, Encoding.Default)

Hope this helps
Jay
 
Thank you to both Herfried and Jay.

I should have come back here sooner! I've spent the last week going round
and round with this only to discover exactly what you posted! <sigh> Thank
you so much.

Just as you explained, my problem is that I was using the incorrect Encoding
object. After much research into both the available options (which was not
straightforward research) and the content of the file I am opening (in
this case one written by Notepad on Windows) I realized I should be using the
following:

Imports System.Text

Dim sr As New StreamReader(strFilePath, Encoding.GetEncoding("windows-1252"))

I chose "windows-1252" because it will produce the same results on all
systems and is not dependent on (as pointed out by Jay) the default code
page settings of the computer running the code.
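
For anyone finding this later, the whole read loop with the explicit encoding might look roughly like the sketch below. The names ReadAnsiFile and ReadAllText1252 are hypothetical, the Using block and IsNot need VB 2005 or later, and on .NET Core or .NET 5+ the "windows-1252" lookup is assumed to need Encoding.RegisterProvider(CodePagesEncodingProvider.Instance) beforehand.

```vbnet
Imports System.IO
Imports System.Text

Module ReadAnsiFile
    ' Hypothetical helper: reads a Windows-1252 text file and returns its
    ' contents with every line terminated by CR/LF.
    Function ReadAllText1252(ByVal path As String) As String
        Dim result As New StringBuilder()

        ' Name the encoding explicitly so the result does not depend on the
        ' default code page of the machine running the code.
        Using sr As New StreamReader(path, Encoding.GetEncoding("windows-1252"))
            Dim line As String = sr.ReadLine()
            While line IsNot Nothing
                result.Append(line).Append(vbCrLf)
                line = sr.ReadLine()
            End While
        End Using

        Return result.ToString()
    End Function
End Module
```

A StringBuilder is used here instead of repeated += on a String only because it avoids copying the accumulated text on every line; the behaviour is otherwise the same as the loop posted earlier.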

Thank you all for your assistance. I'm glad to get this one behind me and
learn something new.

Sincerely,
Drew Berkemeyer
 