Problem with embedded carriage returns

  • Thread starter: Harry

Harry

Hi All

I have a .csv file that contains newline chars embedded between quotes in a
line of text eg

BSPADV1,John.public,9413,"Sharrock Ashley
TEST STREET 1
TEST NSW 2200",Address Insufficient,,,Mbase Print
Report,R7TDKPFMDBCLKE07CGJMFKKW6VVB/21,Sharrock Ashley

There are actually some 19 columns of data but when a StreamReader.ReadLine
method tries to read line by line, it only returns data up to the first
embedded newline chars. Interestingly, the data does display correctly in
Notepad, so I guess Notepad must ignore newline chars inside quotation
marks.

Is there any way to read the above line and get the full line of data? My
only thought so far is to use the ReadToEnd method and then try to remove
the newline chars between quotes programmatically before splitting on the
"real" newline chars.

Thanks for any help
 
Harry said:
I have a .csv file that contains newline chars embedded between quotes in
a line of text eg

BSPADV1,John.public,9413,"Sharrock Ashley
TEST STREET 1
TEST NSW 2200",Address Insufficient,,,Mbase Print
Report,R7TDKPFMDBCLKE07CGJMFKKW6VVB/21,Sharrock Ashley

There are actually some 19 columns of data but when a
StreamReader.ReadLine method tries to read line by line, it only returns
data up to the first embedded newline chars. Interestingly, the data does
display correctly in Notepad, so I guess Notepad must ignore newline chars
inside quotation marks.

What do you mean by "displays correctly"? Notepad just displays the text
contained in the file.

Harry said:
Is there any way to read the above line and get the full line of data? My
only thought so far is to use the ReadToEnd method and then try to remove
the newline chars between quotes programmatically before splitting on the
"real" newline chars.

This would be one possible approach. You may want to take a closer look at
regular expressions for simple "parsing" of the text file. Alternatively
you may want to read the file line-by-line, analyze each line and
concatenate the parts of a row which is split into multiple lines manually.
However, the best approach depends on what exactly you want to achieve.
 
Herfried K. Wagner said:
What do you mean by "displays correctly"? Notepad just displays the text
contained in the file.


This would be one possible approach. You may want to take a closer look
at regular expressions for simple "parsing" of the text file.
Alternatively you may want to read the file line-by-line, analyze each
line and concatenate the parts of a row which is split into multiple lines
manually. However, the best approach depends on what exactly you want to
achieve.

Thanks, Herfried, for your answer and your excellent support of this group.

I have solved the problem by getting the StreamReader to load the entire
file into a string var and then doing a
buffer.Split(CChar(Environment.NewLine)). I then read all the lines from the
array produced.

It seems StreamReader.ReadLine only grabs a row of data up to the first
newline char it encounters (kinda makes sense).

The method I have now employed seems to ignore any newline characters
enclosed in quotation marks.

Thanks again for your help

Harry
 
Your issue stems from your considering a 'newline' to be a single character.

Under Windows, it is, in fact, a pair of characters comprising a Carriage
Return character and a Line Feed character in that sequence. It is often
referred to as a Cr/Lf pair.

If what you describe is correct then the 'newline' embedded between quotes
is not a 'newline' at all.

I suspect it is a single character and is either a Cr or a Lf.

Notepad only recognizes a Cr/Lf pair as a line break and therefore would
show your 'line' unbroken. I further suspect that where the embedded
'newline' should have been, Notepad would have shown an 'unprintable'
character which looks like a hollow rectangle.

The StreamReader.ReadLine() method defines a line as a sequence of
characters followed by a Line Feed character, a Carriage Return character or
a Cr/Lf pair.

In your case, the 'ReadToEnd' and a split on Environment.NewLine is the
appropriate course of action.
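
A minimal sketch of that approach, assuming the file is small enough to hold in memory and that real record breaks are always Cr/Lf pairs. The names SplitOnCrLfSketch, SplitRecords and ReadRecords are made up for illustration and do not come from this thread. The split is done on the two-character string rather than on a single Char, because CChar applied to a String keeps only its first character and would therefore split on Cr alone.

```vbnet
Imports System
Imports System.IO
Imports Microsoft.VisualBasic

Module SplitOnCrLfSketch

    ' Splits text into records on the two-character Cr/Lf sequence only,
    ' so a lone Cr or a lone Lf inside a quoted field stays in its record.
    Function SplitRecords(ByVal text As String) As String()
        Dim separators() As String = {vbCrLf}
        Return text.Split(separators, StringSplitOptions.RemoveEmptyEntries)
    End Function

    ' Reads the whole file into memory first; only sensible for small files.
    Function ReadRecords(ByVal fileName As String) As String()
        Using reader As New StreamReader(fileName)
            Return SplitRecords(reader.ReadToEnd())
        End Using
    End Function

    Sub Main()
        ' Hypothetical sample: the quoted field contains a lone Cr.
        Dim sample As String = "A,""line one" & vbCr & "line two"",B" & vbCrLf & _
                               "C,D,E" & vbCrLf
        Dim records() As String = SplitRecords(sample)
        Console.WriteLine(records.Length)
    End Sub

End Module
```

With the sample in Main, the lone Cr inside the quoted field does not start a new record, so the count printed is 2.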
 
In your case, the 'ReadToEnd' and a split on Environment.NewLine is the
appropriate course of action.

Unless he's dealing with large data files; then it's a recipe for disaster.
 
Harry,

I don't know if it helps you.

However, in the beginning CSV always had (only) this format:

For English use
, for embedding a numeric
"," for embedding a string

For non-English use
; for embedding a numeric
";" for embedding a string

I thought that this is in fact still the same, although a lot of people use
their own format where they simply use only commas.

Cor
 
Stephany.
I suspect it is a single character and is either a Cr or a Lf.

I thought: is Stephany writing an SMS coded "Cor or a lover"?

Before you become angry: I did not read that, only the Cr made me first read
Cor, when I was reading bottom up and from right to left.

But in fact it shows why I dislike the use of only a line feed or a carriage
return as the end of a string.

Most persons who use European characters first move their eyes a row lower
and then turn their eyes to the left (or vice versa) when they are reading.

Cor
 
Tom Shelton said:
Unless he's dealing with large data files; then it's a recipe for
disaster.

You are correct, Tom. However, the files in question are always quite small
eg 4 to 10 rows of data.

I am intrigued, however, as to how you would go about reading in a large
file of this nature.

Stephany made a correct observation, btw, that the characters in between the
quotes were, in fact, chr(13) only. I should have noticed that.

Cheers
 
Well now you've opened a can of worms Harry.

There is no silver bullet for this and there are a myriad of techniques you
can use.

The first thing you need to do is consider the provenance of the input file.

Is the program that creates the file a program that you can have some
influence over?

If so, then the technique that I would favour is to have it write what I call
a special 'guard' character at the end of each full line. To read such a
file you would then use the ReadLine() method in a loop in the usual manner,
but, within each iteration of the loop, check to see if your buffer ends
with the 'guard' character. If it does then you have the full line. If it
doesn't then repeatedly call the ReadLine() method and append the result to
your buffer until it does. Once you have a full line then process it in the
normal way and continue the main loop.

If not, then you need to jump through some hoops. If the input file is
well-formed in that an opening quote always has a closing quote and the
extra 'newline's always fall between a pair of quotes then instead of
checking for a 'guard' character you could check to see if your buffer has
an even number of quote characters. Again append the result of interim
ReadLine() method calls until it does. This technique, of course, is less
efficient because it requires constant inspection of individual characters in
strings.
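
The quote-counting idea could be sketched roughly as follows. This assumes, as stated above, that every extra line break falls between a pair of quotes. The names QuoteCountSketch, CountQuotes and ReadRecords are hypothetical, and the choice to put a space where each embedded break was dropped is an assumption for illustration only.

```vbnet
Imports System
Imports System.IO
Imports System.Text
Imports System.Collections.Generic
Imports Microsoft.VisualBasic

Module QuoteCountSketch

    ' Counts the double-quote characters in a string.
    Function CountQuotes(ByVal s As String) As Integer
        Dim n As Integer = 0
        For Each c As Char In s
            If c = """"c Then n += 1
        Next
        Return n
    End Function

    ' Appends physical lines until the running quote count is even,
    ' i.e. until no quoted field is still open, then emits one record.
    Function ReadRecords(ByVal reader As TextReader) As List(Of String)
        Dim records As New List(Of String)
        Dim pending As New StringBuilder()
        Dim quotes As Integer = 0
        Dim line As String = reader.ReadLine()
        While line IsNot Nothing
            ' A non-empty buffer means we are continuing an open quoted field.
            If pending.Length > 0 Then pending.Append(" ")
            pending.Append(line)
            quotes += CountQuotes(line)
            If quotes Mod 2 = 0 Then
                records.Add(pending.ToString())
                pending.Length = 0
                quotes = 0
            End If
            line = reader.ReadLine()
        End While
        ' An unterminated quote at end of input leaves a partial record.
        If pending.Length > 0 Then records.Add(pending.ToString())
        Return records
    End Function

    Sub Main()
        ' Hypothetical two-record sample with lone Cr characters in quotes.
        Dim sample As String = "1,""Sharrock Ashley" & vbCr & "TEST STREET 1" & vbCr & _
                               "TEST NSW 2200"",x" & vbCrLf & "2,plain,y" & vbCrLf
        Using reader As New StringReader(sample)
            For Each record As String In ReadRecords(reader)
                Console.WriteLine(record)
            Next
        End Using
    End Sub

End Module
```

Because ReadLine() treats Cr, Lf and Cr/Lf alike, the first record arrives in three pieces and is only emitted once its closing quote has been seen.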

Another technique is to forget about the ReadLine() method completely,
decide on an arbitrary 'block' size and read that number of characters at a
time. For each block, find the last instance of a Cr/Lf pair. Append
everything before that pair to the remainder from the previous read as your
substantive block and save everything after that pair as the remainder.
Split the substantive block on Cr/Lf and process each line in the normal
way.
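
A rough sketch of the block-reading idea, with one deliberate variation that is my assumption rather than part of the description above: the search for the last Cr/Lf pair is done on the previous remainder joined with the new block, so that a pair straddling a block boundary is still found. BlockReadSketch, ReadRecords and the block size of 4 are illustrative only.

```vbnet
Imports System
Imports System.IO
Imports System.Collections.Generic
Imports Microsoft.VisualBasic

Module BlockReadSketch

    ' Reads blockSize characters at a time and returns the records that
    ' are separated by Cr/Lf pairs; a lone Cr or Lf is left in place.
    Function ReadRecords(ByVal reader As TextReader, ByVal blockSize As Integer) As List(Of String)
        Dim records As New List(Of String)
        Dim separators() As String = {vbCrLf}
        Dim block(blockSize - 1) As Char
        Dim remainder As String = ""
        Dim count As Integer = reader.Read(block, 0, blockSize)
        While count > 0
            Dim combined As String = remainder & New String(block, 0, count)
            Dim lastBreak As Integer = combined.LastIndexOf(vbCrLf, StringComparison.Ordinal)
            If lastBreak >= 0 Then
                ' Everything before the last pair is made of complete records.
                Dim complete As String = combined.Substring(0, lastBreak)
                records.AddRange(complete.Split(separators, StringSplitOptions.None))
                remainder = combined.Substring(lastBreak + 2)
            Else
                ' No complete record yet; keep accumulating.
                remainder = combined
            End If
            count = reader.Read(block, 0, blockSize)
        End While
        ' Text after the final pair, if any, is the last record.
        If remainder.Length > 0 Then records.Add(remainder)
        Return records
    End Function

    Sub Main()
        ' Hypothetical sample; a tiny block size forces several reads.
        Dim sample As String = "a,""x" & vbCr & "y"",b" & vbCrLf & "c,d" & vbCrLf
        Using reader As New StringReader(sample)
            Console.WriteLine(ReadRecords(reader, 4).Count)
        End Using
    End Sub

End Module
```

For the sample in Main this prints 2. In practice the block size would be far larger, and the remainder can grow without bound if the input contains no Cr/Lf pairs at all.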

Mileage will vary when using any of these techniques due to file size,
length of line, amount of memory resources available etc.

The other factor is that everyone that you talk to will have their own
favourite technique that, of course, they consider to be far superior to
anybody else's ;)
 
Thanks Stephany

As usual your astute observations are spot on. You have raised an event in
my thought process.

Cheers

Harry
 
Harry, have you tried the built-in FileIO.TextFieldParser class?

It handles embedded EOL characters and commas within quoted fields,
including fields broken across two lines. It's all handled for you. <g>
I also presume that it buffers as it reads, and the Windows RTL already
caches text files, so it should be pretty efficient.

Here is an example console application with a ReaderCVS wrapper class
and example usage:

'--------------------------------------------------------------
' File : readcsf.vb
'--------------------------------------------------------------
Option Strict On
Option Explicit On

Imports System
Imports System.Console
Imports System.Collections.Generic

Module module1

    Public Class ReaderCVS

        ' Rows read by the last ReadCSF call; each row is a list of fields.
        Public Shared data As New List(Of List(Of String))

        '
        ' Read cvs file with max_fields, optional eolfilter
        '
        Public Shared Function ReadCSF( _
                ByVal fn As String, _
                ByVal max_fields As Integer, _
                Optional ByVal eolfilter As Boolean = True) As Boolean
            Try
                data.Clear()
                ' Using closes the parser (and the file) when done.
                Using rdr As FileIO.TextFieldParser = _
                        My.Computer.FileSystem.OpenTextFieldParser(fn)
                    rdr.SetDelimiters(",")

                    Dim flds As New List(Of String)
                    While Not rdr.EndOfData
                        Dim lines As String() = rdr.ReadFields()
                        For Each fld As String In lines
                            If eolfilter Then
                                'fld = fld.Replace(vbCr,"\r").Replace(vbLf,"\n")
                                fld = fld.Replace(vbCr, " ").Replace(vbLf, "")
                            End If
                            flds.Add(fld)
                            ' Start a new row once max_fields fields are collected.
                            If flds.Count = max_fields Then
                                data.Add(flds)
                                flds = New List(Of String)
                            End If
                        Next
                    End While
                    ' Keep a trailing row that has fewer than max_fields fields.
                    If flds.Count > 0 Then data.Add(flds)
                End Using
                Return True

            Catch ex As Exception
                WriteLine(ex.Message)
                Return False
            End Try
        End Function

        ' Print every field with its row and field number.
        Sub Dump()
            WriteLine("------- DUMP ")
            Dim nrow As Integer = 0
            For Each flds As List(Of String) In data
                nrow += 1
                Dim nfld As Integer = 0
                For Each f As String In flds
                    nfld += 1
                    WriteLine("- r{0,-3} f{1,-3} => [{2}]", nrow, nfld, f)
                Next
            Next
        End Sub

    End Class

    Sub Main(ByVal args() As String)
        Dim csf As New ReaderCVS
        ReaderCVS.ReadCSF("harry1.csv", 19) ' 19 fields
        csf.Dump()
        ReadKey(True)
    End Sub

End Module

I used your example text as the harry1.csv test file.

I provided an eolfilter boolean to filter out the <CR><LF> characters and
played with \r\n and space/blank replacements.
