Read cvs File Fails

  • Thread starter Thread starter rob
  • Start date Start date
R

rob

I am using Odbc to read cvs files. Unfortunately, some cvs files are
not formated correctly (out of my control). One particular problem is
that a quote within an item is not put in double quotes, i.e. the file
says

"this "item" is bad", "this item is ok"

rather then

"this ""item"" is bad", "this item is ok"

odbc now thinks 'this ' is the first item rather then 'this "item" is
bad'. Excel reads the file just fine, though. Is there some workaround,
short of fixing the file myself, to make the driver more error
tolerant? If it helps anything bellow is how I read the excel file.

Thanks


connectionString = @"Driver={Microsoft Text Driver (*.txt;
*.csv)};DBQ=" + Path.GetDirectoryName(filename);
connection = new OdbcConnection(connectionString);
connection.Open();
command = new OdbcCommand("Select * FROM " +
Path.GetFileName(filename), connection);
reader = command.ExecuteReader();
 
The first line "this "item" is bad", "this item is ok" is correct CSV
format. There is no field delimiter (comma) between the two quotes around
"item". Either your ODBC handler has a bug or you misconfiguring it.
Please post your code that configures your ODBC reader. Your other option
is to write a CSV line parser yourself, which isn't too difficult to do.

Mike Ober.
 
Thanks for the reply. I do not do any special configuration but just
what is shown in the original post. Then I read the fields in the
following ways:

while (reader.Read()) {
for (i = 0; i < reader.FieldCount; i++){
reader.GetValue(i);
}}

What kind of configuration do I have to do to get this working?

Thanks
 
I am using Odbc to read cvs files. Unfortunately, some cvs files are
not formated correctly (out of my control). One particular problem is
that a quote within an item is not put in double quotes, i.e. the file
says

"this "item" is bad", "this item is ok"

rather then

"this ""item"" is bad", "this item is ok"

odbc now thinks 'this ' is the first item rather then 'this "item" is
bad'. Excel reads the file just fine, though. Is there some workaround,
short of fixing the file myself, to make the driver more error
tolerant? If it helps anything bellow is how I read the excel file.

Thanks


connectionString = @"Driver={Microsoft Text Driver (*.txt;
*.csv)};DBQ=" + Path.GetDirectoryName(filename);
connection = new OdbcConnection(connectionString);
connection.Open();
command = new OdbcCommand("Select * FROM " +
Path.GetFileName(filename), connection);
reader = command.ExecuteReader();

Have you tried reading the files directly, writing code to split the columns out
with the split command? If the data has the commas in the correct places, the
double quotes won't make any difference. If you don't want the quotes you can
replace all of them with string.Empty AFTER YOU SPLIT THE LINE.

If you're trying to load the files into MS SQL Server, create a DTS Package
(assuming you're using SQL 2K) and do a bulk import into the table you're
filling.

I never use ODBC or any other data provider to read text files, for the very
reason you are experiencing.

This is cheating, but have you considered reading the file with Excel and saving
it as a tab delimited or comma delimited file that has no quotes? It won't be a
good solution if you are going to have to automate the process ;o)

What are you doing with the file after converting it?

Good luck with your project,

Otis Mukinfus
http://www.arltex.com
http://www.tomchilders.com
 
Otis said:
Have you tried reading the files directly, writing code to split the columns out
with the split command? If the data has the commas in the correct places, the
double quotes won't make any difference. If you don't want the quotes you can
replace all of them with string.Empty AFTER YOU SPLIT THE LINE.

If you're trying to load the files into MS SQL Server, create a DTS Package
(assuming you're using SQL 2K) and do a bulk import into the table you're
filling.

I never use ODBC or any other data provider to read text files, for the very
reason you are experiencing.

This is cheating, but have you considered reading the file with Excel and saving
it as a tab delimited or comma delimited file that has no quotes? It won't be a
good solution if you are going to have to automate the process ;o)

What are you doing with the file after converting it?

Good luck with your project,

My first approach was using split. The problem that the individual
items often also contain comas so split faild almost always. I tried
regex as well but again due to many special cases that fails as well. I
know that Excel can read the files correctly including all these
special (and maybe even wrong) cases. Hopeing that excel uses odbc
(rather then duplicating some code) I tried that approach. But it does
not seem to work, either.

I guess I might end up writing my own parser. It's probably not that
big an issue but of course it would be nicer if I had a done solution
that handles all the special cases.

My final goal is to download cvs files from the internet. Actually, I
don't store them as files but have them right in memory. Then I have to
do some parsing (avoiding duplicates, extract the right data, do some
error checking, etc) on the files and whatever I parsed out goes into
an SQL database.

Thanks
 
[snip]
My first approach was using split. The problem that the individual
items often also contain comas so split faild almost always. I tried
regex as well but again due to many special cases that fails as well. I
know that Excel can read the files correctly including all these
special (and maybe even wrong) cases. Hopeing that excel uses odbc
(rather then duplicating some code) I tried that approach. But it does
not seem to work, either.

I guess I might end up writing my own parser. It's probably not that
big an issue but of course it would be nicer if I had a done solution
that handles all the special cases.

My final goal is to download cvs files from the internet. Actually, I
don't store them as files but have them right in memory. Then I have to
do some parsing (avoiding duplicates, extract the right data, do some
error checking, etc) on the files and whatever I parsed out goes into
an SQL database.

Thanks

I've run into that situation where the commas are inside quotes. What I have
done in the past is to replace all the instances of "," with tabs '\t' and
remove the double quote left at the beginning and end of the file. Then you can
try splitting it on the tabs. It doesn't always work, but it's worth a try.

Good luck with your project,

Otis Mukinfus
http://www.arltex.com
http://www.tomchilders.com
 
Otis said:
[snip]
My first approach was using split. The problem that the individual
items often also contain comas so split faild almost always. I tried
regex as well but again due to many special cases that fails as well. I
know that Excel can read the files correctly including all these
special (and maybe even wrong) cases. Hopeing that excel uses odbc
(rather then duplicating some code) I tried that approach. But it does
not seem to work, either.

I guess I might end up writing my own parser. It's probably not that
big an issue but of course it would be nicer if I had a done solution
that handles all the special cases.

My final goal is to download cvs files from the internet. Actually, I
don't store them as files but have them right in memory. Then I have to
do some parsing (avoiding duplicates, extract the right data, do some
error checking, etc) on the files and whatever I parsed out goes into
an SQL database.

Thanks

I've run into that situation where the commas are inside quotes. What I have
done in the past is to replace all the instances of "," with tabs '\t' and
remove the double quote left at the beginning and end of the file. Then you can
try splitting it on the tabs. It doesn't always work, but it's worth a try.

Thanks for your input. I know this approach will not work for me,
though. For instance I have cases where a list of words are put in
double quotes and separated by comas. Doing a general replace of ","
with [Tab] would separate them where it shouldn't. There might be other
scenarios where some items are put in double quotes (the once cotaining
comas and double quotes) and items that are not put in double quotes
(the once NOT containing comas or double quotes). I am sure the more I
will use the stuff the more other scenarios will come up. That is why I
would prefer not to write all that stuff myself but use some existing
(Microsoft) solution.

The only thing I can imagine right now is downloading the data from the
internet, store it to a file, load the file in Excel and then read each
individual cell from Excel. It's a huge overkill compared to just
download the data and parse it myself in memory, though.

Any other ideas would be appreciated.

Thanks
 
You said that Excel opens this correctly. When I tried it with my Excel
'03, it did not appear to me to open the line "correctly". It dropped
the quote before item, left both fields surrounded by quotes, and left
a space in front of the second field. You're going to have a rough time
finding anything that parses this "correctly" as it's actually an
impossibility to pull the data out of this in a stable manner, which is
why the rules are there in the first place. If the rules are broken,
all parsing rules are out. You're going to have to decide what rules
you can make about the file, and what you're willing to give up. If you
can for instance always know that there's supposed to be 7 fields per
line, you can use a normal parser first. If it doesn't return 7 fields
per line, you can bail out to a different parser that can try the
replacing of "," with tab or whatever to see if you can then get 7
fields per line. Obviously, there's going to be situations where this
will read the data incorrectly, but when you're dealing with an
incorrect format, sometimes you have to just get the highest
percentages possible and go from there. If you can't find any rules
like there must be a certain number of fields, or certain fields must
contain certain data, then I wish you all the best in trying to find a
way of making it stable, but it's very unlikely.

Bruce Dunwiddie
http://www.csvreader.com
 
In my first post I actually said that it is not in a correct format.
Michael responded and said it IS a correct csv format. Since I am no
expert in csv I did not want to claim otherwise but it was hard to
believe. The reason is that then "this ""item"" is bad" should show up
in Excel as 'this ""item"" is bad' which it does not, i.e. it shows up
as 'this "item" is bad'.

The space is an error on my side. I just wrote down an representative
example and accidentaly put the space in there.
 
Bruce,

Thanks for the reply. You are correct that the fields do not show up
correctly. What I meant to say is that Excel is able to separate the
fields correctly. Actually, I am not too much interested in the fields
that might not be rendered correctly. The important fields contain very
simple data that is hard to get wrong. The more complex fields with
double quotes, new line characters, commas, etc just have to be
separated correctly. Excel is able to do that and this is a valid
assumption that I can make (otherwise the publisher would be in
troubles). So I was hoping that the parsing algorithm from Excel would
in one way or another be accessible through the .NET framework.

In any case, with the assumption I can make about the file, while not
trivial, it should not be that hard to write my own parsing algorithm.
But of course if I can assume that the file can be read by Excel and
the parsing algorithm is accessible then this is my prefered method.

Thanks for the input.
 
In the definition for CSV format, quotes are included in the text by using
back-to-back quotes. So "Hi ""George"" Mason" should be interpreted by a
good CSV parser as

Hi "George" Mason

Problem is that a lot of CSV parsers are poorly written, and a lot of CSV
exporters are poorly written. Personally I like to go with Tab-delimited
format or some other non-CSV format when I can... Makes life a lot easier.
You can Google a CSV parser Regular Expression as well.
 
Back
Top