CSV decoding line-by-line

  • Thread starter Thread starter Phil Hibbs
  • Start date Start date
P

Phil Hibbs

I have a large CSV file that I need to read and filter into various
sheets based on some rules that test values in particular columns.
There are more than 64k rows, so I can't load it into one sheet first
and then split it up from there as this needs to be 2003-compatible.
I'm currently reading it line by line in a macro, breaking up each
line into its fields, applying the rules, and then writing each row to
the appropriate sheet. This is quite slow, I think mostly it's the
rules part of the code, but I was wondering if there was any way of
parsing a single row of CSV other than doing it the hard way? I
suppose I should try writing the entire line to a single cell and
calling Text To Columns on it and then sending it off to the
appropriate sheet. Maybe I could do, say, 10000 rows at a time this
way, and maybe use AutoFilter to apply the rules. Any other
suggestions?

Phil Hibbs.
 
Chip Pearson has a macro here:

http://www.cpearson.com/excel/ImportBigFiles.aspx

which allows you to import files with more than 64k records - it just
splits them up into different sheets. This will be quicker than
reading the file one line at a time. Perhaps you can sort the data in
the imported sheets to reflect your rules (or apply autofilter), and
then copy blocks of data into the sheets you need.

Hope this helps.

Pete
 
Also, you can import up to ~98000 lines using the Excel Text Import Wizard.
("import external data" on the Data menu)
You do this by specifying the start line of the imported data.
See... http://support.microsoft.com/kb/119770
--
Jim Cone
Portland, Oregon USA
http://www.mediafire.com/PrimitiveSoftware

..
..
..

"Phil Hibbs" <[email protected]>
wrote in message
I have a large CSV file that I need to read and filter into various
sheets based on some rules that test values in particular columns.
There are more than 64k rows, so I can't load it into one sheet first
and then split it up from there as this needs to be 2003-compatible.
I'm currently reading it line by line in a macro, breaking up each
line into its fields, applying the rules, and then writing each row to
the appropriate sheet. This is quite slow, I think mostly it's the
rules part of the code, but I was wondering if there was any way of
parsing a single row of CSV other than doing it the hard way? I
suppose I should try writing the entire line to a single cell and
calling Text To Columns on it and then sending it off to the
appropriate sheet. Maybe I could do, say, 10000 rows at a time this
way, and maybe use AutoFilter to apply the rules. Any other
suggestions?

Phil Hibbs.
 
Phil Hibbs formulated on Monday :
I have a large CSV file that I need to read and filter into various
sheets based on some rules that test values in particular columns.
There are more than 64k rows, so I can't load it into one sheet first
and then split it up from there as this needs to be 2003-compatible.
I'm currently reading it line by line in a macro, breaking up each
line into its fields, applying the rules, and then writing each row to
the appropriate sheet. This is quite slow, I think mostly it's the
rules part of the code, but I was wondering if there was any way of
parsing a single row of CSV other than doing it the hard way? I
suppose I should try writing the entire line to a single cell and
calling Text To Columns on it and then sending it off to the
appropriate sheet. Maybe I could do, say, 10000 rows at a time this
way, and maybe use AutoFilter to apply the rules. Any other
suggestions?

Phil Hibbs.

If it's a properly constructed CSV then you could load it into an array
of arrays, then loop the array for the desired value at its expected
position in each element.

Also, if it's a properly constructed CSV having the first line
containing the field names, you could use ADO and filter the data via a
WHERE clause and specify the field name and value to filter for. --If
the first line doesn't contain the field names then simply prepend a
line to the file contents and save it b efore running an ADO query.
 
GS said:
Also, if it's a properly constructed CSV having the first line
containing the field names, you could use ADO and filter the data via a
WHERE clause and specify the field name and value to filter for. --If
the first line doesn't contain the field names then simply prepend a
line to the file contents and save it b efore running an ADO query.

Unfortunately, the 4th line contains the column headings. That's one
of the reasons I need to process it line-by-line in VBA and decode
each line separately. I was kind of hoping there was some easy way to
take a string and parse it as a quoted CSV (or tab-separated, in the
case of another file I need to do something similar with). The SPLIT
function is not good enough as the data contains commas within quoted
values.

Phil Hibbs.
--
 
It happens that Phil Hibbs formulated :
Unfortunately, the 4th line contains the column headings. That's one
of the reasons I need to process it line-by-line in VBA and decode
each line separately. I was kind of hoping there was some easy way to
take a string and parse it as a quoted CSV (or tab-separated, in the
case of another file I need to do something similar with). The SPLIT
function is not good enough as the data contains commas within quoted
values.

Phil Hibbs.

This begs me to ask what the first 3 lines contain. Can these be
eliminated so the file starts with the headings? -Perhaps a temp file
that starts at line 4! If so, filtering by heading is a good way to
capture records for dumping to a worksheet in one shot. That way it
shouldn't matter what each line contains because the entire recordset
will be only lines that meet your WHERE clause criteria.

I'm thinking that you could dump the entire file into a variant using
the Split function and specifying vbCrLf as the delimiter. Then empty
the first 3 elements and Filter() the array to get rid of those. Then
dump the array back into a temp file using Join and vbCrLf as the
delimiter. Then run your ADO query.

May seem like a lot of work but the coding is rather simple and
straight forward. The process should be fairly fast, and (I suspect)
much easier to manage than your current parsing process.
 
GS said:
This begs me to ask what the first 3 lines contain. Can these be
eliminated so the file starts with the headings? -Perhaps a temp file
that starts at line 4!

Client name, date and time of report, and a blank line. I guess I
could write a temp file and then use the built-in facilities. I'm
reluctant to do individual ADO queries, though, as I need to split the
data out into 31 different sheets, and I don't want to have to read
through a 70,000 line file 31 times.

Phil.
 
Phil Hibbs pretended :
Client name, date and time of report, and a blank line. I guess I
could write a temp file and then use the built-in facilities. I'm
reluctant to do individual ADO queries, though, as I need to split the
data out into 31 different sheets, and I don't want to have to read
through a 70,000 line file 31 times.

Phil.

Well, the Tab-delimited files would be easy to dump into an array of
arrays. It's the Comma-delimited files that are going to be a problem
if, as you say, some individual values also contain commas. The easiest
way <IMO> to handle these is with loading lines into ADO recordsets,
which you'd only have to process if the recordset is not empty.

As it stands now, you have to evaluate each one of those 70,000 lines
to determine which sheet to put the data on, then redirect code to the
appropriate place to do that. Making 31 recordsets sounds easier and
quicker to me!<g>

Here's some code to restructure the CSVs:

Sub RestructureCSVs(ByVal FileIn As String, _
ByVal LinesToRemove As Long, _
Optional FileOut As String = "tmpCSV.dat")
' Opens a CSV file for editing.
' (Used to remove auxilliary lines before the line containing headings)
' Removes a specified number of lines from the start of the file.
' Dumps the filtered array into a reusable temp file.
' Requires ReadTextFileContents(), WriteTextFileContents()

Dim saLines() As String, i As Long

'Get all lines from the file
saLines() = Split(ReadTextFileContents(FileIn), vbCrLf)

'To quickly delete unwanted/empty lines, load them with vbNullChar
'and use the Filter() function to delete them
If LinesToRemove > 0 Then
For i = 0 To LinesToRemove - 1
saLines(i) = vbNullChar
Next
saLines() = Filter(saLines(), vbNullChar, False)

'Dump the contents into a temp file
FileOut = ThisWorkbook.Path & "\" & FileOut
WriteTextFileContents Join(saLines, vbCrLf), FileOut
End If
End Sub

Function ReadTextFileContents(Filename As String) As String
' Reuseable proc to read large amounts of data from a text file

Dim iNum As Integer, bFileIsOpen As Boolean

On Error GoTo ErrHandler
iNum = FreeFile()
Open Filename For Input As #iNum
bFileIsOpen = True '//if we got here the file opened successfully
ReadTextFileContents = Space$(LOF(iNum)) '//size our return string
'Read the entire contents in one single step
ReadTextFileContents = Input(LOF(iNum), iNum)

ErrHandler:
'Close the file
If bFileIsOpen Then Close #iNum
If Err Then Err.Raise Err.Number, , Err.Description
End Function 'ReadTextFileContents()

Sub WriteTextFileContents(Text As String, _
Filename As String, _
Optional AppendMode As Boolean = False)
' Reuseable proc to write/append large amounts of data to a text file

Dim iNum As Integer, bFileIsOpen As Boolean

On Error GoTo ErrHandler
iNum = FreeFile()
If AppendMode Then
Open Filename For Append As #iNum
Else
Open Filename For Output As #iNum
End If
bFileIsOpen = True '//if we got here the file opened successfully
Print #iNum, Text '//print to the file in one single step


ErrHandler:
'Close the file
If bFileIsOpen Then Close #iNum
If Err Then Err.Raise Err.Number, , Err.Description
End Sub 'WriteTextFileContents()

HTH
 
GS said:
Well, the Tab-delimited files would be easy to dump into an array of
arrays. It's the Comma-delimited files that are going to be a problem
if, as you say, some individual values also contain commas.

You'd think so. However, it gets better, it turns out the tab-
delimited file also contains tabs within quoted fields.

Phil.
--
 
Phil Hibbs used his keyboard to write :
You'd think so. However, it gets better, it turns out the tab-
delimited file also contains tabs within quoted fields.

Phil.

Well, that tells you something about how the files are configured. Now
if you can only get the source to do it right...!<g>

Again, it's looking even more like ADO is the way to go!
 
GS said:
Well, that tells you something about how the files are configured. Now
if you can only get the source to do it right...!<g>

The data IS right. It genuinely contains tabs, in quoted values.
Again, it's looking even more like ADO is the way to go!

I just thought of another reason why I can't use ADO - one of the 31
sheets is an "otherwise" bucket, anything that doesn't match any of
the other 30 sets of criteria gets written to the bucket sheet. That's
going to be one complicated query.

Phil.
--
 
Phil Hibbs wrote on 11/16/2010 :
The data IS right. It genuinely contains tabs, in quoted values.

The data may very well be 'right' as is. I was referring to the
structuring of the CSV as a proper data store. This usually includes
headings on the first line and strict use of unambiguous delimiters
that would not be confused with system delimiters OR punctuation in the
data values. For example, the Pipe "|" or Tilde "~" characters.
I just thought of another reason why I can't use ADO - one of the 31
sheets is an "otherwise" bucket, anything that doesn't match any of
the other 30 sets of criteria gets written to the bucket sheet. That's
going to be one complicated query.

Phil,
I would loop the file for the 'bucket' records first so they're out of
the way. Then I'd grab the other 30 sets with ADO and dump them on
their respective sheets.

Alternatively, you could do the 30 sets first, delete the records after
the dump so all that remains are 'bucket' records (-those that didn't
get grabbed by a query). Since they reside in a temp file deleting
should not be a problem.
 
If the CSV file is properly constructed then you could load it into
Notepad first and do a Find and Replace on it:

Find ","
Replace with "|"

(the quotes are needed), so that you are defining a delimiter
different than the commas which appear within your data. Save the file
with an extension of .txt

Then when you try to open that within Excel you will be taken to the
Data Import Wizard, where you can specify that it is delimited and you
can specify the delimiter | on the second panel. You can also specify
the start row (up to 32k), so that with 2 passes you can effectively
import 96k rows of data into two separate sheets, where you then need
to get rid of the duplicates which appear on rows 32k to 64k.

With it all in Excel, you could then apply your rules within a macro
to split the data into your 30+ sheets.

Hope this helps.

Pete
 
It happens that Pete_UK formulated :
If the CSV file is properly constructed...

And here lies the challenge.
then you could load it into
Notepad first and do a Find and Replace on it:

I suspect the OP wants to automate the entire process.
Find ","
Replace with "|"

(the quotes are needed), so that you are defining a delimiter
different than the commas which appear within your data.

How does that resolve? For example:

Lastname,Firstname,Street,Town_State_Zip,Phone
Jones,Davey,123 Somestreet,Sometown, AnyState 1234,1234567890

...where 'Sometown, AnyState 1234' is a single field of data. Albeit
that's not 'proper' data file construction since state and zip should
be singled out, but it's not uncommon to find.
 
The data IS right. It genuinely contains tabs, in quoted values.


I just thought of another reason why I can't use ADO - one of the 31
sheets is an "otherwise" bucket, anything that doesn't match any of
the other 30 sets of criteria gets written to the bucket sheet. That's
going to be one complicated query.

Phil.
--

Did you ever resolve this problem? If not please post a line form the
csv file that contains csv values and tabs and show how it is supposed
to be parsed, (including something like [tab] so the newsreader doesn't
try to render non-visible tabs).

I was thinking that I'd parse the whole file into arrays, one for each
sheet, then build the sheets when all of the file has been read.
 
Back
Top