Large text file to a CSV file...

  • Thread starter Thread starter Louie Warren
  • Start date Start date
L

Louie Warren

I have a HUGE text file I need to import into a database I
have created. I have imported it into Access and it is
over 400,000 rows! Using space delimiters in the import
worked for a majority of the data, but there was a good
portion that wasn't "aligned" correctly to work like
that. Is there some tool out there now, or has someone
done something with VB or the like that will go through
this thing line by line and parse out each field
correctly? Thanx!
 
Space, comma and Tab are dangerous as delimiters because the data can
contain those characters.
I usually use pipe delimiters | if there is a choice.

Delimited data doesn't need to be aligned so not sure what you mean there.

Perhaps you should post 2 samples - one of some ggod rows and one of some
bad rows.
 
I have a WebsitePro Access Log Combined (W3C/NCSA) Format
file; and yes, I know there is an ODBC routine out there
to do this, but my company doesn't want to spend the money
to get it because we are abandoning WebSite Pro in the
next few months. The data is on a classified system so I
will attempt to recreate the problem with generic data. I
already see problems with the blank delimiters...

An example from the book (string these together with a
space between fields.

172.18.96.157 (IP Address)

gryphonaire.com (Hostname of server
receiving request)

Admin (Authentication user name)

[02/Jan/2001:09:44:56 - 0800] (Date timestamp UTC offset)

"GET /members.html HTTP/1.1" (Complete HTTP request)

200 (HTTP Response Code)

178 (Number of bytes transfered)

"http://gryphonaire.com/index.html" (Complete referring
URL)

"Mozilla/4.0 (compatible; MSIE 4.01; Windows NT)" (ID of
browser)

"Member=WebSiteRules" (Cookie)

The problem in the Access import is if the Authentication
username isn't available. WSP puts in a - This is
common for most of the requests. When an actual ID is
present, it doesn't fit the pattern with the others. The
data is consistantly like the above example (", [, and
all).

Hopefully that explains it.

L
 
Hi Louie,

Well, even after reading your repost I agree with Joe's
initial comment that it isn't clear why the variation in
data has anything to do with the space delimiter. From
the repost it sounds like sometimes the user name (Admin
in the sample) would be a dash instead, but that wouldn't
create a problem as long as a space was still inserted
after. Maybe I misread your post.

It does seem like you may have problems with embedded
spaces if all text fields are not qualified with quotes.

In any case, you can write code to open the text file and
read the contents (line by line) into a variable, work
with the contents of the line and then output it.

I have done this in cases where I received csv files from
third parties that had embedded quotes and commas that
were preventing the data from importing correctly. I
used code to go through line by line to identify problem
lines and replace the embedded quotes and commas.
Basically I first counted the quotes and commas and
checked for extras, if there were extras then it would go
through to identify the positions of the embedded ones
and replace them using some defined logic that worked in
all cases in that particular file.

In order to do this though, you have to be able to
identify some logic that will be able to correct the
problems in all cases. It sounds like you may be able to
do this since your data is fairly structured.

If you are interested, the code would look something like
the following (without the dim statements, file path
variable assignments and error handler). strFileName
(In/Out) would contain the path and file name to the
appropriate file.

intOutputHandle = FreeFile
Open FileNameOut For Output As #intOutputHandle

intInputHandle = FreeFile
Open strFileNameIn For Input As #intInputHandle

Do While Not EOF(intInputHandle)
Line Input #intInputHandle, strBuffer
(put your code here to analyze/manipulate strBuffer)
Print #intOutputHandle, strBuffer
Loop

Close #intInputHandle
Close #intOutputHandle

And, actually on mine I also output some statistics to a
second output file (such as line number, comma count, and
quote count). That way I could see which lines had been
manipulated and check some of them to see if the code
appeared to work correctly.

HTH, Ted Allen

-----Original Message-----
I have a WebsitePro Access Log Combined (W3C/NCSA) Format
file; and yes, I know there is an ODBC routine out there
to do this, but my company doesn't want to spend the money
to get it because we are abandoning WebSite Pro in the
next few months. The data is on a classified system so I
will attempt to recreate the problem with generic data. I
already see problems with the blank delimiters...

An example from the book (string these together with a
space between fields.

172.18.96.157 (IP Address)

gryphonaire.com (Hostname of server
receiving request)

Admin (Authentication user name)

[02/Jan/2001:09:44:56 - 0800] (Date timestamp UTC offset)

"GET /members.html HTTP/1.1" (Complete HTTP request)

200 (HTTP Response Code)

178 (Number of bytes transfered)

"http://gryphonaire.com/index.html" (Complete referring
URL)

"Mozilla/4.0 (compatible; MSIE 4.01; Windows NT)" (ID of
browser)

"Member=WebSiteRules" (Cookie)

The problem in the Access import is if the Authentication
username isn't available. WSP puts in a - This is
common for most of the requests. When an actual ID is
present, it doesn't fit the pattern with the others. The
data is consistantly like the above example (", [, and
all).

Hopefully that explains it.

L

.
 
Hi Louie,

Well, even after reading your repost I agree with Joe's
initial comment that it isn't clear why the variation in
data has anything to do with the space delimiter. From
the repost it sounds like sometimes the user name (Admin
in the sample) would be a dash instead, but that wouldn't
create a problem as long as a space was still inserted
after. Maybe I misread your post.

It does seem like you may have problems with embedded
spaces if all text fields are not qualified with quotes.

In any case, you can write code to open the text file and
read the contents (line by line) into a variable, work
with the contents of the line and then output it.

I have done this in cases where I received csv files from
third parties that had embedded quotes and commas that
were preventing the data from importing correctly. I
used code to go through line by line to identify problem
lines and replace the embedded quotes and commas.
Basically I first counted the quotes and commas and
checked for extras, if there were extras then it would go
through to identify the positions of the embedded ones
and replace them using some defined logic that worked in
all cases in that particular file.

In order to do this though, you have to be able to
identify some logic that will be able to correct the
problems in all cases. It sounds like you may be able to
do this since your data is fairly structured.

If you are interested, the code would look something like
the following (without the dim statements, file path
variable assignments and error handler). strFileName
(In/Out) would contain the path and file name to the
appropriate file.

intOutputHandle = FreeFile
Open FileNameOut For Output As #intOutputHandle

intInputHandle = FreeFile
Open strFileNameIn For Input As #intInputHandle

Do While Not EOF(intInputHandle)
Line Input #intInputHandle, strBuffer
(put your code here to analyze/manipulate strBuffer)
Print #intOutputHandle, strBuffer
Loop

Close #intInputHandle
Close #intOutputHandle

And, actually on mine I also output some statistics to a
second output file (such as line number, comma count, and
quote count). That way I could see which lines had been
manipulated and check some of them to see if the code
appeared to work correctly.

HTH, Ted Allen

-----Original Message-----
I have a WebsitePro Access Log Combined (W3C/NCSA) Format
file; and yes, I know there is an ODBC routine out there
to do this, but my company doesn't want to spend the money
to get it because we are abandoning WebSite Pro in the
next few months. The data is on a classified system so I
will attempt to recreate the problem with generic data. I
already see problems with the blank delimiters...

An example from the book (string these together with a
space between fields.

172.18.96.157 (IP Address)

gryphonaire.com (Hostname of server
receiving request)

Admin (Authentication user name)

[02/Jan/2001:09:44:56 - 0800] (Date timestamp UTC offset)

"GET /members.html HTTP/1.1" (Complete HTTP request)

200 (HTTP Response Code)

178 (Number of bytes transfered)

"http://gryphonaire.com/index.html" (Complete referring
URL)

"Mozilla/4.0 (compatible; MSIE 4.01; Windows NT)" (ID of
browser)

"Member=WebSiteRules" (Cookie)

The problem in the Access import is if the Authentication
username isn't available. WSP puts in a - This is
common for most of the requests. When an actual ID is
present, it doesn't fit the pattern with the others. The
data is consistantly like the above example (", [, and
all).

Hopefully that explains it.

L

.
 
Hi Louie,

Access's text import routine has problems with files like this, because
the space is not just being used as a delimiter, it also appears in the
data without being "qualified" by enclosing the entire field in quote
marks.

It looks as if the problematic field is the datestamp. Using "-" for an
unknown username shouldn't be a problem because it doesn't introduce any
extra spaces, though if WebSite Pro uses " - " it would.

One approach would be to pre-process the file by simply replacing the [
and ] around the datestamp with " and ". That should convert it into a
regular space-delimited file, with fields that may contain a space
qualified with " " . It would then be possible to import it to a
temporary Access table, where you could run a query to parse the
datestamp field and convert it to an Access date time value as you
append it to your "permanent" table.

The pre-processing could be done in a text editor, or if it's a regular
task you could work up a VBA routine using Line Input # to read the file
line by line, the Replace() function to change the line, and Print # to
write it to a new file. Then import the new file.

Another approach would be to use a regular expression engine to parse
the input line into fields and append these directly to your table.


I have a WebsitePro Access Log Combined (W3C/NCSA) Format
file; and yes, I know there is an ODBC routine out there
to do this, but my company doesn't want to spend the money
to get it because we are abandoning WebSite Pro in the
next few months. The data is on a classified system so I
will attempt to recreate the problem with generic data. I
already see problems with the blank delimiters...

An example from the book (string these together with a
space between fields.

172.18.96.157 (IP Address)

gryphonaire.com (Hostname of server
receiving request)

Admin (Authentication user name)

[02/Jan/2001:09:44:56 - 0800] (Date timestamp UTC offset)

"GET /members.html HTTP/1.1" (Complete HTTP request)

200 (HTTP Response Code)

178 (Number of bytes transfered)

"http://gryphonaire.com/index.html" (Complete referring
URL)

"Mozilla/4.0 (compatible; MSIE 4.01; Windows NT)" (ID of
browser)

"Member=WebSiteRules" (Cookie)

The problem in the Access import is if the Authentication
username isn't available. WSP puts in a - This is
common for most of the requests. When an actual ID is
present, it doesn't fit the pattern with the others. The
data is consistantly like the above example (", [, and
all).

Hopefully that explains it.

L
 
Yes, I now know I wasn't that clear. Sorry. My problem
was not with being able to code something, but with being
able to use the Access text import function.
-----Original Message-----
Hi Louie,

Well, even after reading your repost I agree with Joe's
initial comment that it isn't clear why the variation in
data has anything to do with the space delimiter. From
the repost it sounds like sometimes the user name (Admin
in the sample) would be a dash instead, but that wouldn't
create a problem as long as a space was still inserted
after. Maybe I misread your post.

It does seem like you may have problems with embedded
spaces if all text fields are not qualified with quotes.

In any case, you can write code to open the text file and
read the contents (line by line) into a variable, work
with the contents of the line and then output it.

I have done this in cases where I received csv files from
third parties that had embedded quotes and commas that
were preventing the data from importing correctly. I
used code to go through line by line to identify problem
lines and replace the embedded quotes and commas.
Basically I first counted the quotes and commas and
checked for extras, if there were extras then it would go
through to identify the positions of the embedded ones
and replace them using some defined logic that worked in
all cases in that particular file.

In order to do this though, you have to be able to
identify some logic that will be able to correct the
problems in all cases. It sounds like you may be able to
do this since your data is fairly structured.

If you are interested, the code would look something like
the following (without the dim statements, file path
variable assignments and error handler). strFileName
(In/Out) would contain the path and file name to the
appropriate file.

intOutputHandle = FreeFile
Open FileNameOut For Output As #intOutputHandle

intInputHandle = FreeFile
Open strFileNameIn For Input As #intInputHandle

Do While Not EOF(intInputHandle)
Line Input #intInputHandle, strBuffer
(put your code here to analyze/manipulate strBuffer)
Print #intOutputHandle, strBuffer
Loop

Close #intInputHandle
Close #intOutputHandle

And, actually on mine I also output some statistics to a
second output file (such as line number, comma count, and
quote count). That way I could see which lines had been
manipulated and check some of them to see if the code
appeared to work correctly.

HTH, Ted Allen

-----Original Message-----
I have a WebsitePro Access Log Combined (W3C/NCSA) Format
file; and yes, I know there is an ODBC routine out there
to do this, but my company doesn't want to spend the money
to get it because we are abandoning WebSite Pro in the
next few months. The data is on a classified system so I
will attempt to recreate the problem with generic data. I
already see problems with the blank delimiters...

An example from the book (string these together with a
space between fields.

172.18.96.157 (IP Address)

gryphonaire.com (Hostname of server
receiving request)

Admin (Authentication user name)

[02/Jan/2001:09:44:56 - 0800] (Date timestamp UTC offset)

"GET /members.html HTTP/1.1" (Complete HTTP request)

200 (HTTP Response Code)

178 (Number of bytes transfered)

"http://gryphonaire.com/index.html" (Complete referring
URL)

"Mozilla/4.0 (compatible; MSIE 4.01; Windows NT)" (ID of
browser)

"Member=WebSiteRules" (Cookie)

The problem in the Access import is if the Authentication
username isn't available. WSP puts in a - This is
common for most of the requests. When an actual ID is
present, it doesn't fit the pattern with the others. The
data is consistantly like the above example (", [, and
all).

Hopefully that explains it.

L

.
.
 
There's the rub... I would have to write a preprocessor
routine because a 3,094,112KB file would be difficult (to
say the least) to edit. WSP does imbed the - when there
is no value. I may be finding that this is more work than
I need to go through.

-----Original Message-----
Hi Louie,

Access's text import routine has problems with files like this, because
the space is not just being used as a delimiter, it also appears in the
data without being "qualified" by enclosing the entire field in quote
marks.

It looks as if the problematic field is the datestamp. Using "-" for an
unknown username shouldn't be a problem because it doesn't introduce any
extra spaces, though if WebSite Pro uses " - " it would.

One approach would be to pre-process the file by simply replacing the [
and ] around the datestamp with " and ". That should convert it into a
regular space-delimited file, with fields that may contain a space
qualified with " " . It would then be possible to import it to a
temporary Access table, where you could run a query to parse the
datestamp field and convert it to an Access date time value as you
append it to your "permanent" table.

The pre-processing could be done in a text editor, or if it's a regular
task you could work up a VBA routine using Line Input # to read the file
line by line, the Replace() function to change the line, and Print # to
write it to a new file. Then import the new file.

Another approach would be to use a regular expression engine to parse
the input line into fields and append these directly to your table.


I have a WebsitePro Access Log Combined (W3C/NCSA) Format
file; and yes, I know there is an ODBC routine out there
to do this, but my company doesn't want to spend the money
to get it because we are abandoning WebSite Pro in the
next few months. The data is on a classified system so I
will attempt to recreate the problem with generic data. I
already see problems with the blank delimiters...

An example from the book (string these together with a
space between fields.

172.18.96.157 (IP Address)

gryphonaire.com (Hostname of server
receiving request)

Admin (Authentication user name)

[02/Jan/2001:09:44:56 - 0800] (Date timestamp UTC offset)

"GET /members.html HTTP/1.1" (Complete HTTP request)

200 (HTTP Response Code)

178 (Number of bytes transfered)

"http://gryphonaire.com/index.html" (Complete referring
URL)

"Mozilla/4.0 (compatible; MSIE 4.01; Windows NT)" (ID of
browser)

"Member=WebSiteRules" (Cookie)

The problem in the Access import is if the Authentication
username isn't available. WSP puts in a - This is
common for most of the requests. When an actual ID is
present, it doesn't fit the pattern with the others. The
data is consistantly like the above example (", [, and
all).

Hopefully that explains it.

L

--
John Nurick [Microsoft Access MVP]

Please respond in the newgroup and not by email.
.
 
Hi Louie,

Well, you may be able to get by without having to write
code to process the file if you can experiment with using
search and replace in a text editor that can handle the
file. As John mentioned, just replacing the []'s around
the dates with ""'s may be enough. The key is that any
fields that include spaces in the field values need to be
qualified with ""'s around them so that Access will not
interpret the spaces as field delimiters. When you are
importing the file, it is also important to specify that
the ""'s are the text qualifiers.

One other thing you could try is to import to Excel first
and save as an Excel file. I have noticed that Excel is
sometimes better (or at least different) in the way it
handles csv files. But, this may not be an option for
you if your file has more records than Excel can handle.

Other than that I think your only options that I know of
would be to try to find some other program that can
import the file, or write some VBA code to make whatever
changes are necessary in the text file prior to
importing. This really isn't much work for those
familiar with VBA, but there would obviously be a
learning curve if you aren't familiar. There are many in
this user group though that are happy to help with code
if you can define what you want it to do.

Hope that helps.

-Ted Allen
-----Original Message-----
There's the rub... I would have to write a preprocessor
routine because a 3,094,112KB file would be difficult (to
say the least) to edit. WSP does imbed the - when there
is no value. I may be finding that this is more work than
I need to go through.

-----Original Message-----
Hi Louie,

Access's text import routine has problems with files
like
this, because
the space is not just being used as a delimiter, it
also
appears in the
data without being "qualified" by enclosing the entire field in quote
marks.

It looks as if the problematic field is the datestamp. Using "-" for an
unknown username shouldn't be a problem because it doesn't introduce any
extra spaces, though if WebSite Pro uses " - " it would.

One approach would be to pre-process the file by simply replacing the [
and ] around the datestamp with " and ". That should convert it into a
regular space-delimited file, with fields that may contain a space
qualified with " " . It would then be possible to
import
it to a
temporary Access table, where you could run a query to parse the
datestamp field and convert it to an Access date time value as you
append it to your "permanent" table.

The pre-processing could be done in a text editor, or
if
it's a regular
task you could work up a VBA routine using Line Input # to read the file
line by line, the Replace() function to change the
line,
and Print # to
write it to a new file. Then import the new file.

Another approach would be to use a regular expression engine to parse
the input line into fields and append these directly to your table.
so
data.
I
already see problems with the blank delimiters...

An example from the book (string these together with a
space between fields.

172.18.96.157 (IP Address)

gryphonaire.com (Hostname of server
receiving request)

Admin (Authentication user name)

[02/Jan/2001:09:44:56 - 0800] (Date timestamp UTC offset)

"GET /members.html HTTP/1.1" (Complete HTTP request)

200 (HTTP Response Code)

178 (Number of bytes transfered)

"http://gryphonaire.com/index.html" (Complete referring
URL)

"Mozilla/4.0 (compatible; MSIE 4.01; Windows NT)" (ID of
browser)

"Member=WebSiteRules" (Cookie)

The problem in the Access import is if the Authentication
username isn't available. WSP puts in a - This is
common for most of the requests. When an actual ID is
present, it doesn't fit the pattern with the others. The
data is consistantly like the above example (", [, and
all).

Hopefully that explains it.

L

--
John Nurick [Microsoft Access MVP]

Please respond in the newgroup and not by email.
.
.
 
Hi Louie,

Sorry for the slow response; I've been away from home.

If you've really got a 3,094,112 kB (about 3 GB) CSV file you're not
going to be able to import it into Access anyway: the maximum possible
size of an MDB file is 2GB, and that's also the limit on the size of an
MSDE database.

But it's easy for programs to modify huge textfiles. The trick is simply
to read a line at a time, process it, and then write the processed line
to a new file. This VBScript script converts brackets to quote marks
(and can be modified do make other changes):

Option Explicit
'VBScript template for processing textfiles line by line
Const BAK_EXT = ".bak"
Dim fso 'FileSystemObject
Dim fF 'File
Dim fIn, fOut 'Textstreams
Dim strL 'String


Function ProcessedLine(ByVal strLine)
'This does all the work.
'Put code here that returns a string to
'write to the output file
ProcessedLine = Replace(Replace(strLine, "[", """"), "]", """")
End Function


'Main body of script iterates through the file
'applying the ProcessLine function to each line
If WScript.Arguments.Count = 1 Then
Set fso = CreateObject("Scripting.FileSystemObject")
Set fF = fso.GetFile(WScript.Arguments(0))

'If backup file aready exists, delete it
If fso.FileExists(fF.Path & BAK_EXT) Then
fso.DeleteFile fF.Path & BAK_EXT
End If
fF.Name = fF.Name & BAK_EXT
Set fIn = fF.OpenAsTextStream(1) 'ForReading
Set fOut = fso.CreateTextFile(WScript.Arguments(0))

Do 'Read line by line
strL = fIn.ReadLine
strL = ProcessedLine(strL)
fOut.Write strL & vbCRLF
Loop Until fIn.AtEndOfStream

'Tidy up
fIn.Close
fOut.Close
Else
MsgBox "Drag a file onto the icon to process it. " _
& vbcrlf & "Original file will be renamed with " _
& ".bak extension"
End If



There's the rub... I would have to write a preprocessor
routine because a 3,094,112KB file would be difficult (to
say the least) to edit. WSP does imbed the - when there
is no value. I may be finding that this is more work than
I need to go through.

-----Original Message-----
Hi Louie,

Access's text import routine has problems with files like this, because
the space is not just being used as a delimiter, it also appears in the
data without being "qualified" by enclosing the entire field in quote
marks.

It looks as if the problematic field is the datestamp. Using "-" for an
unknown username shouldn't be a problem because it doesn't introduce any
extra spaces, though if WebSite Pro uses " - " it would.

One approach would be to pre-process the file by simply replacing the [
and ] around the datestamp with " and ". That should convert it into a
regular space-delimited file, with fields that may contain a space
qualified with " " . It would then be possible to import it to a
temporary Access table, where you could run a query to parse the
datestamp field and convert it to an Access date time value as you
append it to your "permanent" table.

The pre-processing could be done in a text editor, or if it's a regular
task you could work up a VBA routine using Line Input # to read the file
line by line, the Replace() function to change the line, and Print # to
write it to a new file. Then import the new file.

Another approach would be to use a regular expression engine to parse
the input line into fields and append these directly to your table.


I have a WebsitePro Access Log Combined (W3C/NCSA) Format
file; and yes, I know there is an ODBC routine out there
to do this, but my company doesn't want to spend the money
to get it because we are abandoning WebSite Pro in the
next few months. The data is on a classified system so I
will attempt to recreate the problem with generic data. I
already see problems with the blank delimiters...

An example from the book (string these together with a
space between fields.

172.18.96.157 (IP Address)

gryphonaire.com (Hostname of server
receiving request)

Admin (Authentication user name)

[02/Jan/2001:09:44:56 - 0800] (Date timestamp UTC offset)

"GET /members.html HTTP/1.1" (Complete HTTP request)

200 (HTTP Response Code)

178 (Number of bytes transfered)

"http://gryphonaire.com/index.html" (Complete referring
URL)

"Mozilla/4.0 (compatible; MSIE 4.01; Windows NT)" (ID of
browser)

"Member=WebSiteRules" (Cookie)

The problem in the Access import is if the Authentication
username isn't available. WSP puts in a - This is
common for most of the requests. When an actual ID is
present, it doesn't fit the pattern with the others. The
data is consistantly like the above example (", [, and
all).

Hopefully that explains it.

L

--
John Nurick [Microsoft Access MVP]

Please respond in the newgroup and not by email.
.
 
Back
Top