Multiple tab-delimited files without knowing structure


Dave

Ok, I have a very special case that I'm trying to work with here. I
can think of ways to accomplish it, but none are quick, and none seem
remotely like a best practice for this particular task.

My goal is to have a DataTable with a set of various types of data,
some of which are calculated (read: heavily massaged from static
data), and some of which come from a set of tab delimited files. Now,
I have a few issues that prevent this from being a normal one line
import job.

1. I do not know the format. Columns change and disappear depending on
configuration on that end. Obviously, XML storage would make much more
sense, but this comes from back in VB6 days, and while we hope to
rewrite it, that's just not going to happen soon. Thus, I know it's
tab delimited, my users will input which columns represent what I
need, and each file does have a header row.

2. Multiple files, each with a header. To make it worse, the file
format MIGHT change from file to file, such that file1 might have 108
columns and file2 may have 342 columns. However, to make it easy, both
are guaranteed by design to have at least what I need. What this
means, though, is that I can't just make a huge datatable with
everything and then get what I need from it. I must only select out
the columns I know are there and should be needed. Furthermore, all
selected files need to be in a single DataTable.

3. I somewhat mentioned this in #2, but I only need to pull out
columns that I need. Some of these files are hundreds of MB in size,
and if I'm processing several at once, well, you get the picture. I
usually only need maybe 5-15MB of that anyway.

I've already made the assumption that I first need to get the main
table out, and then form secondary datatable(s) by looping through
that data (some fairly complex massaging going on). I can do that part
fine; the part I'm having difficulty with is actually getting these
files into a single trimmed-down DataTable.

Also, I'd prefer to avoid the stupid schema.ini file I've seen in
similar solutions around the web. Not sure if avoiding it is possible;
if not, it seems like an incredibly dumb requirement unless I'm missing
something, but I'll deal with it. I guess it's still better than COM.

Appreciate any help!

Dave
 
Dave,

The DataTable is built around the limitations of current SQL servers (not
only the one from Microsoft).

In my view, those limitations are exactly the opposite of what you want.



Cor
 
I would investigate DTS or SqlBulkCopy. Once you have determined the
schema, execute a script that builds the working table in SQL Server.
SqlBulkCopy can take it from there--importing directly from the CSV file.
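
Very roughly, the bulk copy step would look like this (sketch only; the
connection string, staging table name, and source DataTable are placeholders,
and the staging table is assumed to already match the DataTable's columns):

using System.Data;
using System.Data.SqlClient;

// Sketch: push a DataTable already filled from the text file into a staging table
string connectionString = "Data Source=.;Initial Catalog=Work;Integrated Security=True";
DataTable sourceTable = new DataTable();   // assume this was filled from the file

using (SqlConnection conn = new SqlConnection(connectionString))
{
    conn.Open();
    using (SqlBulkCopy bulk = new SqlBulkCopy(conn))
    {
        bulk.DestinationTableName = "ImportStage";
        bulk.WriteToServer(sourceTable);
    }
}
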
--
____________________________________
William (Bill) Vaughn
Author, Mentor, Consultant
Microsoft MVP
INETA Speaker
www.betav.com/blog/billva
www.betav.com
Please reply only to the newsgroup so that others can benefit.
This posting is provided "AS IS" with no warranties, and confers no rights.
__________________________________
Visit www.hitchhikerguides.net to get more information on my latest book:
Hitchhiker's Guide to Visual Studio and SQL Server (7th Edition)
and Hitchhiker's Guide to SQL Server 2005 Compact Edition (EBook)
 
¤ Also, I'd prefer to avoid the stupid schema.ini file I've seen in
¤ similar solutions around the web. Not sure if avoiding it is possible;
¤ if not, it seems like an incredibly dumb requirement unless I'm missing
¤ something, but I'll deal with it. I guess it's still better than COM.

If you are reading from a tab-delimited file, a schema.ini file is required in order to define the
file format. The only alternative is to make a registry entry change which defines the default text
file format type; the current default is CSV.

If you were to place all of the files in a single DataTable then a UNION query would be required. I don't
think that this would work if the files are in different formats.

This is looking more like a file I/O solution than something you can do using the ADO.NET libraries.


Paul
~~~~
Microsoft MVP (Visual Basic)
 
I'll try this again. Do you know the names of the fields you are interested
in and are they always available in the header row? Also, is the header row
easy to parse with the string.Split method? If the answer is yes to both of
these, I'll post my CSVLine class that handles anything Excel can handle,
but it doesn't use the ADO.NET or Excel libraries.

Mike Ober.
 
I'll try this again. Do you know the names of the fields you are interested
in and are they always available in the header row? Also, is the header row
easy to parse with the string.Split method? If the answer is yes to both of
these, I'll post my CSVLine class that handles anything Excel can handle,
but it doesn't use the ADO.NET or Excel libraries.

Yes to both. I don't know the names of the fields up front, but they
are defined by the user and we can assume they are always available
(users are trained in this case). The holdup is simply that in addition
to the 5-6 columns I'm actually interested in, there are anywhere from
100-300 columns I'm not, and they all come in random order.

A few questions about your class (and I'd greatly appreciate the help!):
1. Do you know how it performs with hundreds of thousands of records in
a file?
2. Does it import multiple files, or how would I go about doing that?
3. What is the final data structure it returns? It doesn't have to be
a DataTable per se; I was hoping that ADO had advanced to allow
something like what I'm trying to do (i.e. why can't I simply call a
union query on two files, each with a select statement built with the
columns I want?).

Appreciate your help.
 
I would investigate DTS or SqlBulkCopy. Once you have determined the
schema, execute a script that builds the working table in SQL Server.
SqlBulkCopy can take it from there--importing directly from the CSV file.

The problem is I don't have access to SQL Server, or even Access. This
is a pure desktop app. I was mainly going off all the hype of ADO,
when MS evangelists promised "It doesn't matter what your data is;
it's all a database to .NET!" If I had SQL Server access, I would have
used DTS for this and run queries to get the data I need... Much
easier and faster.

Now, if somehow I can use an objectified version of those within
a .NET application without SQL Server installed, please let me know,
as that might be the cleanest solution.

I appreciate your help!
 
If you are reading from a tab-delimited file, a schema.ini file is required in order to define the
file format. The only alternative is to make a registry entry change which defines the default text
file format type; the current default is CSV.

Hmm, it really makes you wonder why you can't give that configuration
within the DataAdapter used to get the data. Seriously, this would
take me 5 lines in ruby, if that. Unfortunately, ruby doesn't have
WinForms. :)
If you were to place all of the files in a single DataTable then a UNION query would be required. I don't
think that this would work if the files are in different formats.

That's what I was worried about. Is there any way to select out the
columns I need from both into some intermediate set of structures, and
then use some sort of union query on those objects to get a single
DataTable? I'm guessing no, since you can't even delete columns from a
DataTable.
This is looking more like a file I/O solution than something you can do using the ADO.NET libraries.

Trust me, they now are fully aware of why it should have been XML or
database storage from day 1.

Appreciate the tips...
 
As an alternative, take a look at page 218 of "Mastering Regular
Expressions" published by O'Reilly for parsing CSV files.

Here's one I used for Progress output that processed over 2 million rows in
less than 1 minute!


// Prepare regex field matcher
this.fieldRegex = new Regex(
    @"(?:^|\s*) (?# Beginning of field or whitespace) " +
    "(?: " +
    " (?# Double quoted field) " +
    " \" (?# Opening quote) " +
    " ( (?> [^\"]+ | \"\")* ) (?# Not a quote or a double quote) " +
    " \" (?# Closing quote) " +
    " | (yes) | (no) (?# boolean ) " +
    " | ([?]) (?# null) " +
    " | (11/11/1111) (?# bogus init date -> 01/01/1901) " +
    " | ([01][0-9]/[0-3][0-9]/[0-9][0-9][0-9][0-9]?) (?# date) " +
    " | ([01][0-9]/[0-3][0-9]/[0-9][0-9]) (?# date) " +
    " | (?# Remaining possible field pattern) " +
    " ( [^\" ]* ) " +
    ")",
    RegexOptions.IgnoreCase | RegexOptions.IgnorePatternWhitespace);
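
Applying it to a line goes roughly like this (sketch only; the method and
variable names are illustrative, not from the original code, and it assumes
the usual System.Collections.Generic and System.Text.RegularExpressions
namespaces are imported):

// Sketch: pull the fields out of one input line using the matcher above
List<string> ParseFields(string line)
{
    List<string> fields = new List<string>();
    foreach (Match m in this.fieldRegex.Matches(line))
    {
        // The first capturing group that succeeded holds this field's text
        for (int g = 1; g < m.Groups.Count; g++)
        {
            if (m.Groups[g].Success)
            {
                fields.Add(m.Groups[g].Value);
                break;
            }
        }
    }
    return fields;
}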
 
Dave,

Here's the entire class. It is based on a VB collection and you use it by
looping through your source file one line at a time and processing each
line.

Imports System.Diagnostics
Imports System.IO

Sub Main()
    ' Open the source file (path passed on the command line) and read the header row
    Dim fIn As New StreamReader(Command())
    Dim headers() As String = Split(fIn.ReadLine(), vbTab)

    Do While fIn.Peek() >= 0
        Dim line As String = fIn.ReadLine()
        Dim parsed_Line As New csvLine(line, headers, vbTab)
        ' parsed_Line now allows keyed access to individual fields
        ' (it's implemented around a VB collection)

        ' You can even create a new CSVLine object to put the desired fields
        ' into and then write it to a new file
        Dim pOut As New csvLine()
        pOut.Add(parsed_Line("DesiredField1"))
        pOut.Add(parsed_Line("DesiredField2"))
        pOut.Add(parsed_Line("DesiredField3"))
        pOut.Add(parsed_Line("DesiredField4"))
        Debug.Print(pOut.ToString())
    Loop
    fIn.Close()
End Sub

============== Here's the class itself

Option Compare Text
Option Explicit On
Option Strict On

Public Class csvLine
Dim cRecs As New Collection

Public Sub New()
End Sub
Public Sub New(ByVal Line As String, ByVal Keys() As String, Optional ByVal delim As String = ",")
Dim temp As String
Dim tKey As String
Dim i As Integer
Dim InQuotes As Boolean
Dim c As String = ""
Dim j As Integer

For i = LBound(Keys) To UBound(Keys)
InQuotes = False
temp = ""

If Len(Line) > 0 Then
c = Left$(Line, 1)
Do While Len(Line) > 0
Line = Mid$(Line, 2)
Select Case c
Case """"
InQuotes = Not InQuotes
Case delim
If Not InQuotes Then
c = ""
Exit Do
End If
End Select
temp = temp & c
c = Left$(Line, 1)
Loop
End If

' Append final character
temp = temp & c

' Remove leading and trailing Quotes
Select Case Len(temp)
Case 0
Case 1
If temp = """" Then temp = ""
If temp = delim Then temp = ""
Case Else
If Left$(temp, 1) = """" And Right$(temp, 1) = """" Then
temp = Mid$(temp, 2, Len(temp) - 2)
End Select

' Replace Double Quotes from string with Single Quotes
j = 1
Do While Len(temp) > 0 And j < Len(temp) And j > 0
j = InStr(j, temp, """""")
If j > 0 Then
temp = Left$(temp, j - 1) & Mid$(temp, j + 1)
End If
Loop

' Associate value with column name
tKey = Keys(i)
j = 0
Do While cRecs.Contains(tKey)
j = j + 1
tKey = Keys(i) & "_" & j
Loop
cRecs.Add(temp, tKey)
Next i
End Sub

Public Sub Add(ByVal obj As Object, ByVal Key As String)
cRecs.Add(obj, Key)
End Sub

Public Sub Add(ByVal obj As Object)
cRecs.Add(obj)
End Sub

Default Public ReadOnly Property Item(ByVal index As String) As String
Get
If cRecs.Contains(index) Then Return cRecs(index).ToString
'Debug.Assert(False, "Unknown index: " & index)
Return Nothing
End Get
End Property

Public Shadows Function ToString(Optional ByVal Delim As String = ",") As String
Dim i As Integer
Dim sOut As String = ""
For i = 1 To cRecs.Count - 1
If IsNumeric(cRecs(i)) Then
sOut = sOut & Trim(cRecs(i).ToString) & Delim
Else
sOut = sOut & """" & cRecs(i).ToString & """" & Delim
End If
Next i
If IsNumeric(cRecs(i)) Then
sOut = sOut & Trim(Str(cRecs(i)))
Else
sOut = sOut & """" & cRecs(i).ToString & """"
End If
Return sOut
End Function
End Class

If this class looks suspiciously like VB6, that's because it was originally
written in VB6 and ported to VB 2005. Performance has never been an issue, and
I have used it on multi-megabyte files with no problems.

Mike.
 
Here's what I would do. This is the Brute Force method. :-) I am assuming
that there's no row-matching between the files; you just need to load them
in sequentially, but only the columns you want.

Read in the first file.

Parse the headings of the first line looking for the columns you need and
keep a list of the indices for those columns. Frankly, I would put these in
a generic list because then you don't have to know how many there are. If
you know how many there are, you could use an array.

Create a datatable with those columns in it.

Read through the rest of the file and pull out the data from those columns
and store it in your datatable. Since you have the column numbers in a
list or array, you can do this with a couple of nested loops.

Do the same for the other files, *except* do not create a new datatable.
Just add any additional columns that you need that weren't in the previous
files.

If you need code samples, post again. I don't know how much info you need,
so I'm just offering a way to achieve your objective.
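
For the sake of illustration, a rough (untested) sketch of that approach
might look like this; the column names, file paths, and tab delimiter are
assumptions, and the wanted columns are assumed to exist in every file:

using System;
using System.Collections.Generic;
using System.Data;
using System.IO;

// Columns the user asked for (placeholders) and the files to load
string[] wantedColumns = { "TimeStamp", "AcquisitionNumber", "Speed (cm/s)" };
string[] files = { @"E:\data\file1.tab", @"E:\data\file2.tab" };

DataTable result = new DataTable();
foreach (string col in wantedColumns)
    result.Columns.Add(col, typeof(string));

foreach (string path in files)
{
    using (StreamReader reader = new StreamReader(path))
    {
        // Map each wanted column name to its index in this file's header row
        string[] header = reader.ReadLine().Split('\t');
        List<int> indices = new List<int>();
        foreach (string col in wantedColumns)
            indices.Add(Array.IndexOf(header, col));

        // Copy only the wanted columns from every data row
        string line;
        while ((line = reader.ReadLine()) != null)
        {
            string[] fields = line.Split('\t');
            DataRow row = result.NewRow();
            for (int i = 0; i < wantedColumns.Length; i++)
                row[i] = fields[indices[i]];
            result.Rows.Add(row);
        }
    }
}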

Robin S.
Ts'i mahnu uterna ot twan ot geifur hingts uto.
-----------------------------------------------
 
Ok, I have a very special case that I'm trying to work with here. I
can think of ways to accomplish it, but none are quick, and none seem
remotely like a best practice for this particular task.

I think I've found the best way by adapting the work of several other
people and several days of Google searching. Ironically, I almost
figured it out this way long before I posted here, but it simply has
too many quirks that have to be handled before it works correctly. So
it's simple, just with many strings attached. Part of the power here is
in the new DataTable.Merge() method in framework 2.0. Basically, I use
ODBC and a generated SELECT statement to pull only the columns I want
from each file into its own DataTable.

using System.Data;
using System.Data.Odbc;

// Do this for each file to import; replace filename.tab with each file's name.
String sqlStatement =
    "SELECT TimeStamp, AcquisitionNumber, \"Speed (cm/s)\" FROM filename.tab";
OdbcConnection conn = new OdbcConnection(
    "Driver={Microsoft Text Driver (*.txt; *.csv)};" +
    "dbq=E:\\data;defaultdir=E:\\data;driverid=27;fil=text;");
conn.Open();
OdbcDataAdapter da = new OdbcDataAdapter(sqlStatement, conn);
DataTable dt = new DataTable();
da.Fill(dt);
conn.Close();

Then, I simply append each datatable to the first one I pulled out
using dt.Merge(). This gets all the data in one big table.
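
For example (table names here are illustrative, not from the code above):

// masterTable holds the first file's rows; dt was just filled from the next file
masterTable.Merge(dt);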

One caveat was needing the schema.ini file available. To accomplish
that, I simply create the file at run time and destroy it when the
process is over. It must reside in the same folder as the files you're
trying to open.

My schema.ini example:
[filename.tab]
ColNameHeader=True
Format=TabDelimited
CharacterSet=ANSI
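
Creating and deleting it at run time is just ordinary file I/O; a rough
sketch (the folder path and file name are placeholders):

using System.IO;

// Write a temporary schema.ini next to the files, then remove it when done
string schemaPath = Path.Combine(@"E:\data", "schema.ini");
File.WriteAllText(schemaPath,
    "[filename.tab]\r\n" +
    "ColNameHeader=True\r\n" +
    "Format=TabDelimited\r\n" +
    "CharacterSet=ANSI\r\n");
try
{
    // ... open the ODBC connection and fill the DataTable here ...
}
finally
{
    File.Delete(schemaPath);
}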

Basically, this seems to work OK thus far. Since I know the columns I
want at run time and am guaranteed they will be there, it holds up. I've
not experimented much, but I'm fairly sure you can also do this with a
typed DataSet for some real power. I simply didn't know I had the option
to specify the columns I wanted to select out.

Also, if it is just two files, a simple UNION ALL in-between each
select statement gets it all at once.
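
For example (file names are placeholders; same column list as above):

String sqlStatement =
    "SELECT TimeStamp, AcquisitionNumber, \"Speed (cm/s)\" FROM file1.tab " +
    "UNION ALL " +
    "SELECT TimeStamp, AcquisitionNumber, \"Speed (cm/s)\" FROM file2.tab";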

Best part of all, our old Excel macro method was using up nearly a GB
of memory with a few files. By pulling in only the columns we need, I'm
only using a few MB at any time. Hopefully this will help someone else
dealing with a bunch of legacy-formatted unstructured data...

Also, thanks Michael for the CSVLine class... I've already got another
project I'm adapting it for use in (with credit in my source code). And
thanks to Jim for the RegEx idea; that's a very interesting approach to
reading flat files...

FYI, RobinS's tip was my original fallback idea. I didn't want to go
that route due to time and potential for bugs, but it would also work
fine, and is definitely suited for using a typed datatable.
 
Dave,

Ironically, I almost
figured it out this way long before I posted here, but it simply has
too many quirks that have to be handled before it works correctly.

I could see that from the answers you gave. They were not aimed at
building a better approach, but at defending your own invention.

I think we could give plenty of advice on that; however, I am sure that
is not what you want.

Cor
 
¤ That's what I was worried about. Is there any way to select out the
¤ columns I need from both into some intermediate set of structures, and
¤ then use some sort of union query on those objects to get a single
¤ DataTable? I'm guessing no, since you can't even delete columns from a
¤ DataTable.

It looks like you've solved your problem, but with respect to the UNION query it may be possible to
combine the data if you only select the columns you need, or create column ALIAS names, so that the
structure of each query maps properly. I haven't tried it, but it's worth a shot and may require less
work than other methods.
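
Something along these lines, purely as a sketch (the column and file names
are made up for illustration):

// If the second file names a column differently, alias it so both halves
// of the UNION line up on the same column names
String sql =
    "SELECT TimeStamp, AcquisitionNumber FROM file1.tab " +
    "UNION ALL " +
    "SELECT \"Time Stamp\" AS TimeStamp, AcqNumber AS AcquisitionNumber FROM file2.tab";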


Paul
~~~~
Microsoft MVP (Visual Basic)
 