Advanced? Parsing Methods

  • Thread starter Thread starter MikeC
  • Start date Start date
M

MikeC

I have developed a fully functioning sub that parses data from a CSV file
and writes new records to a table in a SQL Server 2000 database.

I am currently in the process of optimizing my code and was wondering if
there is a more efficient/advanced way to parse the data before I pass it to
my SQL string variable, called strSQL.

Would it be faster to use the Split() function to parse each line (using a
comma delimiter) and pass the parsed data to an array? Some other approach?
If so, what would be the most efficient way to do this?

Below is a code *fragment* from my sub. The CSV file has already been
opened for input. An ADO connection has also been opened. All variables
are explicitly declared. Data types are indicated in the variable names.

'Parse strSingleLine at each comma starting at intPosition 1 and moving
'to each new intPosition across each record.
strSearch = ","

'Loop until end of CSV file.
Do While Not EOF(intFileNo)
'Read line into variable. Automatically inputs next line with each
loop.
Line Input #intFileNo, strSingleLine

'Reset intPosition = 1 when starting each record.
intPosition = 1
'Set variable value.
strISOCode = Mid(strSingleLine, intPosition, _
(InStr(intPosition, strSingleLine, strSearch) -
intPosition))
'Increment intPosition.
intPosition = InStr(intPosition, strSingleLine, strSearch) + 1
'Set variable value. This variable is not used.
strCurrName = Mid(strSingleLine, intPosition, _
(InStr(intPosition, strSingleLine, strSearch) -
intPosition))
'Increment intPosition.
intPosition = InStr(intPosition, strSingleLine, strSearch) + 1
'Set variable value.
dblUSDPerUnit = CDbl(Mid(strSingleLine, intPosition, _
(InStr(intPosition, strSingleLine, strSearch) -
intPosition)))
'Increment intPosition. This variable is not currently used, but
may be later.
intPosition = InStr(intPosition, strSingleLine, strSearch) + 1
'Set variable value.
dblUnitsPerUSD = CDbl(Mid(strSingleLine, intPosition, _
(InStr(intPosition, strSingleLine, strSearch) -
intPosition)))
'Increment intPosition, but do *not* add 1 since 5 columns are
delimited by 4 commas.
intPosition = InStr(intPosition, strSingleLine, strSearch)
'Set variable value.
dtQuoteDate = CDate(Right(strSingleLine, (Len(strSingleLine) -
intPosition)))

strSQL = "INSERT INTO [w$fxrateload] (batchnumber, from_currency,
to_currency," _
& " Rate, DateValue, from_curr_id, to_curr_id, dateid,"
_
& " executionstamp, Status)" _
& " VALUES (" & intNextBatchNbr & ", '" & strISOCode &
"'" _
& ", 'USD', " & dblUnitsPerUSD & ", '" & dtQuoteDate &
"', " & 0 _
& ", " & 0 & ", " & 0 & ", '" & Now() & "', " & 0 & ")"
cnn1.Execute strSQL
Loop
 
I have developed a fully functioning sub that parses data from a CSV file
and writes new records to a table in a SQL Server 2000 database.

I am currently in the process of optimizing my code and was wondering if
there is a more efficient/advanced way to parse the data before I pass it
to my SQL string variable, called strSQL.

The above is a good question.Fact is we often deal with comma data, space
data, or all kinds of delimiters. Every developer over time will likely
build up a nice library of code. Prior to a2000, I wrote a function called

strDfield("text", "delimiter", which delimiter)

of course, you can do the same thing with split() command

So, to pass two values to a form via open args, the forms on-load event can
parse out the two values like:


parms1 = split(me.OpenArgs,"~")(0)
parms2 = split(me.OpenArgs,"~")(1)

Of couse, the above me.Open args would be some text like
"#11/11/2004#~StartDate"
So, parms1 would be a date, and the parms2 would be the field name for
example.
Would it be faster to use the Split() function to parse each line (using a
comma delimiter) and pass the parsed data to an array? Some other
approach? If so, what would be the most efficient way to do this?

I don't know if you have a performance problem here do you? Certainly
building up your own spilt routines in place of whole bunch of mids, and
instrs will make your life a zillion times easer. So, I am certainly with
you in terms of the need for developers to have a nice grab bag of routines
that parse out strings..since this is common requirement.

So, are you looking to make the coding job easer, or are you looking to
increase the performance of what you got now working?

If you are looking at a performance issued, then yes..split is going to be
quite fast....

So, while you can use:

strSomevVale = split("one,two,three,four")(1)

The above would thus results in strSomeValue being now = two

So, for handy dandy reductions of code, the split is great, and you don't
have to put the results into an array.

However, since you need to repeat "hit" the string and pull out values, then
no need to run the split over and over again..is there?

Also, looking at what you got...you likely could have used transfer text to
a temp table..and then transfer that to the server....

As for performance increase?, I would use a recordset in place of executing
inserts each time...

would certainly go:

dim vBuf as varient
dim rstTable as new ADODB.Recordset

rstTable.Open "w$fxrateload", cnn1

vBuf = split(strSingeLine,",")

You now have a variant array full of the delimited data

rstTable.AddNew
rstTable!BatchNumber = v(0)
rstTable!from_currentcy = v(1)
rstTable!to_currency = v(2)
rstTable!exectuionstamp = now()
...etc. etc. etc.
rstTable.Update

You can see the above recordset is MUCH cleaner then the sql stamntet.
Futher, it will run about 100 times faster also....
 
Thanks Albert.

See MC> below.


Albert D. Kallal said:
The above is a good question.Fact is we often deal with comma data, space
data, or all kinds of delimiters. Every developer over time will likely
build up a nice library of code. Prior to a2000, I wrote a function called

strDfield("text", "delimiter", which delimiter)

of course, you can do the same thing with split() command

So, to pass two values to a form via open args, the forms on-load event
can parse out the two values like:


parms1 = split(me.OpenArgs,"~")(0)
parms2 = split(me.OpenArgs,"~")(1)

Of couse, the above me.Open args would be some text like
"#11/11/2004#~StartDate"
So, parms1 would be a date, and the parms2 would be the field name for
example.


I don't know if you have a performance problem here do you? Certainly
building up your own spilt routines in place of whole bunch of mids, and
instrs will make your life a zillion times easer. So, I am certainly with
you in terms of the need for developers to have a nice grab bag of
routines that parse out strings..since this is common requirement.

So, are you looking to make the coding job easer, or are you looking to
increase the performance of what you got now working?
MC> I do not currently have a performance issue, but do plan to add this
code to my personal code library, so I actually want to make coding easier
and also boost the performance wherever possible. ...and to my good
fortune, you seem to have addressed both.
If you are looking at a performance issued, then yes..split is going to be
quite fast....

So, while you can use:

strSomevVale = split("one,two,three,four")(1)

The above would thus results in strSomeValue being now = two

MC> I believe the Split function will default to a space (" ") delimiter if
a delimiter is not specified, so the above should probably read as:

strSomevVale = split("one,two,three,four", ",")(1)

MC> Please let me know if I missed something. Also, I am pleasantly
surprised that the Split function can be followed by an column number, so
that I can select a specific item within the array. That makes sense since
the function returns an array.
So, for handy dandy reductions of code, the split is great, and you don't
have to put the results into an array.

However, since you need to repeat "hit" the string and pull out values,
then no need to run the split over and over again..is there?

MC> No. The top of the file contained junk, so I had to start at a certain
byte position. I just need to execute the Split once per loop. This way, I
can parse each line of the CSV file one at a time. Using this alternative,
I will no longer need to track the byte position!!! :-D
Also, looking at what you got...you likely could have used transfer text
to a temp table..and then transfer that to the server....

MC> That was the first thing I did and it worked great, but then my client
told me he needed me to port the code over to a stand alone VB executable
because he does not trust Access!!! I had to comply. :-(
As for performance increase?, I would use a recordset in place of
executing inserts each time...

would certainly go:

dim vBuf as varient
dim rstTable as new ADODB.Recordset

rstTable.Open "w$fxrateload", cnn1

vBuf = split(strSingeLine,",")

You now have a variant array full of the delimited data

rstTable.AddNew
rstTable!BatchNumber = v(0)
rstTable!from_currentcy = v(1)
rstTable!to_currency = v(2)
rstTable!exectuionstamp = now()
..etc. etc. etc.
rstTable.Update

You can see the above recordset is MUCH cleaner then the sql stamntet.
Futher, it will run about 100 times faster also....

MC> Excellent information. I'll implement the changes right now.
 
MikeC said:
Thanks Albert.

See MC> below.

You are most welcome. and, yes, you are 100% right on spotting my error in
the split example.....good stuff on your part!
MC> I believe the Split function will default to a space (" ") delimiter
if a delimiter is not specified, so the above should probably read as:

strSomevVale = split("one,two,three,four", ",")(1)

MC> Please let me know if I missed something. Also, I am pleasantly
surprised that the Split function can be followed by an column number, so
that I can select a specific item within the array. That makes sense
since the function returns an array.

no....you are right on the money....
 
Back
Top