VBA Read-Manipulate-Save Text File

  • Thread starter Thread starter Salmec
  • Start date Start date
S

Salmec

Hi to all,
i've a text file with more than 6.000.000 rows, i've to read the file, manipulate some rows and save the new file.

I'm doing in this way but the elaboration time is about 4 minutes:
-First I open the file for input and i count the Rows
- I close the file and open another time reading each line and put it in an String array
- While I read I manipulate the data
- Open another file in Output mode and Write all the array in the file.

Does exist an efficient mode to do what i would like to do?

Thanks in advance

Salmec
 
No need to open the file more than once! You could *dump* the file
contents into an array and work the array for the rows you need to
manipulate, then *dump* the array back into a new file. If your
manipulation adds/removes rows then you'll have to use a temp array (or
string) for the output text.

If the file has a header row in the 1st line then the number of lines =
UBound(vaText). Otherwise, the number of lines is UBound(vaText)+1. I
suggest vaText be dimmed a variant data type, not dimmed an array...

Dim vaText As Variant
vaText = Split(Read_TextFromFile(sFilename), vbCrLf)

...and when ready to put back...

Write_TextToFile Join(vaText, vbCrLf), sNewFilename

...where Read_TextFromFile is a function that takes a filename and
returns a string, and Write_TextToFile is a sub that takes the text and
the filename.

Depending on available resources, this more efficient approach should
be quite fast.

--
Garry

Free usenet access at http://www.eternal-september.org
Classic VB Users Regroup!
comp.lang.basic.visual.misc
microsoft.public.vb.general.discussion
 
GS said:
No need to open the file more than once! You could *dump* the file
contents into an array and work the array for the rows you need to
manipulate, then *dump* the array back into a new file. If your
manipulation adds/removes rows then you'll have to use a temp array (or
string) for the output text.

If the file has a header row in the 1st line then the number of lines =
UBound(vaText). Otherwise, the number of lines is UBound(vaText)+1. I
suggest vaText be dimmed a variant data type, not dimmed an array...

Dim vaText As Variant
vaText = Split(Read_TextFromFile(sFilename), vbCrLf)

..and when ready to put back...

Write_TextToFile Join(vaText, vbCrLf), sNewFilename

..where Read_TextFromFile is a function that takes a filename and
returns a string, and Write_TextToFile is a sub that takes the text and
the filename.

Depending on available resources, this more efficient approach should
be quite fast.

Yes, "depending on available resources" is the operative phrase here. On my
admittedly-older workstation (1GHz Athlon64, ~700MB RAM), Split() made
Excel "run off into the bushes and not come back" (bonus points for anyone
who recognizes the quote), with the added bonus that you can't break in the
middle of a Split() operation.

Depending on Salmec's specific situation, it might very well be worth it to
just skip the "Split" step and work directly on the string itself. Perhaps
you (or someone else) can improve upon my code.

Salmec, I'm assuming this is part of what you needed help with in the other
group (microsoft.public.office.developer.vba, if anyone else is interested)
so I'll just roll them together. This makes the code somewhat more complex
than before but that can't be helped.

Public Sub Parsedxf()
'***IMPORTANT!*** If there are leading or trailing spaces,
'put them inside the quotes or you won't get *any* matches.
Const LOOKFOR As String = "HATCH" & vbCrLf
'Ditto this one (sample data had 1 leading space):
Const DATAMATCH As String = vbCrLf & " 62" & vbCrLf

LOOKFORLEN = Len(LOOKFOR)
DATAMATCHLEN = Len(DATAMATCH)

Dim vaText As Variant, stmp As String
Dim sFileName As String
Dim iFileNum As Integer
Dim sBuf As String
'Don't know what the below are for;
'not going to delete them, but they serve no purpose here.
'Dim Fields As String
'Dim TempStr As String
'Dim strGenerator(0 To 3) As String
'Dim i As Integer
'i = 0

sFileName = "E:\Batch\parse.txt"
'Does the file exist?
If Len(Dir$(sFileName)) Then
iFileNum = FreeFile
Open sFileName For Binary As iFileNum
sBuf = Space$(LOF(iFileNum))
Get #iFileNum, 1, sBuf
Close iFileNum
Do 'This is where the actual work is done. Here there be tigers!
where = InStr(where + 1, sBuf, LOOKFOR)
If where Then
vatmp1 = InStr(where + LOOKFORLEN, sBuf, vbCrLf)
If vatmp1 Then
stmp = Mid$(sBuf, where + LOOKFORLEN, _
vatmp1 - where - LOOKFORLEN)
Select Case Trim$(stmp)
Case "5"
vatmp2 = InStr(vatmp1 + 2, sBuf, vbCrLf)
If vatmp2 Then
stmp = Mid$(sBuf, vatmp1 + 2, vatmp2 - vatmp1 - 2)
If Trim$(stmp) = "A7123" Then
vatmp1 = InStr(vatmp2, sBuf, DATAMATCH)
If vatmp1 Then
vatmp2 = InStr(vatmp1 + DATAMATCHLEN, sBuf, vbCrLf)
'Time to extract the data...
If vatmp2 Then
stmp = Mid$(sBuf, vatmp1 + DATAMATCHLEN, _
vatmp2 - vatmp1 - DATAMATCHLEN)
Else
stmp = Mid$(sBuf, vatmp1 + DATAMATCHLEN)
End If
ltmp = Len(stmp) 'Save this; important later.
'At this point stmp should have the data you want;
'manipulate it here. Minor example:
stmp = stmp + 1
'Then put it back into the extracted data:
Mid$(sBuf, vatmp1 + DATAMATCHLEN, ltmp) = _
Space$(ltmp - Len(stmp)) & stmp
Else
'No more instances of 62? No need to keep looking.
Exit Do
End If
End If
End If
End Select
End If
Else
Exit Do
End If
Loop
iFileNum = FreeFile
Open "outputfile" For Binary As iFileNum
Put #iFileNum, 1, sBuf
Close iFileNum
End If
End Sub

On my workstation (the above-mentioned Athlon64), this takes about 3
seconds to run, with your appropriate data being found every 36 lines.
(This will be affected by how often "HATCH" is in the actual data, how
often the rest of the data matches what you're looking for, and what kinds
of data manipulation you actually do.)
 
"Yes, "depending on available resources" is the operative phrase here.
On my admittedly-older workstation (1GHz Athlon64, ~700MB RAM), Split()
made Excel "run off into the bushes and not come back" (bonus points
for anyone who recognizes the quote), with the added bonus that you
can't break in the middle of a Split() operation."


In cases of low memory resources it's better to manage large data in
blocks. Fact is, the data is pulled from the file in a string. This can
be done in one shot or in blocks. Same goes for writing back to file.
In any case, the key element in this OP's scenario is the content of
each line. IMO, this is better handled using an array and looking for
matches in each element. That obviates having to find the next
linebreak via additional InStr() processes.


==============================
"On my workstation (the above-mentioned Athlon64), this takes about 3
seconds to run, with your appropriate data being found every 36 lines.
(This will be affected by how often "HATCH" is in the actual data, how
often the rest of the data matches what you're looking for, and what
kinds of data manipulation you actually do.)"


Well.., that's got to be a really old machine! Most systems run 1gb or
more RAM for over a decade now and so I suspect any business using
computers today has the resources available. My point was how much of
the resources is being used by other processes. If PageFile comes into
play then the process will take forever. The OP states there's 6m lines
of text and so I'd recommend processing in blocks of 50k to 100k lines
at a time *IF* the match text is in the string *before* dumping it into
an array for processing. If it's not there then grab the next block and
so on. You definitely don't want to loop the array if the match text
isn't even in the string, right?<g>

--
Garry

Free usenet access at http://www.eternal-september.org
Classic VB Users Regroup!
comp.lang.basic.visual.misc
microsoft.public.vb.general.discussion
 
Il giorno sabato 26 gennaio 2013 03:01:19 UTC+1, Auric__ ha scritto:
GS wrote:
[CUT...]

Thanks to all,
Expecially to Auric__, the solution proposed work perfectly and it is very fast, some second for more than 6.000.000 Rows.

Now i can edit a dxf file (CAD) directly from Excel-VBA.

King Regards

Salmec
 
Salmec said:
Il giorno sabato 26 gennaio 2013 03:01:19 UTC+1, Auric__ ha scritto:
GS wrote:
[CUT...]

Thanks to all,
Expecially to Auric__, the solution proposed work perfectly and it is
very fast, some second for more than 6.000.000 Rows.

Now i can edit a dxf file (CAD) directly from Excel-VBA.

My pleasure, but do pay attention to what Garry (GS) said; he has some valid
points.
 
Back
Top