How to parse a website in memory

  • Thread starter: Lost

Lost

I have a program that scrapes a website. The site displays
ever-changing numbers in the form of a table. My program constantly checks
the site to gather the new numbers and put them into an array for
processing.

The page of HTML that is received does not have carriage returns or
linefeed characters at the end of the relevant lines, but each
relevant line ends with the characters "n.l".

At the moment I process this by saving the HTML to disk as follows:

FileStr = "TempFile.tmp"
File2Str = "TempFile2.tmp"

' Download the page and write the whole response to the first temp file.
ToF = FreeFile()
FileOpen(ToF, FileStr, OpenMode.Output)
Dim rsp As Net.HttpWebResponse = CType(req.GetResponse(), Net.HttpWebResponse)
Dim strm As IO.Stream = rsp.GetResponseStream()
Dim reader As New IO.StreamReader(strm)
PrintLine(ToF, reader.ReadToEnd())
reader.Close()
rsp.Close()
FileClose(ToF) 'HTML saved


The file is then read in, broken into lines and saved to another file:

FromF = FreeFile()
FileOpen(FromF, FileStr, OpenMode.Input)
ToF = FreeFile()
FileOpen(ToF, File2Str, OpenMode.Output)
Do While Not EOF(FromF)
    LineStr = LineInput(FromF)
    ' Cut the text at every "n.l" marker (case-insensitive) and
    ' write each piece to the second file as its own line.
    x = InStr(LCase(LineStr), "n.l")
    Do While x > 0
        PrintLine(ToF, LeftStr(LineStr, x - 1))
        LineStr = Mid(LineStr, x + 3)   ' skip the 3-character marker
        x = InStr(LCase(LineStr), "n.l")
    Loop
    PrintLine(ToF, LineStr)   ' whatever follows the last marker
Loop
FileClose(FromF)
FileClose(ToF)

The second file now contains the data and is read in one line at a
time and put into the array.

The routine works but it's slower than necessary because of the disk
accesses. Can someone please show me how to process the data while
it's still in memory? I'm using VB 2005 Express.
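
For reference, the in-memory version of the steps above might look roughly like the following. It is only a sketch, assuming the whole page fits comfortably in one string; SplitOnMarker is a made-up helper name, and the sample text in Main is invented to stand in for what reader.ReadToEnd() returns. It uses only calls available in .NET 2.0 (VB 2005).

```vbnet
Imports System
Imports System.Collections.Generic

Module InMemoryParse
    ' Hypothetical helper: cuts the page text at every occurrence of the
    ' marker, ignoring case, like the InStr(LCase(...)) loop in the question.
    Function SplitOnMarker(ByVal html As String, ByVal marker As String) As String()
        Dim parts As New List(Of String)
        Dim startPos As Integer = 0
        Dim hit As Integer = html.IndexOf(marker, startPos, StringComparison.OrdinalIgnoreCase)
        Do While hit >= 0
            parts.Add(html.Substring(startPos, hit - startPos))
            startPos = hit + marker.Length   ' skip past the marker
            hit = html.IndexOf(marker, startPos, StringComparison.OrdinalIgnoreCase)
        Loop
        parts.Add(html.Substring(startPos))  ' text after the last marker
        Return parts.ToArray()
    End Function

    Sub Main()
        ' Invented sample; in the real program this would be reader.ReadToEnd().
        Dim html As String = "<tr>1.25n.l<tr>3.50N.L<tr>7.00"
        Dim lines() As String = SplitOnMarker(html, "n.l")
        For Each line As String In lines
            Console.WriteLine(line)
        Next
        ' Prints:
        ' <tr>1.25
        ' <tr>3.50
        ' <tr>7.00
    End Sub
End Module
```

In the original routine this would mean assigning reader.ReadToEnd() to a String instead of passing it to PrintLine, then looping over the returned array in place of reading the second file. One difference to be aware of: the file-based version also breaks the text at any real line endings in the page, while this sketch splits on the marker only.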
 
Thank you both for your input. I've used a combination of your ideas,
employing a MemoryStream to feed the lines into a string for parsing.

The original code didn't need much adjusting, but I didn't get the
hoped-for speed increase over using the disk drive. That's a pity
because the updates need to happen twice per second. Oh, well.
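
One way to see where the time actually goes is to time the download and the parsing separately. The sketch below is only an illustration: the URL is a placeholder, and String.Split is used here as a simple stand-in for the parsing step (unlike the original loop, it is case-sensitive). If the download figure dominates, removing the disk step would not be expected to change much.

```vbnet
Imports System
Imports System.Diagnostics
Imports System.IO
Imports System.Net

Module TimeTheStages
    Sub Main()
        ' Placeholder URL: substitute the page being scraped.
        Dim req As HttpWebRequest = CType(WebRequest.Create("http://example.com/"), HttpWebRequest)

        ' Stage 1: fetch the page into a string.
        Dim sw As Stopwatch = Stopwatch.StartNew()
        Dim html As String = ""
        Using rsp As WebResponse = req.GetResponse()
            Using reader As New StreamReader(rsp.GetResponseStream())
                html = reader.ReadToEnd()
            End Using
        End Using
        Dim downloadMs As Long = sw.ElapsedMilliseconds

        ' Stage 2: split the string on the "n.l" marker, entirely in memory.
        sw.Reset()
        sw.Start()
        Dim lines() As String = html.Split(New String() {"n.l"}, StringSplitOptions.None)
        Dim parseMs As Long = sw.ElapsedMilliseconds

        Console.WriteLine("download: {0} ms, split: {1} ms, {2} lines", _
                          downloadMs, parseMs, lines.Length)
    End Sub
End Module
```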

Thanks again.
 