Regex and screen scraping

  • Thread starter: Chris Wertman

Chris Wertman

Hello all,

Well, I have to say I'm getting excited about my app; it's almost there.
I have added a button to IE and am calling the current instance of IE
and grabbing the URL out just fine. I'm using the WebClient to grab the
HTML; so far so good, and I'm only half bald.

Now I am at the point where I need to extract a couple of fields from
the HTML itself. I have read about using regex to do this but am a
little confused; maybe I've just been staring at the screen too long.

I get this HTML returned.

<b>Binding:</b> Paperback<br> <b>Publisher:</b>

What I need to extract is the word Paperback from the above string.

Here is what I have so far; I have no idea if it's right:

Dim regex As New Regex("<b>Binding:</b>((.|\n)*?)<br> <b>Publisher:", _
    RegexOptions.IgnoreCase)

But, uh, now what do I do with that to return just the word
Paperback?

I have several items on the same page that need to be returned. I am a
little lost about how I need to read it in: do I need to put
it into a StreamReader or... well, what do I do with it then?

Chris
 
Hi Chris,

If you want to do it the Document Object Model way, you can use mshtml.
You have to set a reference to it using

Project -> Add Reference -> .NET -> Microsoft.mshtml

Do not add an Imports statement for it, because that freezes your IDE;
instead, write out the mshtml namespace every time you need it.
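
If you would rather stay with the Regex approach from the question, here is a minimal sketch of how the captured group might be read back. The ExtractField helper and the ScrapeSketch module are made-up names for illustration only, and the pattern assumes each value runs from the bold label up to the next <br>, as in the sample HTML in the question.

```vbnet
Imports System
Imports System.Text.RegularExpressions

Module ScrapeSketch

    ' Hypothetical helper (not from the original post): returns the text
    ' between "<b>label:</b>" and the next "<br>", or Nothing if not found.
    Function ExtractField(ByVal html As String, ByVal label As String) As String
        ' Group 1 captures the value; .*? is lazy so it stops at the first <br>.
        Dim pattern As String = "<b>" & Regex.Escape(label) & ":</b>\s*(.*?)\s*<br>"
        ' Singleline lets "." match newlines, replacing the (.|\n) workaround.
        Dim options As RegexOptions = RegexOptions.IgnoreCase Or RegexOptions.Singleline
        Dim m As Match = Regex.Match(html, pattern, options)
        If m.Success Then
            Return m.Groups(1).Value
        End If
        Return Nothing
    End Function

    Sub Main()
        ' Sample string taken from the question.
        Dim html As String = "<b>Binding:</b> Paperback<br> <b>Publisher:</b>"
        Console.WriteLine(ExtractField(html, "Binding"))
    End Sub

End Module
```

Calling the helper once per label is one way to pick up the several fields on the page. If the HTML is already in a String, it can be passed in directly, so a StreamReader should not be needed for the matching step.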

However, did you know that the newsgroup

microsoft.public.dotnet.languages.vb is much better suited to this kind of
question?

I hope this helps.

Cor
 