how to spider web page with button and hyperlink

  • Thread starter: Charts

Charts

I have been writing a C# program that spiders the Yellow Pages to collect a
list of restaurant names and addresses into a database. When I encounter a
button or hyperlink, I don't know how to make the program click it. Does
anyone have sample code for this in either C# or VB.NET?
Thanks,
Charts
 
Hi Charts,

From your description, you're writing a custom web page spider and are
wondering how to deal with the buttons and hyperlinks that appear on a page,
correct?

As I understand it, a web spider just retrieves the HTML content of web
pages and parses the elements in it. How you handle button and hyperlink
elements depends on the following:

1. A hyperlink is just a link that points to another resource. However you
are fetching the main page (with WebRequest?), you can retrieve the "href"
attribute from the hyperlink and use WebRequest (sequentially, or started on
a new thread) to visit the linked page.

2. A button is more complex, because it depends on what the button does. If
it just submits the page, you need to check the <form> tag's "action" URL
and use WebRequest to request the resource named in the "action" attribute.
If it just performs a postback (to the same page), as in ASP.NET, I don't
think you need to do any additional work. Also, a button's click may depend
on other input fields on the page, so it is not really possible to cover
every kind of page action logic in spider code.
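
As a rough sketch of the two cases above (one possible approach, not the only
one): following a hyperlink is a GET on the resolved "href", and clicking a
plain submit button is, in the simple case, a POST of the form's fields to the
resolved "action" URL. Note that this sketch uses HttpClient rather than the
WebRequest class mentioned above, and that the names SpiderSketch, Resolve,
FollowLinkAsync and SubmitFormAsync are made up for illustration. It assumes
the form uses method="post" and that the field names and values have already
been parsed out of the page.

```csharp
using System;
using System.Collections.Generic;
using System.Net.Http;
using System.Threading.Tasks;

// Hypothetical helper class, for illustration only.
public static class SpiderSketch
{
    private static readonly HttpClient Client = new HttpClient();

    // Resolve a (possibly relative) href or form action against the page URL.
    public static Uri Resolve(string pageUrl, string target)
    {
        return new Uri(new Uri(pageUrl), target);
    }

    // "Click" a hyperlink: request the page that the href points to.
    public static Task<string> FollowLinkAsync(string pageUrl, string href)
    {
        return Client.GetStringAsync(Resolve(pageUrl, href));
    }

    // "Click" a submit button: POST the form fields to the form's action URL.
    // Assumes method="post"; fields holds the input names and values.
    public static async Task<string> SubmitFormAsync(
        string pageUrl, string action, IDictionary<string, string> fields)
    {
        using (var body = new FormUrlEncodedContent(fields))
        using (var response = await Client.PostAsync(Resolve(pageUrl, action), body))
        {
            response.EnsureSuccessStatusCode();
            return await response.Content.ReadAsStringAsync();
        }
    }
}
```

For a form with method="get", the fields would go in the query string of the
action URL instead of the request body. Pages that build their requests in
script are not covered by this approach.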

BTW, what component are you using to parse the HTML content? I've used the
Html Agility Pack, which is a pure .NET library, and found it quite useful:

#Html Agility Pack
http://www.codeplex.com/htmlagilitypack

Here are some other good technical articles about writing a custom web spider:

#MyDownloader: A Multi-thread C# Segmented Download Manager
http://www.codeproject.com/KB/IP/MyDownloader.aspx?fid=475780&df=90&mpp=25&noise=3&sort=Position&view=Quick&fr=51

#A Web Spider Library in C#
http://www.codeproject.com/KB/aspnet/ZetaWebSpider.aspx

Sincerely,

Steven Cheng

Microsoft MSDN Online Support Lead


Delighting our customers is our #1 priority. We welcome your comments and
suggestions about how we can improve the support we provide to you. Please
feel free to let my manager know what you think of the level of service
provided. You can send feedback directly to my manager at:
(e-mail address removed).

==================================================
Get notification to my posts through email? Please refer to
http://msdn.microsoft.com/subscriptions/managednewsgroups/default.aspx#notifications.

Note: The MSDN Managed Newsgroup support offering is for non-urgent issues
where an initial response from the community or a Microsoft Support
Engineer within 1 business day is acceptable. Please note that each follow
up response may take approximately 2 business days as the support
professional working with you may need further investigation to reach the
most efficient resolution. The offering is not appropriate for situations
that require urgent, real-time or phone-based interactions or complex
project analysis and dump analysis issues. Issues of this nature are best
handled working with a dedicated Microsoft Support Engineer by contacting
Microsoft Customer Support Services (CSS) at
http://msdn.microsoft.com/subscriptions/support/default.aspx.
==================================================
This posting is provided "AS IS" with no warranties, and confers no rights.



--------------------
 
Steven,
Your post is a great help. I'll follow up and let you know. Thanks so much.
Charts
 
Thanks for your reply Charts.

I'm glad that the information helps. If you have any further questions on
this later, please feel free to let me know.

Sincerely,

Steven Cheng

Microsoft MSDN Online Support Lead



--------------------
From: Charts
Subject: RE: how to spider web page with button and hyperlink
Date: Thu, 26 Jun 2008 08:03:01 -0700
 