Interactive web page archiving app: need guidance

Hi,

I need some guidance regarding the best way to build an app for archiving
web page content via an interactive browse session. I have a fair bit of
experience building ASP.NET apps using VS.NET 2005/2008 and almost no
experience building Windows forms apps, which is what this may end up being.

I have built many ASP.NET apps that scrape remote web page content and save
it as web pages, MHT files, or PDFs. The common weakness of all of them is
that they are limited wherever the content depends on a login, cookies, or
state.

What I would like to build would be an intranet app something like the
following:

1) The user would have what appears to be a web browser session.
2) The cookie store specific to the local machine would be accessible to the
app.
3) The user would browse around, logging in as necessary. When they saw a
web page they wanted to keep, they would press a button marked "SAVE THIS
PAGE."
4) The application would save the page to a network share and record some
information in our database.

I believe there is a Windows Forms control that will allow items #1, 2, and
3. Is that correct? Can someone suggest a starting point?

ALTERNATELY, I could imagine an ASP.NET application designed as follows:

1) Some control would emulate a web browser session. The user would interact
with that, logging in as necessary.
2) Somehow the app would intercept and save the actual byte stream being
sent to the "browser"
3) Somehow the app would reassemble that byte stream into a collection of
files that could be saved.

As a learning thing I'd be really interested to know if this alternate
option can be done. As a get-the-work-done thing, I would appreciate any
guidance on the first scenario I described.

Thank you,
-KF
 
Thanks Peter. Does that control assume the "identity" of the currently
logged in user and plug into the IE cookie store for that user?
 
I've looked into this at greater length. From browsing around the net, it
looks like it is not going to be so simple a problem as you might imagine.
For what are probably valid security reasons, it is not easy to coax the
WebBrowser control into saving the page without user input. There are a lot
of schemes for grabbing the HTML/textual content of the page, but I need the
assets as well: CSS and image files.
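
To illustrate the "assets as well" part, here is a minimal, language-neutral
sketch in Python (not .NET, and not taken from any library mentioned in this
thread). It assumes the page's HTML text has already been obtained and only
shows how the image, stylesheet, and script URLs might be collected from it.
The names AssetCollector and find_assets are made up for this example, and
assets referenced from inside CSS files are not covered.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class AssetCollector(HTMLParser):
    """Collects absolute URLs of images, scripts, and stylesheets."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.assets = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        ref = None
        if tag in ("img", "script"):
            ref = attrs.get("src")
        elif tag == "link" and "stylesheet" in (attrs.get("rel") or "").lower():
            ref = attrs.get("href")
        if ref:
            # Resolve relative references against the page's own URL.
            url = urljoin(self.base_url, ref)
            if url not in self.assets:
                self.assets.append(url)


def find_assets(html, base_url):
    """Return the asset URLs referenced by html, in document order."""
    collector = AssetCollector(base_url)
    collector.feed(html)
    collector.close()
    return collector.assets
```

Each URL returned would still have to be downloaded with the same cookies as
the interactive session, which is the hard part described above.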

For my purposes, it would probably be sufficient if the "File-->Save As"
prompt that was raised could be prepopulated with a filename and path that
was generated programmatically; the user could simply hit "Return" and be
done. However, it would be best if this could be completely freed of the
need for interaction by the user.
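
Generating the filename and path programmatically could look roughly like the
Python sketch below. The share name, the timestamp-plus-slug naming scheme,
and the function name archive_path are all assumptions made for illustration;
nothing here is specific to the WebBrowser control or its Save As dialog.

```python
import re
from datetime import datetime
from pathlib import PureWindowsPath
from urllib.parse import urlsplit


def archive_path(share_root, url, saved_at, extension=".mht"):
    """Build a destination path on a Windows share for an archived page.

    The query string is ignored; the timestamp keeps repeated saves of
    the same page from overwriting each other.
    """
    parts = urlsplit(url)
    raw = (parts.hostname or "unknown-host") + parts.path
    # Collapse anything that is not a letter or digit into a single dash.
    slug = re.sub(r"[^A-Za-z0-9]+", "-", raw).strip("-").lower()[:80]
    stamp = saved_at.strftime("%Y%m%d-%H%M%S")
    return PureWindowsPath(share_root) / f"{stamp}_{slug}{extension}"


if __name__ == "__main__":
    # Hypothetical share and URL, for demonstration only.
    print(archive_path(r"\\fileserver\archive",
                       "http://www.example.com/news/story.aspx?id=7",
                       datetime.now()))
```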
 
Hi KF,

You may want to check out the following resource:

Harvesting Web Content into MHTML Archive - The Code Project - C# Libraries
http://www.codeproject.com/cs/library/mhtmllib.asp
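
For readers unfamiliar with the format: an MHTML (.mht) archive is a MIME
multipart/related message in which the page and each asset are separate parts
identified by a Content-Location header. The Python sketch below only
illustrates that structure using the standard email package. It is not the
API of the library linked above, build_mhtml is a made-up name, and whether a
given browser opens the result has not been verified here.

```python
from email import encoders
from email.mime.base import MIMEBase
from email.mime.multipart import MIMEMultipart
from email.mime.text import MIMEText


def build_mhtml(page_url, html, assets):
    """Pack a page and its assets into one multipart/related message.

    assets is a list of (url, content_type, data_bytes) tuples that the
    caller has already downloaded.
    """
    archive = MIMEMultipart("related", type="text/html")

    # The first part is the page itself.
    root = MIMEText(html, "html", "utf-8")
    root["Content-Location"] = page_url
    archive.attach(root)

    # Each asset becomes its own base64-encoded part.
    for url, content_type, data in assets:
        maintype, subtype = content_type.split("/", 1)
        part = MIMEBase(maintype, subtype)
        part.set_payload(data)
        encoders.encode_base64(part)
        part["Content-Location"] = url
        archive.attach(part)

    return archive.as_bytes()
```

The returned bytes could then be written to a file with an .mht extension.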


Regards,
Walter Wang ([email protected], remove 'online.')
Microsoft Online Community Support

==================================================
When responding to posts, please "Reply to Group" via your newsreader so
that others may learn and benefit from your issue.
==================================================

This posting is provided "AS IS" with no warranties, and confers no rights.
 
Hi Ken,

I'm writing to check the status of this post. Please feel free to let me
know if there's anything else I can help with. Thanks.



Regards,
Walter Wang ([email protected], remove 'online.')
Microsoft Online Community Support
 